The Problem Space

Most agent systems are built for demos, not real work.

Bigger models and longer context windows have moved the frontier. But they haven't solved the hard part: getting agents to stay coherent over time, across tools, side effects, failures, and changing state.

The current wave of agent systems is exciting for a reason. Models can search, write code, call tools, summarize documents, and complete tasks that were out of reach even a year ago.

But once you move past the first five minutes, the cracks show.

Real work is not a one-shot prompt. It unfolds over time. It crosses tools. It accumulates state. It produces artifacts. It fails halfway through. It picks up again later. It needs to remember what changed, not just what was said.

The problem is not that the models are useless. The problem is that the runtime around them is still immature.

01 — MEMORY

Long context is not the same thing as long-term memory.

Larger context windows help. But they do not automatically create durable memory. As tasks get longer, agents drown in their own transcripts. Important facts get buried under tool outputs, retries, and stale state.

Good memory is not just retrieval. It means deciding what to store, recognizing when old facts are superseded, separating working state from history, preserving provenance, and retrieving the right detail at the right time.

Transcript-based context
× Everything flattened into text
× Stale facts persist forever
× No provenance or lineage
× Context overload at scale
Structured working memory
✓ Active state separated from history
✓ Facts superseded explicitly
✓ Full provenance on every fact
✓ Selective context loading
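The right-hand column can be sketched in a few lines. This is an illustrative Python model, not a real system: the `WorkingMemory` class and its field names are assumptions, but it shows active state, explicit supersession, and provenance living side by side instead of being flattened into a transcript.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Fact:
    key: str                     # what the fact describes, e.g. "ticket.status"
    value: str
    source: str                  # provenance: which tool call produced it
    recorded_at: datetime
    superseded_by: Optional["Fact"] = None

class WorkingMemory:
    """Active state kept apart from history, with explicit supersession."""

    def __init__(self):
        self._active = {}        # latest fact per key: the working state
        self._history = []       # full lineage, never truncated

    def record(self, key, value, source):
        fact = Fact(key, value, source, datetime.now(timezone.utc))
        old = self._active.get(key)
        if old is not None:
            old.superseded_by = fact     # retired explicitly, not deleted
        self._active[key] = fact
        self._history.append(fact)
        return fact

    def active(self, key):
        return self._active.get(key)

    def lineage(self, key):
        return [f for f in self._history if f.key == key]

mem = WorkingMemory()
mem.record("ticket.status", "open", source="jira.get_issue#1")
mem.record("ticket.status", "resolved", source="jira.transition#2")
```

The stale fact is never lost, but it can never be mistaken for the current one either: context loading reads `active()`, auditing reads `lineage()`.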
02 — RUNTIME

Most failures are runtime failures, not intelligence failures.

When an agent breaks in production, it is rarely because the model suddenly became incapable. It is because the system around it did not manage execution well.

Databases use logs and recovery. Workflow engines use durable execution. HPC systems use checkpoint-and-restart. Agent systems are now running into the same class of problems — without the same class of solutions.

No checkpoints or resumable state
No failure recovery mid-task
No timeouts on tool calls
No artifact tracking
No idempotent tool behavior
No operator visibility into runs

A transcript is a poor operating system.
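What checkpoint-and-restart looks like at the agent layer can be sketched in one function. The `run_with_checkpoints` helper and the step names are hypothetical, but the mechanics are the same ones durable-execution engines use: persist progress after every step, and on restart skip anything already done.

```python
import json
import os
import tempfile

# Hypothetical sketch of checkpoint-and-restart for an agent run.
# Steps are (name, fn) pairs; each fn receives the results so far.

def run_with_checkpoints(steps, checkpoint_path):
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)             # resume prior progress

    for name, step in steps:
        if name in done:
            continue                        # completed in an earlier run
        done[name] = step(done)
        # Write atomically so a crash mid-write cannot corrupt the log.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(checkpoint_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)
    return done

ckpt = os.path.join(tempfile.mkdtemp(), "run.json")
steps = [
    ("fetch", lambda done: "raw data"),
    ("summarize", lambda done: done["fetch"].upper()),
]
state = run_with_checkpoints(steps, ckpt)
```

Kill the process between steps and the next invocation picks up where it stopped, because the source of truth is the checkpoint file, not the conversation.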

03 — STATE

When everything is flattened into text, the system loses shape.

Too many agent systems treat the chat transcript as the primary source of truth. That works for short interactions. It breaks down for serious work.

A transcript is good for communication. It is not good for structured working memory, reliable state transitions, artifact lineage, selective context loading, or recovery after interruption.

Agents need more than a conversation buffer. They need a real workspace.

× Structured working memory
× Reliable state transitions
× Artifact lineage & provenance
! Selective context loading
× Recovery after interruption
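"Reliable state transitions" has a concrete shape. A minimal sketch, with illustrative state names: legal transitions are enumerated up front, illegal ones fail loudly, and every change is recorded, instead of the run's status being inferred from transcript text.

```python
# Illustrative run states; a real workspace would have more.
ALLOWED = {
    "pending": {"running"},
    "running": {"waiting_approval", "failed", "done"},
    "waiting_approval": {"running", "failed"},
    "failed": {"running"},               # a failed run can be resumed
}

class Run:
    def __init__(self):
        self.state = "pending"
        self.transitions = []            # auditable record of every change

    def transition(self, new_state, reason):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.transitions.append((self.state, new_state, reason))
        self.state = new_state

run = Run()
run.transition("running", "operator started the task")
run.transition("failed", "tool call timed out")
run.transition("running", "resumed from checkpoint")
run.transition("done", "task completed")
```

Note that "done" has no outgoing edges: a finished run cannot silently restart, which is exactly the kind of guarantee a transcript cannot give.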
04 — SIDE EFFECTS

Tool use creates real side effects. Autonomy without structure is hidden risk.

As soon as an agent touches external systems, the stakes change. Editing code, triggering workflows, changing tickets, updating records — these are not reversible text completions.

The system needs stronger guarantees: what exactly happened, what changed, what can be retried, what must never be repeated, what needs approval, what can be rolled back.

The more capable agents become, the more important these controls get.

Risk without visibility: Critical
Retry safety on tool calls: Low
Rollback coverage: Minimal
Operator approval gates: Rare
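One of those guarantees, retry safety, can be sketched with an idempotency key. The `SideEffectGuard` class is hypothetical; the point is that a repeated call replays the recorded result instead of firing the side effect again.

```python
import hashlib
import json

class SideEffectGuard:
    """Replay recorded results for repeated calls instead of re-firing them."""

    def __init__(self):
        self._log = {}                   # idempotency key -> recorded result

    def call(self, tool, args):
        # Key the call by tool name plus canonicalized arguments.
        payload = json.dumps({"tool": tool.__name__, "args": args},
                             sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key in self._log:
            return self._log[key]        # replayed, side effect not repeated
        result = tool(**args)
        self._log[key] = result          # record exactly what happened
        return result

sent = []

def send_email(to, body):
    sent.append(to)                      # the non-reversible side effect
    return f"sent to {to}"

guard = SideEffectGuard()
first = guard.call(send_email, {"to": "ops@example.com", "body": "deploy done"})
second = guard.call(send_email, {"to": "ops@example.com", "body": "deploy done"})
```

A retry loop wrapped around `guard.call` can now be as aggressive as it likes: the email goes out once.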
05 — EVALS

Benchmarks are useful, but they can hide the real difficulty.

A lot of systems look strong on narrow evaluations and still fall apart in live environments. Benchmarks compress away the hardest parts of production: changing environments, partial failure, stale memory, messy tool outputs, long-horizon execution, and cost constraints.

Even strong benchmark gains can come from ensembling, repeated attempts, or benchmark-specific tuning rather than a more trustworthy runtime. Benchmarks are one layer of truth, not the whole picture.

What benchmarks test
Clean, static environments
Short-horizon tasks
Single-attempt scoring
Isolated tool calls
What production demands
Changing, noisy environments
Hours-long execution
Consistency across runs
Chained tools with side effects
06 — OBSERVABILITY

If you cannot inspect the run, you cannot improve the system.

Teams can often see the final answer, but not the execution story: how the task was decomposed, what tools were used, what state was carried forward, which artifact grounded the decision, where the failure actually occurred.

Without traces, artifacts, and evals, improvement becomes guesswork. And once improvement becomes guesswork, complexity compounds faster than capability.

× How the task was decomposed
× What tools were actually called
× What state was carried forward
! Which artifact grounded the decision
× Where the real failure occurred
! Skill vs. luck in the outcome
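The minimum an operator needs can be sketched as a structured trace. The `Tracer` class and event names are illustrative: every plan, tool call, and decision lands in a log that can be replayed after the run, instead of being reconstructed from the final answer.

```python
import json
import time

class Tracer:
    """Structured trace events per step, replayable after the run."""

    def __init__(self):
        self.events = []

    def emit(self, kind, **fields):
        self.events.append({"ts": time.time(), "kind": kind, **fields})

    def dump(self):
        # One JSON object per line, ready for any log pipeline.
        return "\n".join(json.dumps(e) for e in self.events)

tracer = Tracer()
tracer.emit("plan", subtasks=["fetch", "summarize"])
tracer.emit("tool_call", tool="search", args={"q": "release notes"})
tracer.emit("tool_result", tool="search", ok=True, artifact="doc-17")
tracer.emit("decision", grounded_on="doc-17")
```

From this trace alone you can answer every question in the list above: how the task was decomposed, which tools ran, and which artifact the decision was grounded on.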

The next layer is not better prompting. It is better runtime architecture.

07 — ARCHITECTURE

Once tasks become long-running and stateful, the substrate matters as much as the model.

The future of agent systems will not be won by prompt size alone. It will come from better execution environments: structured working memory, durable runs, observable state transitions, artifact-based handoffs, recovery and rollback, and evaluation built into the loop.

The model still matters. But the harness around it determines whether that intelligence translates into reliable work.

Model Intelligence
Prompt Engineering
Tool Integration
Harness Layer (where the real leverage is):
  Working Memory
  Durable Execution
  Observability & Evals

What we are building

At Bord Zero, we are exploring a different foundation for agents. Not just smarter outputs, but systems that can:

  • Stay coherent across long tasks
  • Work with structured memory instead of transcript sprawl
  • Leave inspectable artifacts behind
  • Recover from interruption
  • Improve through evals instead of intuition

We think agents need a stronger harness before they need more hype.

Join the Research Preview