The Problem Space

Most agent systems are built for demos, not real work.

Bigger models and longer context windows have moved the frontier. But they haven't solved the hard part: getting agents to stay coherent over time, across tools, side effects, failures, and changing state.

The current wave of agent systems is exciting for a reason. Models can search, write code, call tools, summarize documents, and complete tasks that were out of reach even a year ago.

But once you move past the first five minutes, the cracks show.

Real work is not a one-shot prompt. It unfolds over time. It crosses tools. It accumulates state. It produces artifacts. It fails halfway through. It picks up again later. It needs to remember what changed, not just what was said.

The problem is not that the models are useless. The problem is that the runtime around them is still immature.

01 — MEMORY

Long context is not the same thing as long-term memory.

Larger context windows help. But they do not automatically create durable memory. As tasks get longer, agents drown in their own transcripts. Important facts get buried under tool outputs, retries, and stale state.

Good memory is not just retrieval. It means deciding what to store, recognizing when old facts are superseded, separating working state from history, preserving provenance, and retrieving the right detail at the right time.

Transcript-based context
× Everything flattened into text
× Stale facts persist forever
× No provenance or lineage
× Context overload at scale
Structured working memory
✓ Active state separated from history
✓ Facts superseded explicitly
✓ Full provenance on every fact
✓ Selective context loading
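The right-hand column can be sketched in a few lines. This is an illustrative Python model, not a real system: the `WorkingMemory` class and its field names are assumptions, but it shows active state, explicit supersession, and provenance living side by side instead of being flattened into a transcript.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Fact:
    key: str                     # what the fact describes, e.g. "ticket.status"
    value: str
    source: str                  # provenance: which tool call produced it
    recorded_at: datetime
    superseded_by: Optional["Fact"] = None

class WorkingMemory:
    """Active state kept apart from history, with explicit supersession."""

    def __init__(self):
        self._active = {}        # latest fact per key: the working state
        self._history = []       # full lineage, never truncated

    def record(self, key, value, source):
        fact = Fact(key, value, source, datetime.now(timezone.utc))
        old = self._active.get(key)
        if old is not None:
            old.superseded_by = fact     # retired explicitly, not deleted
        self._active[key] = fact
        self._history.append(fact)
        return fact

    def active(self, key):
        return self._active.get(key)

    def lineage(self, key):
        return [f for f in self._history if f.key == key]

mem = WorkingMemory()
mem.record("ticket.status", "open", source="jira.get_issue#1")
mem.record("ticket.status", "resolved", source="jira.transition#2")
```

The stale fact is never lost, but it can never be mistaken for the current one either: context loading reads `active()`, auditing reads `lineage()`.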
02 — RUNTIME

Most failures are runtime failures, not intelligence failures.

When an agent breaks in production, it is rarely because the model suddenly became incapable. It is because the system around it did not manage execution well.

Databases use logs and recovery. Workflow engines use durable execution. HPC systems use checkpoint-and-restart. Agent systems are now running into the same class of problems — without the same class of solutions.

No checkpoints or resumable state
No failure recovery mid-task
No timeouts on tool calls
No artifact tracking
No idempotent tool behavior
No operator visibility into runs

A transcript is a poor operating system.
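What checkpoint-and-restart looks like at the agent layer can be sketched in one function. The `run_with_checkpoints` helper and the step names are hypothetical, but the mechanics are the same ones durable-execution engines use: persist progress after every step, and on restart skip anything already done.

```python
import json
import os
import tempfile

# Hypothetical sketch of checkpoint-and-restart for an agent run.
# Steps are (name, fn) pairs; each fn receives the results so far.

def run_with_checkpoints(steps, checkpoint_path):
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)             # resume prior progress

    for name, step in steps:
        if name in done:
            continue                        # completed in an earlier run
        done[name] = step(done)
        # Write atomically so a crash mid-write cannot corrupt the log.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(checkpoint_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)
    return done

ckpt = os.path.join(tempfile.mkdtemp(), "run.json")
steps = [
    ("fetch", lambda done: "raw data"),
    ("summarize", lambda done: done["fetch"].upper()),
]
state = run_with_checkpoints(steps, ckpt)
```

Kill the process between steps and the next invocation picks up where it stopped, because the source of truth is the checkpoint file, not the conversation.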

03 — STATE

When everything is flattened into text, the system loses shape.

Too many agent systems treat the chat transcript as the primary source of truth. That works for short interactions. It breaks down for serious work.

A transcript is good for communication. It is not good for structured working memory, reliable state transitions, artifact lineage, selective context loading, or recovery after interruption.

Agents need more than a conversation buffer. They need a real workspace.

× Structured working memory
× Reliable state transitions
× Artifact lineage & provenance
! Selective context loading
× Recovery after interruption
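"Reliable state transitions" has a concrete shape. A minimal sketch, with illustrative state names: legal transitions are enumerated up front, illegal ones fail loudly, and every change is recorded, instead of the run's status being inferred from transcript text.

```python
# Illustrative run states; a real workspace would have more.
ALLOWED = {
    "pending": {"running"},
    "running": {"waiting_approval", "failed", "done"},
    "waiting_approval": {"running", "failed"},
    "failed": {"running"},               # a failed run can be resumed
}

class Run:
    def __init__(self):
        self.state = "pending"
        self.transitions = []            # auditable record of every change

    def transition(self, new_state, reason):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.transitions.append((self.state, new_state, reason))
        self.state = new_state

run = Run()
run.transition("running", "operator started the task")
run.transition("failed", "tool call timed out")
run.transition("running", "resumed from checkpoint")
run.transition("done", "task completed")
```

Note that "done" has no outgoing edges: a finished run cannot silently restart, which is exactly the kind of guarantee a transcript cannot give.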
04 — SIDE EFFECTS

Tool use creates real side effects. Autonomy without structure is hidden risk.

As soon as an agent touches external systems, the stakes change. Editing code, triggering workflows, changing tickets, updating records — these are not reversible text completions.

The system needs stronger guarantees: what exactly happened, what changed, what can be retried, what must never be repeated, what needs approval, what can be rolled back.

The more capable agents become, the more important these controls get.

Risk without visibility: Critical
Retry safety on tool calls: Low
Rollback coverage: Minimal
Operator approval gates: Rare
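One of those guarantees, retry safety, can be sketched with an idempotency key. The `SideEffectGuard` class is hypothetical; the point is that a repeated call replays the recorded result instead of firing the side effect again.

```python
import hashlib
import json

class SideEffectGuard:
    """Replay recorded results for repeated calls instead of re-firing them."""

    def __init__(self):
        self._log = {}                   # idempotency key -> recorded result

    def call(self, tool, args):
        # Key the call by tool name plus canonicalized arguments.
        payload = json.dumps({"tool": tool.__name__, "args": args},
                             sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key in self._log:
            return self._log[key]        # replayed, side effect not repeated
        result = tool(**args)
        self._log[key] = result          # record exactly what happened
        return result

sent = []

def send_email(to, body):
    sent.append(to)                      # the non-reversible side effect
    return f"sent to {to}"

guard = SideEffectGuard()
first = guard.call(send_email, {"to": "ops@example.com", "body": "deploy done"})
second = guard.call(send_email, {"to": "ops@example.com", "body": "deploy done"})
```

A retry loop wrapped around `guard.call` can now be as aggressive as it likes: the email goes out once.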
05 — EVALS

Benchmarks are useful, but they can hide the real difficulty.

A lot of systems look strong on narrow evaluations and still fall apart in live environments. Benchmarks compress away the hardest parts of production: changing environments, partial failure, stale memory, messy tool outputs, long-horizon execution, and cost constraints.

Even strong benchmark gains can come from ensembling, repeated attempts, or benchmark-specific tuning rather than a more trustworthy runtime. Benchmarks are one layer of truth, not the whole picture.

What benchmarks test
Clean, static environments
Short-horizon tasks
Single-attempt scoring
Isolated tool calls
What production demands
Changing, noisy environments
Hours-long execution
Consistency across runs
Chained tools with side effects
06 — OBSERVABILITY

If you cannot inspect the run, you cannot improve the system.

Teams can often see the final answer, but not the execution story: how the task was decomposed, what tools were used, what state was carried forward, which artifact grounded the decision, where the failure actually occurred.

Without traces, artifacts, and evals, improvement becomes guesswork. And once improvement becomes guesswork, complexity compounds faster than capability.

× How the task was decomposed
× What tools were actually called
× What state was carried forward
! Which artifact grounded the decision
× Where the real failure occurred
! Skill vs. luck in the outcome
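The minimum an operator needs can be sketched as a structured trace. The `Tracer` class and event names are illustrative: every plan, tool call, and decision lands in a log that can be replayed after the run, instead of being reconstructed from the final answer.

```python
import json
import time

class Tracer:
    """Structured trace events per step, replayable after the run."""

    def __init__(self):
        self.events = []

    def emit(self, kind, **fields):
        self.events.append({"ts": time.time(), "kind": kind, **fields})

    def dump(self):
        # One JSON object per line, ready for any log pipeline.
        return "\n".join(json.dumps(e) for e in self.events)

tracer = Tracer()
tracer.emit("plan", subtasks=["fetch", "summarize"])
tracer.emit("tool_call", tool="search", args={"q": "release notes"})
tracer.emit("tool_result", tool="search", ok=True, artifact="doc-17")
tracer.emit("decision", grounded_on="doc-17")
```

From this trace alone you can answer every question in the list above: how the task was decomposed, which tools ran, and which artifact the decision was grounded on.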

The next layer is not better prompting. It is better runtime architecture.

07 — ARCHITECTURE

Once tasks become long-running and stateful, the substrate matters as much as the model.

The future of agent systems will not be won by prompt size alone. It will come from better execution environments: structured working memory, durable runs, observable state transitions, artifact-based handoffs, recovery and rollback, and evaluation built into the loop.

The model still matters. But the harness around it determines whether that intelligence translates into reliable work.

Model Intelligence
Prompt Engineering
Tool Integration
Harness Layer (where the real leverage is):
  Working Memory
  Durable Execution
  Observability & Evals

What we are building

At Bord Zero, we are exploring a different foundation for agents. Not just smarter outputs, but systems that can:

  • Stay coherent across long tasks
  • Work with structured memory instead of transcript sprawl
  • Leave inspectable artifacts behind
  • Recover from interruption
  • Improve through evals instead of intuition

We think agents need a stronger harness before they need more hype.

Join the Research Preview