Research Explainers May 28, 2026 7 min read

Agent Logs Should Be the System of Record

The paper’s practical point is simple: if agents are going to do long-running work, their logs cannot stay as debugging exhaust. The log has to become the agent’s source of truth.

Source note: Yohei Nakajima. “The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems.” arXiv:2605.21997, submitted May 21, 2026. https://arxiv.org/abs/2605.21997

Why This Paper Matters

Most agent systems are built around the language model. The loop starts with messages, tools get attached, memory gets layered on, guardrails accumulate, and observability arrives later when the system starts behaving in ways people need to explain.

That order works for demos. It is much weaker for systems where the agent’s work has to be audited, replayed, forked, compared, or trusted after the fact.

This paper proposes a different center of gravity. ActiveGraph treats the append-only event log as the source of truth. The agent’s graph state is a deterministic projection of that log. Behaviors react to changes in the graph and emit new events back into the log. In that design, the system does not merely record what the agent did. The record is the thing the agent is made from.

The paper does not claim higher task accuracy. That matters. Its contribution is architectural: a way to make agent runs replayable, forkable, and traceable through the same primitive.

The Idea in Plain English

Think of a conventional agent as a worker with a notebook. The real work happens in the worker’s head and tools. The notebook is useful later, but it is incomplete and secondary.

ActiveGraph flips that. The notebook is the system of record. Every goal, tool call, model request, model response, rule change, object, relation, and output is written as an event. The working graph is rebuilt by replaying those events. Behaviors watch the graph for patterns and write more events when their conditions match.

There is no top-level workflow script pushing state from step to step. Coordination happens through the shared graph. A planner behavior may create a company object. A question generator sees the company and emits research questions. A researcher sees unanswered questions and adds claims, evidence, and documents. A memo writer sees enough supported claims and synthesizes a memo.

The control flow is still real. It has just moved from an external orchestration script into data-driven reactions whose causes are logged.

What the Researchers Tested

The paper presents ActiveGraph as a systems architecture and includes a reproducible investment-diligence demo.

The demo starts from company names and produces diligence artifacts. According to the paper, the bundled quickstart runs on three companies against recorded fixtures, with no API key required. The reported run produces 671 events, 93 objects, 76 relations, 103 model calls, and 48 tool calls. The graph includes companies, questions, documents, claims, evidence items, a contradiction, risks, and memos.

The important part is not the memo quality. The important part is that every artifact in the memo can be traced back through the event log: which behavior created it, which event caused that behavior to run, which model request produced the text, which question it addressed, which document it came from, and which evidence supports it.

That is the paper’s evidence anchor. It demonstrates the mechanism, not a benchmark win.

What They Found

Replay becomes a runtime property

Because the graph is a fold over the event log, a run can be reconstructed from its history. The paper distinguishes first execution from replay. First execution can call nondeterministic models and tools. Replay does not pretend those calls are reproducible. It serves the already-recorded responses from a content-addressed cache.

Model responses are keyed by a hash of the full request, including system message, user messages, model identifier, tool definitions, and output schema. Tool responses are keyed by tool name and arguments. During replay, matching requests get the stored response instead of making a new call.

That gives the system a real answer to a painful production question: can we reconstruct what happened without re-running the agent and hoping the model does the same thing again?

Forking becomes cheap enough to use

The paper’s strongest operational idea is the fork. A run can branch at a chosen event. The fork inherits the parent log up to that point, replays the shared prefix from cache, and only pays for live execution after the cutoff.

That makes counterfactual work much more practical. If a 200-step run changes a setting at step 150, the first 149 steps do not need to be re-executed. The fork can then be structurally diffed against the parent: which objects changed, which relations changed, which patches changed, and what downstream work moved as a result.

For agent operations, this is more than a developer convenience. It is a way to test alternative instructions, tools, policies, or evidence paths without losing lineage.

Lineage becomes a first-class deliverable

In the diligence example, a memo statement is a claim object with provenance, not a loose paragraph in an output blob. It links back to the behavior that created it, the triggering event, the model request, the question it addressed, the document it came from, and the evidence that supports it.

That matters most in domains where the answer is not enough. Due diligence, research, compliance, science, finance, legal work, and enterprise workflow automation all need to know why a claim exists and how it got into the system.

The paper’s phrase “the log is the agent” lands here. If the log is complete enough, the agent’s work can be inspected as a causal structure rather than a transcript with some metadata attached.

Why It Happens

The design combines three older ideas in a way that fits agent systems unusually well.

First, event sourcing: state is derived from an immutable sequence of events. This is common in data systems, but agent frameworks often treat logs as observability artifacts rather than state.

Second, reactive dataflow: derived values update when inputs change. In ActiveGraph, behaviors subscribe to event and graph-shape patterns. A behavior can fire when a claim addresses an unanswered question or when a typed relation appears between two objects.

Third, blackboard-style coordination: independent components read from and write to a shared knowledge structure instead of calling each other directly. The paper argues that LLMs make this older model more useful because behaviors can now be flexible model-backed routines rather than brittle hand-coded expert rules.

The graph matters because a flat log alone is not enough. The log gives reproducibility. The graph gives behaviors something expressive to watch and compare. Together they make it possible to react, replay, fork, and diff agent work.

What This Means for Builders

The builder lesson is not “use this framework.” It is broader: production agent systems need a real system of record.

If an agent is short-lived, low-risk, and disposable, a transcript plus traces may be fine. If it is long-running, expensive, self-modifying, compliance-sensitive, or expected to support human review, the architecture has to answer harder questions.

Can a run be replayed exactly from stored history? Can a human inspect why a fact entered context? Can a branch test a different instruction without re-paying for the whole prefix? Can two runs be diffed structurally? Can a model call be tied to the artifact it produced? Can a rule change be audited and rolled back?

ActiveGraph gives one concrete design for those answers. Builders do not have to copy the full design to take the paper seriously. They do need to stop treating logs, traces, memory, and state as separate afterthoughts.

What This Means for Buyers and Operators

For buyers, the useful question is not whether a vendor has “agent memory.” Memory can mean a vector store, a summary, a temporal graph, or a pile of chat history. Those are not the same as replayable operational history.

The stronger question is: can the system explain a result from goal to artifact? Can it replay a run without making fresh model calls? Can it fork from a point in the history? Can it show what changed between a parent run and a counterfactual run? Can it prove which tool calls happened and which model responses were used?

Those questions will separate demo agents from operational agents. In regulated, high-value, or long-running settings, a beautiful answer with weak lineage is still operationally fragile.

The paper also gives operators a useful way to think about self-improving agents. Self-modification is dangerous when it is invisible. If rule changes, prompt changes, and tool changes are themselves logged events, then they can be replayed, forked, compared, and rolled back. That does not solve self-improvement, but it makes the problem less mystical and more operational.

What to Watch Next

The field should watch whether log-first agent architectures show measurable advantages in real deployments. The paper demonstrates mechanisms, not downstream performance.

Builders should watch replay cost. Long-lived agents can produce enormous logs. The paper notes that million-event runs are replayed in full today, and that checkpointing or compaction is future work.

Operators should watch side-effecting tools. Recording a tool response makes replay deterministic, but it does not undo the first real-world mutation. Sending an email, updating a CRM record, or changing production infrastructure still needs policy controls outside replay.

Researchers should watch distributed ordering. A single append-only log gives clean ordering inside one run. Multi-agent systems with concurrent writers over a shared graph raise harder questions that the paper does not resolve.

Limitations and Caveats

This is a single-author arXiv systems paper, not a broad empirical evaluation. The paper is explicit about that. It does not show that ActiveGraph improves task accuracy, speed, cost, or user outcomes compared with conventional agent frameworks.

The determinism contract is also a real burden. Behaviors must avoid unmanaged randomness, wall-clock reads, fresh UUIDs, uncontrolled I/O, and mutable global state. The system catches violations dynamically during replay, not statically before they ship.

The storage tradeoff is real too. Model and tool responses are recorded so replay can be deterministic. That is exactly what makes the system useful, but it means storage grows with run size.

Finally, the architecture moves coordination rather than eliminating it. A reactive graph can still loop, diverge, or trigger too much work. The runtime uses budgets for events, behavior calls, model calls, patches, recursion depth, wall-clock time, and cost. Budgets are necessary, but they are not a proof of termination.

Source

Nakajima, Yohei. (2026). The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems. arXiv preprint arXiv:2605.21997. Available at: https://arxiv.org/abs/2605.21997

Research Browse Research & Deep Dives

Move through market maps, company deep dives, cross-profile patterns, papers, reports, and technical explainers.

Start Here Find the best entry point

Use the site map to choose a path through AI, operations, strategy, profiles, and series.

Topic Explore AI systems

Read essays on AI adoption, agents, business systems, and the changing shape of work.