When an AI workflow fails, the first question is usually simple: what happened?

Getting that answer is often awful.

Someone sees a bad output, a strange customer reply, a cost spike, a wrong CRM update, a missed escalation, or a support queue full of confusing drafts. Then the team tries to reconstruct the path from memory. Which model ran? What prompt version? What context was retrieved? What tool was called? What did the tool return? Did a human approve it? Was the output edited? Did the system retry? Was this an exception or normal behavior?

If the logs cannot answer those questions, the company is operating blind.

AI observability has to serve two audiences. Engineers need traces, latency, errors, tool-call details, and debugging context. Operators need a higher-level view: workflow health, quality signals, spend, review load, exceptions, escalation patterns, and which actions happened under which authority.

Raw traces alone are not enough. They are too noisy for most operators. Executive dashboards alone are not enough either. They hide the evidence needed to investigate.

A useful control plane links the two.

At minimum, meaningful AI activity should log the actor, workflow, model, prompt or policy version, retrieved context, tool calls, cost, latency, output, confidence or quality signal where available, approval status, human edits, final action, and linked business object. That business object matters. Logs should tie back to the ticket, account, contract, pull request, invoice, candidate, customer, or project. Otherwise the audit trail floats outside the work.

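Concretely, each such event can be one structured record. Here is a minimal sketch in Python with illustrative field names; nothing below is a prescribed schema, just the shape of the information.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AuditEvent:
    event_id: str
    timestamp: str                     # ISO 8601
    actor: str                         # agent, workflow step, or user id
    workflow: str
    model: str
    prompt_version: str                # prompt or policy version in effect
    retrieved_context: list[str]       # ids of retrieved documents, not raw text
    tool_calls: list[dict[str, Any]]   # name, arguments, result summary, status
    cost_usd: float
    latency_ms: int
    output: str
    quality_signal: Optional[float]    # confidence or eval score, where available
    approval_status: str               # "auto", "pending", "approved", "rejected"
    human_edits: Optional[str]         # diff or edited text, if a reviewer changed it
    final_action: str                  # what actually happened downstream
    business_object: str               # e.g. "ticket:48211" -- ties the record to the work
```
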
The audit log should also record boundaries. Was the action allowed automatically? Was it a dry run? Did it require approval? Which policy applied? Was any limit hit? Did the system escalate? Did it fall back to another model? Did it use retained memory?

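Those boundary questions can ride along on the same record. A sketch of the extra fields, again with made-up names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionBoundary:
    allowed_automatically: bool      # inside standing authority, no approval needed
    dry_run: bool                    # simulated only, no side effects
    approval_required: bool
    policy_id: Optional[str]         # which policy version applied
    limit_hit: Optional[str]         # e.g. "spend_cap" or "rate_limit", None if none
    escalated: bool                  # handed to a human or a stricter path
    fallback_model: Optional[str]    # set if the primary model was swapped out
    used_retained_memory: bool       # drew on memory persisted from earlier runs
```
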
This sounds like a lot until something breaks. Then it sounds like the bare minimum.

Observability should catch quiet failure too. Not every AI problem arrives as an incident. Sometimes acceptance rates slowly fall. Review edits creep up. Premium model use expands for no clear reason. A new prompt increases latency. An agent hits retry limits more often. Human reviewers start overriding the same recommendation. A workflow stops escalating edge cases because user language changed.

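One way to surface this kind of drift is to compare a recent window of any per-output signal against its own history. A rough sketch; the window size and threshold are arbitrary placeholders, not recommendations.

```python
from statistics import mean

def drifting_down(values: list[float], window: int = 200, drop: float = 0.10) -> bool:
    """True if the mean of the last `window` values has fallen more than
    `drop` (relative) below the mean of everything before them."""
    if len(values) < 2 * window:
        return False                 # not enough history to compare
    baseline = mean(values[:-window])
    recent = mean(values[-window:])
    return baseline > 0 and (baseline - recent) / baseline > drop

# e.g. acceptance flags per output, oldest first: 1 = accepted as-is, 0 = edited or overridden
# if drifting_down(acceptance_flags): flag the workflow for review
```
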
The control plane should surface those patterns before they become folklore.

Good operator views are opinionated. They do not show every token. They show the signals that decide whether the workflow is healthy (a sketch of computing a few of these from the audit log follows the list):

  • volume and completion rate
  • acceptance and override rate
  • review time and edit distance
  • escalation rate and missed escalation samples
  • tool-call failure rate
  • cost per accepted output
  • model mix and routing changes
  • top failure categories
  • policy exceptions
  • drift from baseline evals

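Several of these fall straight out of the audit events sketched earlier. Assuming the illustrative AuditEvent fields from above:

```python
def operator_signals(events: list[AuditEvent]) -> dict[str, float]:
    accepted = [e for e in events
                if e.approval_status in ("auto", "approved") and not e.human_edits]
    overridden = [e for e in events
                  if e.approval_status == "rejected" or e.human_edits]
    total_cost = sum(e.cost_usd for e in events)
    return {
        "volume": float(len(events)),
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "override_rate": len(overridden) / len(events) if events else 0.0,
        "cost_per_accepted_output": total_cost / len(accepted) if accepted else 0.0,
    }
```

Edit distance, escalation rate, and model mix come from the same records. The point is that the operator view is derived from the audit log, not maintained as a separate source of truth.
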
This is where audit and improvement meet. The log is not mainly for blame after an incident. It is the raw material for better prompts, better routing, better evals, better memory, better tool boundaries, and better human review design.

Privacy still matters. Logging everything forever is not maturity. Some prompts and context contain sensitive data. The control plane needs retention rules, redaction, role-based log access, and a way to preserve enough evidence without turning logs into a second data leak.

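One common compromise is to keep the structure of each event but replace sensitive payloads with fingerprints before long-term retention. A sketch, reusing the illustrative AuditEvent above; which fields to redact is a policy decision, not this code.

```python
import hashlib
from dataclasses import replace

def fingerprint(text: str) -> str:
    # Keep proof the content existed (and can be matched later) without storing it.
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def redact_for_retention(event: AuditEvent) -> AuditEvent:
    return replace(
        event,
        output=fingerprint(event.output),
        human_edits=fingerprint(event.human_edits) if event.human_edits else None,
        tool_calls=[{"name": c.get("name"), "status": c.get("status")}
                    for c in event.tool_calls],
    )
```
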
The right standard is inspectability. Authorized people should be able to understand what the system did, why it was allowed, what evidence it used, and who changed or approved the result.

If that inspectability is missing, trust becomes political. One team says the AI is working. Another says it is creating mess. Finance sees cost. Risk sees exposure. Operators see exceptions. Nobody has the shared evidence to settle it.

Audit logs make AI work legible.

That may sound unexciting. Good. Legibility is usually boring until the moment you need it. Then it is the difference between an incident review and a group chat séance.


This is part 8 of 10 in The AI Control Plane.