Evals are easy to admire and hard to operationalize.
A team builds a test set. The model performs well. A demo looks convincing. Everyone agrees quality matters. Then the workflow changes, prompts drift, a model is upgraded, retrieval starts pulling different context, a tool schema changes, and nobody knows whether the system is still good enough.
That is why evals belong in the control plane.
An eval is not a spreadsheet of examples that made the launch deck look responsible. It is a release gate. It should decide whether an AI system, or a change to its prompts, models, routing rules, tools, or context sources, is allowed to move forward.
This matters because AI systems fail in ways normal software tests do not fully catch. The code may run. The API may respond. The workflow may complete. The output may still be wrong, low-confidence, subtly off-policy, too expensive, too verbose, too risky, or likely to create review burden downstream.
A useful eval system starts with the work. What does accepted output look like? What errors are tolerable? What errors are serious? Who reviews? What evidence should the model use? Which edge cases matter? What is the cost of a false positive, false negative, bad recommendation, bad action, or missed escalation?
The answers to those questions should produce more than one score. Operators need a quality profile, not a single accuracy number.
For a support drafting workflow, that profile might include factual accuracy, policy compliance, tone, completeness, escalation detection, customer-specific constraints, and review edit distance. For account research, it might include source grounding, relevance, freshness, hallucination rate, and usefulness to the seller. For code, it might include tests, security checks, maintainability, and reviewer acceptance.
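To make that concrete, here is a minimal sketch, in Python, of what a quality profile could look like as data the control plane reads. Every metric name and threshold is a hypothetical stand-in for the answers to the questions above; the point is that each metric carries its own floor (or ceiling) and an explicit flag for whether a failure blocks release.

```python
# A hypothetical quality profile for the support drafting workflow.
# Names and thresholds are illustrative, not recommendations.
SUPPORT_DRAFTING_PROFILE = {
    "factual_accuracy":     {"min": 0.97, "blocking": True},
    "policy_compliance":    {"min": 0.99, "blocking": True},
    "escalation_detection": {"min": 0.95, "blocking": True},
    "tone":                 {"min": 0.90, "blocking": False},
    "completeness":         {"min": 0.90, "blocking": False},
    # Median number of edits a reviewer makes before sending; lower is better.
    "review_edit_distance": {"max": 12,   "blocking": False},
}
```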
The control plane should connect those evals to releases. If a prompt change reduces escalation detection, it should not ship quietly. If a cheaper model passes extraction but fails edge cases, route only the safe class of work to it. If retrieval changes improve coverage but increase hallucinated citations, pause the rollout. If a model upgrade increases quality but doubles cost, make that trade visible.
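Given a profile like that, the gate itself can be a small function rather than a judgment call. A sketch, assuming hypothetical `baseline` and `candidate` dictionaries of scores from running the same eval set against the shipped and proposed versions:

```python
def gate_release(profile, baseline, candidate, tolerance=0.01):
    """Return (blocking, advisory) reasons not to ship the candidate."""
    blocking, advisory = [], []
    for metric, rule in profile.items():
        if metric not in candidate:
            # An unmeasured metric is itself a reason to block.
            blocking.append(f"{metric}: not measured for candidate")
            continue
        old, new = baseline[metric], candidate[metric]
        higher_is_better = "min" in rule
        # Hard floor or ceiling from the quality profile.
        out_of_bounds = new < rule["min"] if higher_is_better else new > rule["max"]
        # Silent regression against the shipped version, beyond noise.
        # One shared tolerance is a simplification; real gates set one per metric.
        regressed = (old - new if higher_is_better else new - old) > tolerance
        if out_of_bounds or regressed:
            reason = f"{metric}: {old} -> {new}"
            (blocking if rule["blocking"] else advisory).append(reason)
    return blocking, advisory
```

Wired into CI, this is what "should not ship quietly" means: the pipeline fails with named reasons, and the advisory list surfaces non-blocking regressions, like tone drifting, without stopping the release.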
Evals also need ownership. Someone must own the gold set, the grading rules, the failure taxonomy, and the release threshold. Otherwise evals decay into artifacts nobody trusts. The owner does not have to be a central AI committee. In many cases, the workflow owner should own the eval with help from product, engineering, risk, or operations.
The best eval sets are living assets. They include recent failures, hard cases, policy changes, high-value scenarios, and examples where humans disagreed. If the eval set only contains easy launch examples, it is a decoration.
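One way to keep the set alive is to make provenance part of the schema, so every case says why it exists and when it was last reviewed. A hypothetical record shape:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input: str
    expected: str        # an exact answer, or a rubric for a grader
    source: str          # "production_failure", "policy_change",
                         # "human_disagreement", "high_value", ...
    added: str           # ISO date the case entered the set
    last_reviewed: str   # stale cases are easy to find and retire
    notes: str = ""
```

Counting cases by `source` is a quick decoration test: a set that is all launch examples and no production failures is not living.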
Release gates should be proportionate. A low-risk internal summarizer may need a light regression check and sampling. A customer-facing billing assistant needs tighter gates. An agent with write access to production systems needs serious pre-release testing, dry runs, and rollback plans.
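That proportionality is easiest to enforce when gate requirements are declared per risk tier rather than hard-coded into one pipeline. The tiers, names, and numbers below are illustrative assumptions:

```python
# Hypothetical gate requirements by risk tier.
GATES_BY_RISK_TIER = {
    "internal_low_risk": {            # e.g. an internal summarizer
        "eval_set": "light_regression",
        "human_review_sample": 0.02,  # spot-check 2% of outputs
        "dry_run": False,
    },
    "customer_facing": {              # e.g. a billing assistant
        "eval_set": "full_gold_set",
        "human_review_sample": 0.10,
        "dry_run": False,
        "sign_off": ["workflow_owner"],
    },
    "production_write_access": {      # e.g. an agent that mutates records
        "eval_set": "gold_set_plus_adversarial",
        "human_review_sample": 0.25,
        "dry_run": True,              # shadow-run on real traffic first
        "rollback_plan": True,
        "sign_off": ["workflow_owner", "risk"],
    },
}
```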
The control plane should also track eval results over time. Quality is not a one-time property. It drifts with model updates, data changes, user behavior, prompt edits, new products, seasonal patterns, and workflow changes. A system that was safe last quarter may be mediocre now.
This is where evals connect to observability. Production logs should feed back into evals. Human corrections should create new test cases. Escalations should update failure categories. Cost and latency should sit beside quality so teams understand the full trade.
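That loop can be mundane code. A sketch that reuses the hypothetical `EvalCase` shape from above, assuming simple dictionary shapes for the production log record and the reviewer's correction:

```python
def correction_to_eval_case(log_record: dict, correction: dict) -> EvalCase:
    """Turn a human correction from production into a new eval case.

    Assumes `log_record` carries the logged model call and `correction`
    carries what the reviewer changed; both shapes are hypothetical.
    """
    return EvalCase(
        case_id=f"prod-{log_record['trace_id']}",
        input=log_record["input"],
        expected=correction["final_text"],   # what a human actually accepted
        source="production_failure" if correction["was_wrong"]
               else "human_disagreement",
        added=correction["date"],
        last_reviewed=correction["date"],
        notes=correction.get("reviewer_note", ""),
    )
```

Run the gold set on a schedule against the live system, not only at release, so drift shows up as a trend rather than an incident.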
The discipline is less glamorous than the demo, but far more useful.
Before an AI system touches real work, ask: what would make us block this release? If the team cannot answer, the eval is probably not a gate. It is a hope.
AI systems need release gates because operators need a way to say "good enough for this use" with evidence.
Without that, every deployment is an experiment on the business.
This is part 7 of 10 in The AI Control Plane.
