Evals are often treated as technical hygiene.
That framing is too small.
For AI products, evals are product requirements. They define what good means, what failure means, what can ship, what must be reviewed, and what needs to improve before users are exposed to a change.
If a team cannot evaluate an AI feature, it cannot responsibly manage it.
Quality bars must be explicit
Traditional product requirements describe behavior: users can upload a file, assign a task, export a report.
AI product requirements also need quality bars: the summary must include the decision, owner, deadline, and unresolved questions; the classification must match policy categories; the answer must cite approved sources; the draft must not invent commitments.
"Good enough" cannot live only in someone's head.
A model output that feels fine in a demo may fail on the actual distribution of user inputs. Evals make that visible.
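As a sketch: the meeting-summary bar above can be encoded as a check rather than a judgment call. The structured output and field names here (`decision`, `owner`, `deadline`, `open_questions`) are assumptions for illustration, not a prescription.

```python
# Minimal sketch: one quality bar ("the summary must include the decision,
# owner, deadline, and unresolved questions") expressed as a check the team
# can run on every candidate output. Field names are illustrative.

def check_summary_quality(summary: dict) -> list[str]:
    """Return a list of failed criteria; an empty list means the bar is met."""
    failures = []
    for field in ("decision", "owner", "deadline"):
        if not summary.get(field):
            failures.append(f"missing or empty field: {field}")
    if "open_questions" not in summary:  # may be empty, but must be stated
        failures.append("unresolved questions not addressed")
    return failures

# Example: a summary that forgot the deadline fails explicitly, not silently.
print(check_summary_quality({"decision": "Ship v2", "owner": "Dana",
                             "deadline": "", "open_questions": []}))
```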
Gold sets are product assets
A gold set is a collection of representative inputs with expected outputs, ratings, or review criteria.
It should not be an afterthought created by one engineer the night before launch.
Product, design, engineering, ML, support, and operations should help define it. Support knows where users get confused. Ops knows the exception cases. Product knows the intended outcome. Engineering knows what can be tested. Domain experts know what wrong looks like.
A good gold set reflects the product promise.
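What that can look like in practice, as a minimal sketch (the fields, cases, and sources here are invented for illustration):

```python
# A sketch of gold-set entries as plain data. The cases are illustrative;
# the real set should come from product, support, ops, engineering, and
# domain experts, as described above.

GOLD_SET = [
    {
        "id": "normal-001",
        "input": "Transcript of a routine weekly planning meeting",
        "must_include": ["decision", "owner", "deadline"],
        "must_not_include": ["invented commitments"],
        "risk": "low",
        "source": "product",
    },
    {
        "id": "edge-017",
        "input": "Meeting with two conflicting decisions and no clear owner",
        "must_include": ["both decisions", "flag that ownership is unresolved"],
        "must_not_include": ["a fabricated owner"],
        "risk": "high",
        "source": "support escalation",
    },
]
```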
Evals are not only pass/fail
Some outputs can be checked automatically. Did the answer include a citation? Did it use the required schema? Did it avoid restricted fields? Did it choose one of the allowed categories?
Other outputs need human review. Is the answer useful? Is it misleading? Did it miss the key risk? Is the tone appropriate? Would a user trust this?
Use both.
Automatic evals are scalable and good for regression. Human evals catch judgment, usefulness, and weirdness. Production monitoring catches what your test set missed.
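A minimal sketch of how the two can fit together, assuming an eval runner that calls your own model wrapper and automatic checks; the function names, risk field, and sampling rate are placeholders.

```python
# Sketch: automatic checks handle what can be verified mechanically; anything
# that passes but is high-risk still gets sampled for human review.
# `run_model` and `check_summary_quality` stand in for your own stack.

import random

def evaluate(gold_set, run_model, check_summary_quality, human_sample_rate=0.2):
    results = {"auto_fail": [], "needs_human_review": [], "passed": []}
    for case in gold_set:
        output = run_model(case["input"])
        failures = check_summary_quality(output)
        if failures:
            results["auto_fail"].append((case["id"], failures))
        elif case.get("risk") == "high" or random.random() < human_sample_rate:
            results["needs_human_review"].append(case["id"])
        else:
            results["passed"].append(case["id"])
    return results
```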
Artifact: eval release checklist
```text
AI Eval Release Checklist

- Product promise
  - What user outcome is this feature responsible for?
  - What does a good output enable the user to do?
- Quality dimensions (define pass criteria for each)
  - correctness
  - completeness
  - source grounding
  - policy compliance
  - tone / format
  - latency
  - cost
  - recoverability
- Gold set
  - Representative normal cases
  - Edge cases
  - High-risk cases
  - Known historical failures
  - Permission-sensitive cases
  - Adversarial or messy inputs
- Regression tests
  - Current model/prompt/retrieval stack passes release threshold
  - Previous known failures do not reappear
  - Vendor/model changes are tested before rollout
- Human review
  - Domain reviewers identified
  - Review rubric defined
  - Disagreement process defined
  - Sample size appropriate for risk
- Release gates
  - Minimum quality threshold met
  - Latency and cost within limits
  - Support plan ready
  - Rollback path ready
  - Owner assigned after launch
- Monitoring after launch
  - User correction rate
  - Escalation rate
  - Support tickets by failure mode
  - Cost per successful outcome
  - Drift indicators
```
This checklist should live in the product process, not a forgotten notebook.
Evals connect product and engineering
Evals are where vague product expectations become buildable constraints.
If product says "the answer should be trustworthy," engineering needs to know what that means. Does trustworthy mean sourced? Does it mean no unsupported claims? Does it mean confidence thresholds? Does it mean the product refuses to answer without approved documents?
Evals force the team to choose.
That choice is product strategy.
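For example, one team might decide that "trustworthy" means "every answer cites an approved source, or the product declines to answer." A sketch of that choice, with the source list and answer shape assumed for illustration:

```python
# One possible translation of "trustworthy": a claim-bearing answer must cite
# at least one approved source, otherwise the product refuses to answer.
# The approved-source list and answer structure are illustrative assumptions.

APPROVED_SOURCES = {"policy-handbook", "pricing-2025", "support-runbook"}

def gate_answer(answer: dict) -> dict:
    citations = set(answer.get("citations", []))
    if not citations & APPROVED_SOURCES:
        return {"status": "refused",
                "message": "No approved source supports this answer."}
    return {"status": "ok", "answer": answer["text"]}

# Example: an answer citing only an unknown source is refused, not shipped.
print(gate_answer({"text": "Refunds take 3 days.", "citations": ["old-wiki"]}))
```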
Release gates matter because models drift
AI products are exposed to drift from several directions.
The model vendor may update behavior. Retrieval content may change. Users may bring new inputs. Prompts may be edited. Data pipelines may break. A fine-tune may improve one segment and hurt another.
Without regression tests and release gates, the team learns from user pain.
That is a bad monitoring strategy.
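A sketch of what a release gate can look like as a regression suite, reusing the hypothetical helpers from the earlier sketches; the 0.95 threshold, file name, and module path are placeholders the team would set.

```python
# Sketch of a release gate as a pytest-style suite. In a real repo these
# helpers would live in the team's own eval module, e.g.:
# from your_eval_module import evaluate, GOLD_SET, run_model, check_summary_quality

import json

def test_release_threshold():
    results = evaluate(GOLD_SET, run_model, check_summary_quality)
    total = sum(len(v) for v in results.values())
    pass_rate = (total - len(results["auto_fail"])) / total
    assert pass_rate >= 0.95, f"Pass rate {pass_rate:.2%} is below the gate"

def test_known_failures_stay_fixed():
    with open("known_failures.json") as f:
        past_failures = json.load(f)  # list of previously fixed gold-set ids
    results = evaluate(GOLD_SET, run_model, check_summary_quality)
    failed_ids = {case_id for case_id, _ in results["auto_fail"]}
    regressions = failed_ids & set(past_failures)
    assert not regressions, f"Previously fixed cases failed again: {regressions}"
```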
Ownership after launch
Someone must own evaluation after launch.
Not just before launch. After launch.
Product should own the user-facing quality bar. Engineering or ML should own the technical evaluation infrastructure. Support should own the failure taxonomy and escalation patterns. Design should own interaction patterns for uncertainty and correction.
A lightweight cadence is enough to start: weekly review of eval failures, support spikes, and recent corrections; monthly review of the gold set, thresholds, cost, latency, and whether the product promise still matches real usage.
The exact structure can vary. The ownership cannot be vague.
When AI quality drops, "the model did it" is not an operating model.
Evals are also a moat
If competitors use similar models, better evals can become a durable advantage.
A team with strong domain evals can ship faster, catch regressions earlier, tune workflows more precisely, and build trust with customers. A team without evals is guessing.
The eval set becomes institutional memory: what the product has learned about good, bad, risky, useful, and unacceptable behavior.
The practical standard
Every serious AI feature should have:
- a defined product quality bar
- a representative gold set
- regression tests for known failures
- human review where judgment matters
- release gates for model/prompt/retrieval changes
- monitoring after launch
- a named owner
If that sounds heavy, good. AI products create operational responsibility.
Ship accordingly.
