The point of automation architecture is not to produce diagrams.
The point is to make better decisions before the workflow is in production, touching customers, writing to systems, and creating trust problems.
Use this worksheet before building or expanding an AI automation workflow. It is deliberately practical. If a team cannot answer these questions, the workflow is not ready.
1. Define the workflow
What is the workflow called?
What business outcome should it produce?
What starts it?
What is the source system?
What is the final desired state?
Who owns the workflow after launch?
What is explicitly out of scope?
If the answer is "the AI handles it," stop. That is not a workflow definition.
2. Classify the work
Break the workflow into steps.
For each step, mark the best owner:
| Step | Code | Model | Human | Reason |
|---|---:|---:|---:|---|
| Receive event | Yes | No | No | Deterministic |
| Validate input | Yes | No | No | Schema |
| Interpret messy text | No | Yes | Review edge cases | Ambiguous language |
| Apply policy threshold | Yes | No | Exceptions | Hard rule |
| Draft response | Support | Yes | Review if sensitive | Language generation |
| Send response | Yes | No | Yes if high risk | External action |
This is the automation boundary map.
3. Identify risk and reversibility
For each action, ask:
- What can go wrong?
- Who is affected?
- Is the action reversible?
- How quickly would we detect a mistake?
- What is the worst plausible outcome?
- Does this require approval?
Use this table:
| Action | Risk | Reversible? | Control |
|---|---|---:|---|
| Internal note | Low | Yes | Auto |
| CRM field update | Medium | Usually | Auto with audit and revert |
| Customer email draft | Medium | Yes | Human send |
| Customer email send | Medium/high | No | Confidence and policy gate |
| Refund | High | Sometimes | Approval threshold |
| Access change | High | Sometimes | Deterministic rule plus approval for exceptions |
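One way to make this table enforceable is to express it as data a decision gate can read. The action names, risk labels, and control names below are illustrative, not a fixed vocabulary:

```python
# The risk table above, expressed as data. All names are illustrative.
CONTROLS = {
    "internal_note":    {"risk": "low",    "reversible": True,  "control": "auto"},
    "crm_field_update": {"risk": "medium", "reversible": True,  "control": "auto_with_audit"},
    "email_draft":      {"risk": "medium", "reversible": True,  "control": "human_send"},
    "email_send":       {"risk": "medium_high", "reversible": False, "control": "confidence_policy_gate"},
    "refund":           {"risk": "high",   "reversible": False, "control": "approval_threshold"},
    "access_change":    {"risk": "high",   "reversible": False, "control": "rule_plus_approval"},
}

def control_for(action: str) -> str:
    # Unknown actions default to the safest control: human approval.
    return CONTROLS.get(action, {"control": "human_approval"})["control"]
```

The useful property is the default: an action the table does not know about falls back to human approval instead of running unchecked.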
4. Define model jobs narrowly
For every AI step, write:
- job type: classify, extract, draft, summarize, route, recommend
- input contract
- output schema
- allowed categories
- confidence expectations
- rationale requirement
- review triggers
- prompt version
- model version
- evaluation set
Example:
```json
{
  "job": "classify_support_ticket",
  "allowed_categories": ["billing", "product", "bug", "legal", "security", "unknown"],
  "output_schema": {
    "category": "string",
    "confidence": "number",
    "rationale": "string",
    "requires_review": "boolean"
  },
  "review_triggers": ["confidence < 0.85", "category in legal/security", "category == unknown"]
}
```
If the model job cannot be described this way, it is probably too broad.
Also define cost and latency expectations:
- expected monthly volume
- maximum acceptable model cost per item or per month
- latency budget for normal and peak periods
- fallback behavior if the model provider is slow or unavailable
- when deterministic pre-filters should skip the model call
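The last point is worth a sketch: a deterministic pre-filter sits in front of the model call and skips it whenever a rule already decides the outcome. The subjects, field names, and stubbed model call below are assumptions for illustration:

```python
# Hypothetical pre-filter: skip the model call when a deterministic
# rule already decides the outcome. All names are illustrative.

AUTO_CLOSE_SUBJECTS = {"unsubscribe", "out of office"}

def classify_ticket(ticket: dict) -> dict:
    subject = ticket.get("subject", "").strip().lower()
    # Deterministic rule: known-pattern tickets never reach the model.
    if subject in AUTO_CLOSE_SUBJECTS:
        return {"category": "auto_close", "confidence": 1.0,
                "source": "rule", "requires_review": False}
    # Empty input fails validation before spending model budget.
    if not ticket.get("body"):
        return {"category": "unknown", "confidence": 0.0,
                "source": "rule", "requires_review": True}
    return call_model(ticket)  # the metered, slower path

def call_model(ticket: dict) -> dict:
    # Stub standing in for the real model call.
    return {"category": "product", "confidence": 0.9,
            "source": "model", "requires_review": False}
```

Every ticket the pre-filter catches is a model call you do not pay for and a latency budget you do not spend.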
5. Set decision gates
Define what happens after the AI step.
| Condition | Action |
|---|---|
| Output fails schema | Retry once, then exception |
| Confidence high and low risk | Auto proceed |
| Confidence medium | Human review or sampled review |
| Confidence low | Exception queue |
| Sensitive category | Human approval |
| Irreversible action | Human approval or staged action |
Do not collect confidence if the workflow ignores it.
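The gate table above can be collapsed into one small function. The thresholds, category names, and risk labels here are assumptions; tune them per workflow:

```python
# A minimal sketch of the decision gate. Thresholds and category
# names are illustrative; set them per workflow.

SENSITIVE = {"legal", "security"}

def gate(output, risk: str) -> str:
    """Map a parsed model output to the next action."""
    if output is None:  # schema validation failed upstream
        return "retry_then_exception"
    if output["category"] in SENSITIVE:
        return "human_approval"
    confidence = output["confidence"]
    if confidence >= 0.85 and risk == "low":
        return "auto_proceed"
    if confidence >= 0.60:
        return "human_review"
    return "exception_queue"
```

Note that confidence alone never authorizes an action: a high-confidence output on a medium-risk action still goes to review.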
6. Design state and retries
Answer:
- What is the workflow ID?
- What is the source event ID?
- What is the idempotency key?
- What states can the workflow enter?
- Which errors are retryable?
- What is the max retry count?
- Where do exceptions go?
- Who owns the exception queue?
- What is the SLA?
Minimum states:
```text
RECEIVED -> VALIDATED -> AI_COMPLETED -> GATE_DECIDED -> ACTION_COMPLETED
                                                      -> HUMAN_REVIEW_PENDING
                                                      -> EXCEPTION_PENDING
                                                      -> FAILED
```
The exact names matter less than having explicit states.
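One way to make states explicit is an allowed-transition map that refuses anything not listed. The transitions below are one plausible wiring of the states sketched above, not a prescription:

```python
# Explicit state machine: every legal transition is listed, everything
# else raises. Transitions here are illustrative.

TRANSITIONS = {
    "RECEIVED": {"VALIDATED", "FAILED"},
    "VALIDATED": {"AI_COMPLETED", "EXCEPTION_PENDING", "FAILED"},
    "AI_COMPLETED": {"GATE_DECIDED", "FAILED"},
    "GATE_DECIDED": {"ACTION_COMPLETED", "HUMAN_REVIEW_PENDING",
                     "EXCEPTION_PENDING"},
    "HUMAN_REVIEW_PENDING": {"ACTION_COMPLETED", "EXCEPTION_PENDING"},
    "EXCEPTION_PENDING": {"VALIDATED", "FAILED"},
}

def advance(current: str, nxt: str) -> str:
    """Refuse transitions the map does not allow."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

The payoff is debuggability: a workflow that jumps from RECEIVED to ACTION_COMPLETED fails loudly instead of silently skipping the gate.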
7. Define observability
Capture:
- workflow ID
- event ID
- input reference
- schema version
- prompt version
- model version
- model output reference
- parsed output
- confidence score
- validation result
- gate decision
- human review decision
- external side effect IDs
- retry history
- final status
Ask one blunt question: if this fails next month, can we explain what happened in ten minutes?
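A decision record carrying the fields above might look like the following. Every identifier here (workflow name, versions, side-effect IDs) is a made-up placeholder; the point is the shape, not the values:

```python
# Illustrative decision record: enough context to reconstruct what
# happened without re-running anything. All values are placeholders.
record = {
    "workflow_id": "support_triage_v2",
    "event_id": "evt_8231",
    "schema_version": "1.3",
    "prompt_version": "triage-2024-05",
    "model_version": "model-x-2024-04",
    "parsed_output": {"category": "billing", "confidence": 0.91},
    "validation": "pass",
    "gate_decision": "auto_proceed",
    "human_review": None,          # not triggered for this item
    "side_effect_ids": ["crm_update_5512"],
    "retries": 0,
    "final_status": "ACTION_COMPLETED",
}
```

If a record like this exists for every item, the ten-minute explanation is a query, not an archaeology project.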
8. Build evaluation loops
Define:
- gold set examples
- sampled review rate
- failure taxonomy
- regression tests
- drift checks
- review cadence
- owner for prompt/policy changes
Evaluation is not a research project. It is maintenance for a production workflow.
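In its simplest form, the gold set is a frozen list of examples with expected labels, run before any prompt or model change ships. The examples, the stubbed classifier, and the 95% threshold below are all assumptions:

```python
# Gold-set regression check in its simplest form. The classifier is a
# stub; in practice it wraps the real model call under test.

GOLD_SET = [
    ({"body": "I was charged twice"}, "billing"),
    ({"body": "App crashes on login"}, "bug"),
]

def classify(ticket: dict) -> str:
    # Stub standing in for the model call being evaluated.
    return "billing" if "charged" in ticket["body"] else "bug"

def regression_pass_rate(gold_set, min_rate=0.95):
    """Return (pass rate, whether the change is safe to ship)."""
    hits = sum(classify(ticket) == label for ticket, label in gold_set)
    rate = hits / len(gold_set)
    return rate, rate >= min_rate
```

A failing rate blocks the prompt change the same way a failing unit test blocks a code change.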
9. Define security controls
Confirm:
- secrets are excluded from prompts and logs
- credentials are scoped to the workflow, tenant, and action
- retrieved documents and tool outputs are treated as untrusted input
- prompt-injection controls exist for agent/tool workflows
- external actions require approval, staging, or rate limits based on risk
- logs are redacted where full content is unnecessary
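The redaction point can be sketched as a small pass applied to text before it is logged. The patterns below are illustrative and deliberately incomplete; a production redactor needs a reviewed, tested pattern set:

```python
import re

# Minimal redaction pass before logging. Patterns are illustrative,
# not exhaustive: real deployments need a reviewed pattern set.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"(?i)bearer\s+\S+"), "<token>"),
]

def redact(text: str) -> str:
    """Replace matches of each known-sensitive pattern."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Redacting at the logging boundary means a single missed call site leaks data, so the safest designs route all log writes through this one function.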
10. Assign ownership after launch
Every automation needs an owner.
Not just a builder. An owner.
Ownership includes:
- monitoring dashboards
- reviewing exceptions
- maintaining policies
- approving prompt/model changes
- reviewing failure patterns
- communicating incidents
- naming an incident owner when something breaks
- weekly review for new workflows, then monthly review once stable
- a retirement trigger, such as sustained low usage, high exception rate, or policy mismatch
- retiring or redesigning decayed workflows
If nobody owns it, the workflow will decay silently.
11. Decide the launch shape
Choose one:
- shadow mode: automation recommends, humans act
- draft mode: automation creates drafts, humans send
- assisted mode: automation acts on low-risk cases, reviews the rest
- full automation: only for low-risk, well-observed, reversible work
Most serious AI automation should not start at full automation. Earn autonomy.
Final operator check
Before launch, confirm:
- deterministic parts are handled by code
- AI steps are bounded
- humans own accountability points
- confidence gates change behavior
- state is durable
- side effects are idempotent
- retries are controlled
- logs can reconstruct decisions
- security follows least privilege
- prompts, logs, and tool calls exclude unnecessary sensitive data
- external actions are gated
- evaluation loops exist
- owner is named
If this feels like too much, that is the signal.
The work is not just making AI do something once.
The work is making the workflow reliable enough that people keep trusting it after the first weird edge case.
