Automation series #6: State, Idempotency, Retries, and Queues

The least glamorous parts of automation are usually the parts that decide whether it works.

State. Idempotency. Dedupe. Retries. Queues. Exception paths.

These are not backend trivia. They are the difference between a system that can be trusted and a system that occasionally does something weird and nobody knows why.

AI does not remove this problem. It makes it more important.

Automation needs durable state

A workflow needs to remember what happened.

Not in a chat transcript. Not in a browser session. Not in a worker's temporary memory. Durable state.

At minimum, store:

event ID
source system
received time
current workflow status
input payload reference
model output reference if used
confidence and gate decision
actions attempted
external write IDs
retry count
exception reason
final outcome

If the process crashes halfway through, the system should know where to resume or how to fail safely.
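
A minimal sketch of such a record, assuming SQLite as the durable store. The table name, field names, and the `save_record` / `load_record` helpers are illustrative choices, not a prescribed schema.

```python
import json
import sqlite3
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class WorkflowRecord:
    event_id: str
    source_system: str
    received_at: str
    status: str = "RECEIVED"
    input_ref: Optional[str] = None
    model_output_ref: Optional[str] = None
    confidence: Optional[float] = None
    gate_decision: Optional[str] = None
    actions_attempted: list = field(default_factory=list)
    external_write_ids: list = field(default_factory=list)
    retry_count: int = 0
    exception_reason: Optional[str] = None
    final_outcome: Optional[str] = None


def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_state ("
        "event_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def save_record(conn, record):
    # Replace the earlier snapshot so the latest status survives a crash.
    conn.execute(
        "INSERT OR REPLACE INTO workflow_state (event_id, body) VALUES (?, ?)",
        (record.event_id, json.dumps(asdict(record))),
    )
    conn.commit()


def load_record(conn, event_id):
    # A restarted worker calls this to find out where the item stopped.
    row = conn.execute(
        "SELECT body FROM workflow_state WHERE event_id = ?", (event_id,)
    ).fetchone()
    return WorkflowRecord(**json.loads(row[0])) if row else None
```

Saving after every status change is what lets a restarted worker call `load_record` and decide whether to resume or fail safely.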

Idempotency is how you survive duplicates

Real systems send duplicate events. Webhooks retry. Users double-submit. APIs time out after the action actually succeeded. Workers crash. Queues redeliver.

If your automation cannot handle duplicates, it will eventually double-send, double-charge, double-create, or double-update something.

An idempotency key is a stable identifier for an intended action. For example:

ticket_id + target_queue + classifier_version
invoice_id + paid_status + paid_at
customer_id + renewal_risk_assessment + week_start
document_id + extraction_schema_version

Before performing the action, check whether an action with that key has already completed. If it has, do not repeat the side effect. Model calls can usually be recomputed or cached safely, but external side effects such as sends, refunds, writes, and task creation need idempotency protection.
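
One way to sketch that check, assuming a SQLite table keyed on the idempotency key. The `idempotency_key` and `run_once` helpers and the `completed_actions` table are hypothetical names for illustration.

```python
import hashlib
import json
import sqlite3


def idempotency_key(*parts):
    # Serialize the stable parts of the intended action, then hash them
    # so the key has a fixed length regardless of the inputs.
    raw = json.dumps([str(p) for p in parts])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS completed_actions ("
        "key TEXT PRIMARY KEY, result TEXT)"
    )


def run_once(conn, key, side_effect):
    """Run side_effect only if no action with this key has completed."""
    row = conn.execute(
        "SELECT result FROM completed_actions WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # already done: return the stored result, no side effect
    result = side_effect()
    conn.execute(
        "INSERT INTO completed_actions (key, result) VALUES (?, ?)",
        (key, result),
    )
    conn.commit()
    return result
```

This version checks and then acts, so two concurrent workers could both pass the check. A production version would claim the key before acting, or pass the key to the downstream API when that API accepts one.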

Exactly-once is mostly a product requirement, not an infrastructure promise

People often say they need exactly-once processing.

Usually what they need is: "The customer must not experience duplicate side effects."

That is different.

Most practical systems are at-least-once somewhere. A message may be delivered more than once. A job may retry. An API call may be ambiguous after timeout. Your job is to make the external effect safe.

Design for at-least-once delivery and idempotent side effects.
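
The ambiguous-timeout case can be sketched like this. `FakePaymentsAPI` is a made-up stand-in for a downstream service that honors idempotency keys; it records the refund and then loses the first response, which is exactly the situation where a blind retry would refund twice.

```python
class FakePaymentsAPI:
    """Hypothetical downstream service that dedupes on an idempotency key."""

    def __init__(self):
        self.refunds = {}  # idempotency key -> refund details
        self.lose_next_response = True

    def refund(self, key, amount_cents):
        if key not in self.refunds:
            refund_id = f"refund-{len(self.refunds) + 1}"
            self.refunds[key] = {"id": refund_id, "amount_cents": amount_cents}
        if self.lose_next_response:
            # The refund was recorded, but the caller never hears back.
            self.lose_next_response = False
            raise TimeoutError("response lost")
        return self.refunds[key]["id"]


def refund_with_retry(api, key, amount_cents, max_attempts=3):
    # Retrying is safe only because every attempt reuses the same key.
    for attempt in range(1, max_attempts + 1):
        try:
            return api.refund(key, amount_cents)
        except TimeoutError:
            if attempt == max_attempts:
                raise


api = FakePaymentsAPI()
refund_id = refund_with_retry(api, "order-7:refund:v1", 500)
```

The call was delivered twice, but the customer sees one refund. That is the product requirement, met without any exactly-once guarantee from the transport.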

Retries need discipline

Retries are useful when failures are temporary.

They are dangerous when failures are logical.

Retry an API timeout. Do not blindly retry a malformed payload. Retry a rate limit after the right delay. Do not keep retrying forever when a model output fails schema validation. Retry a database deadlock. Do not retry an unauthorized action until permissions change.

A retry policy should define:

which errors are retryable
max attempts
backoff strategy
timeout
dead-letter or exception path
alerting threshold
human owner

The retry policy is part of the workflow contract.
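
A rough sketch of part of such a policy: retryable versus permanent errors, max attempts, capped exponential backoff with jitter, and a dead-letter hook. The exception classes, `RetryPolicy`, and `run_with_retries` are illustrative names, and the default numbers are placeholders. Timeouts, alerting, and ownership are left out.

```python
import random
import time
from dataclasses import dataclass


class RetryableError(Exception):
    """Temporary failure: timeout, rate limit, deadlock."""


class PermanentError(Exception):
    """Logical failure: malformed payload, schema failure, unauthorized."""


@dataclass
class RetryPolicy:
    max_attempts: int = 4
    base_delay_s: float = 0.5
    max_delay_s: float = 30.0

    def delay(self, attempt):
        # Exponential backoff with full jitter, capped at max_delay_s.
        ceiling = min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))
        return random.uniform(0, ceiling)


def run_with_retries(task, policy, on_dead_letter, sleep=time.sleep):
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return task()
        except PermanentError as exc:
            on_dead_letter(exc, attempt)  # logical failures are never retried
            return None
        except RetryableError as exc:
            if attempt == policy.max_attempts:
                on_dead_letter(exc, attempt)  # budget exhausted
                return None
            sleep(policy.delay(attempt))
```

Anything that is neither `RetryableError` nor `PermanentError` propagates to the caller, so an unclassified failure is loud instead of silently retried.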

Queues protect systems and people

Queues let automation absorb work without pretending everything must happen immediately.

Use queues for:

burst control
rate limits
long-running model calls
human review
downstream system outages
exception handling
dead-lettering items that cannot be processed safely
replay after a fix

A queue should not become a swamp. Every queue, including the dead-letter queue, needs an owner, SLA, dashboard, and escalation path.
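
To make dead-lettering and replay concrete, here is a small in-memory sketch. A real system would use a durable queue; `drain`, `replay`, and the attempt limit are assumptions for illustration.

```python
from collections import deque


def drain(queue, handler, dead_letters, max_attempts=3):
    """Process items until the queue is empty; park repeated failures."""
    while queue:
        item, attempts = queue.popleft()
        try:
            handler(item)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the reason so whoever owns the queue can triage it.
                dead_letters.append((item, repr(exc)))
            else:
                queue.append((item, attempts))  # back of the line


def replay(dead_letters, queue):
    """After a fix ships, move parked items back with a fresh attempt count."""
    while dead_letters:
        item, _reason = dead_letters.popleft()
        queue.append((item, 0))
```

A failing item goes to the back of the line instead of blocking the items behind it, and nothing is dropped: it is either processed or parked with a reason attached.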

State and retry diagram

A simple AI automation state flow:

```text
RECEIVED
  -> VALIDATED
  -> MODEL_REQUESTED
  -> MODEL_COMPLETED
  -> OUTPUT_VALIDATED
  -> GATE_DECIDED
       -> AUTO_ACTION_PENDING
            -> ACTION_COMPLETED
       -> HUMAN_REVIEW_PENDING
            -> HUMAN_APPROVED -> ACTION_COMPLETED
            -> HUMAN_REJECTED -> CLOSED
            -> HUMAN_ESCALATED -> ESCALATED
       -> EXCEPTION_PENDING
            -> RETRIED
            -> FAILED
```

Each transition should be logged. Each side effect should have an idempotency key. Each terminal state should be clear.
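
A sketch of enforcing and logging transitions. The table below is one reading of the flow above, with the three branches hanging off GATE_DECIDED. The flow does not say where RETRIED leads, so this sketch leaves it without a successor. `advance` and `InvalidTransition` are hypothetical names.

```python
TRANSITIONS = {
    "RECEIVED": {"VALIDATED"},
    "VALIDATED": {"MODEL_REQUESTED"},
    "MODEL_REQUESTED": {"MODEL_COMPLETED"},
    "MODEL_COMPLETED": {"OUTPUT_VALIDATED"},
    "OUTPUT_VALIDATED": {"GATE_DECIDED"},
    "GATE_DECIDED": {
        "AUTO_ACTION_PENDING",
        "HUMAN_REVIEW_PENDING",
        "EXCEPTION_PENDING",
    },
    "AUTO_ACTION_PENDING": {"ACTION_COMPLETED"},
    "HUMAN_REVIEW_PENDING": {"HUMAN_APPROVED", "HUMAN_REJECTED", "HUMAN_ESCALATED"},
    "HUMAN_APPROVED": {"ACTION_COMPLETED"},
    "HUMAN_REJECTED": {"CLOSED"},
    "HUMAN_ESCALATED": {"ESCALATED"},
    "EXCEPTION_PENDING": {"RETRIED", "FAILED"},
}

# Assumed terminal states; RETRIED is deliberately not listed.
TERMINAL = {"ACTION_COMPLETED", "CLOSED", "ESCALATED", "FAILED"}


class InvalidTransition(Exception):
    pass


def advance(record, new_status, log):
    """Move a record to new_status if the table allows it, and log the move."""
    current = record["status"]
    if new_status not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new_status}")
    log.append((record["event_id"], current, new_status))
    record["status"] = new_status
    return record
```

Rejecting moves that are not in the table means a buggy worker cannot quietly push an item from a terminal state back into the flow.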

Example: CRM enrichment

A RevOps team wants AI to enrich account records from website text and sales notes.

Safe design:

Store enrichment request with account ID and schema version.
Dedupe by account ID plus enrichment version.
Fetch source text deterministically.
Ask model to extract industry, company summary, ICP fit, and evidence.
Validate structured output.
If confidence is high, write suggested fields to a staging table.
If confidence is medium, send to review.
If confidence is low or schema validation fails, route to the exception path.
CRM update uses idempotency key and logs previous value.
Human can revert field changes.

The model does extraction. The workflow prevents chaos.
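
The validation and gating steps of this design could look roughly like the sketch below. The field names, the route labels, and the 0.85 / 0.6 thresholds are all assumptions; real thresholds would come from measuring the model against reviewed records.

```python
REQUIRED_FIELDS = {"industry", "company_summary", "icp_fit", "evidence", "confidence"}


def route_enrichment(output, high=0.85, medium=0.6):
    """Return STAGE, REVIEW, or EXCEPTION for one model output."""
    # Schema check first: a malformed output never reaches the gate.
    if not isinstance(output, dict) or not REQUIRED_FIELDS.issubset(output):
        return "EXCEPTION"
    confidence = output["confidence"]
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        return "EXCEPTION"
    if confidence >= high:
        return "STAGE"  # write suggested fields to the staging table
    if confidence >= medium:
        return "REVIEW"  # send to a human reviewer
    return "EXCEPTION"  # low confidence


sample = {
    "industry": "Logistics",
    "company_summary": "Regional freight broker",
    "icp_fit": "strong",
    "evidence": ["homepage text"],
    "confidence": 0.92,
}
decision = route_enrichment(sample)
```

The gate is plain deterministic code. The model only supplies the fields and the confidence value; it never decides what gets written to the CRM.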

The operator's rule

If you cannot answer "What happens if this runs twice?" you are not ready to launch.

If you cannot answer "Where is this item in the workflow?" you are not ready to scale.

AI automation still needs boring infrastructure. Especially AI automation.