Agentic AI Turns Work Into Delegation

OpenAI’s Codex usage data suggests that the agentic shift is more than a better coding assistant; it is a workflow change. People are handing off longer tasks, managing multiple agents, and saving their own procedures as reusable skills.

Source note: Drew Johnston, David Holtz, Alex Martin Richmond, Christopher Ong, Prasanna Tambe, and Aaron Chatterji. “The Shift to Agentic AI: Evidence from Codex.” OpenAI, 2026. https://cdn.openai.com/pdf/5d1e1489-21c0-43e4-9d42-f87efdbf0082/the-shift-to-agentic-ai-evidence-from-codex.pdf

Why This Paper Matters

Most AI adoption studies still rely on chatbot metrics: users, message volume, and token counts. Those numbers made sense when the software was conversational. A user asked a question, the model answered, and the exchange was easy to track.

Agentic AI changes the unit of analysis. An agent does more than answer questions. It operates across files, tools, and repositories over time. It can run for hours, produce artifacts that require review, and execute alongside other concurrent agents. Standard chat dashboards do not capture this activity.

The study evaluates this shift using data from Codex, OpenAI’s software engineering agent. The dataset covers usage across personal accounts, organizational accounts, and OpenAI’s internal staff.

The main takeaway is not adoption volume, but how work changes: as agentic systems become useful, the pattern shifts from asking questions to delegating workflows.

The Idea in Plain English

ChatGPT made AI feel like a fast colleague in a chat box. Codex functions more like a junior operator with access to a workspace. Instead of asking for explanations, the user assigns work, checks results, corrects direction, and manages multiple threads.

This distinction changes the mechanics of productivity.

In a chatbot model, productivity comes from faster answers. In an agentic model, it depends on delegation. The focus shifts to defining what work is safe to hand off, what context the agent requires, how outputs are reviewed, and which procedures should be saved for future runs.

The study treats these questions as measurable behaviors: task complexity, runtime, concurrency, workflow reuse, and skill adoption. These metrics reflect the actual operating model of agentic work better than raw message counts.

What the Researchers Tested

The researchers analyzed anonymized Codex usage data across three groups:

Individual users on personal accounts.
Organizational users on work accounts.
OpenAI employees using Codex internally.

The study used automated, privacy-preserving classifiers to label tasks, estimate complexity, identify roles, and map usage patterns. The researchers did not read raw user conversations. Where possible, they compared Codex usage with work-related ChatGPT usage.

Codex is more than an autocomplete utility: it can read codebases, write files, run long tasks, and execute saved skills. That lets the researchers ask different questions than a standard software adoption study would:

Are users returning to the tool, or trying it once?
Are non-engineering roles adopting it?
Are tasks becoming longer and more complex?
Are users running agents in parallel?
Are reusable skills integrated into workflows?

This makes the paper less a model evaluation and more an early map of how agentic work behaves in practice.

What They Found

The findings focus on task depth, complexity, concurrency, and workflow reuse. Codex adoption is unevenly distributed, but where users adopt it, it represents a large share of their AI-assisted work.

1. Adoption Is Narrower Than the Hype, but Deep Among Users

Active Codex users grew fivefold in the first half of 2026. While that is the headline growth rate, the more useful distinction is between broad trial and intensive use.

Fewer than 1% of active individual users ran a Codex task in the last 28 days. Among organizational users, active adoption was 17.3%. The gap suggests that the personal market remains very early.

However, where Codex is adopted, it accounts for a large share of AI output, and inside organizations it is the majority. For organizational users, Codex generated 63.3% of combined Codex and ChatGPT output tokens. For individual users, the share was 16.5%.
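The share metric here is a simple ratio, and a minimal sketch makes the definition concrete. The function name and the token counts below are illustrative assumptions, not values or code from the paper; the counts are chosen only to reproduce the reported 63.3% figure.

```python
def codex_token_share(codex_tokens: int, chatgpt_tokens: int) -> float:
    """Return Codex's fraction of combined Codex + ChatGPT output tokens."""
    total = codex_tokens + chatgpt_tokens
    if total == 0:
        raise ValueError("no output tokens recorded")
    return codex_tokens / total


# Hypothetical counts, scaled so the ratio matches the reported organizational share.
share = codex_token_share(codex_tokens=633, chatgpt_tokens=367)
print(f"{share:.1%}")  # prints "63.3%"
```

Because the denominator includes only users' Codex and ChatGPT output, a high share says where a user's AI output comes from, not how much total work that user delegates.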

At OpenAI, Codex generated 99.8% of combined tokens as of June 11, 2026. The researchers note that OpenAI is an outlier: employees face low friction, have high model familiarity, share internal knowledge, and have direct feedback loops with developers.

While OpenAI represents a frontier case rather than the market average, it shows how workflows evolve when organizational barriers are removed.

2. The Agentic Shift Is Moving Beyond Software Engineers

Codex began as a software engineering agent, so engineers naturally adopted first. However, the data shows adoption extending beyond developers.

Inside OpenAI, departments like legal and recruiting went from near-zero Codex usage in January 2026 to 20% by early April, and reached 75% a month later. By June, nearly all departments crossed 90% usage by token share.

This does not mean recruiters are writing production code. Instead, the boundary of what defines a “coding agent” is expanding. An agent that reads files, structures data, runs workflows, and manages context operates more like a general-purpose assistant than a code generator.

The study’s task categories reflect this, spanning code implementation, research, business workflows, document drafting, and data analysis.

Once an agent can act directly within a workspace, “software development” expands to cover standard office information work.

3. Tasks Are Getting Longer and More Delegated

Task complexity is shifting. Among individual users with at least one task, the share of tasks estimated to take over an hour rose from 35.4% in December 2025 to 70.2% in May 2026. Tasks estimated to take over eight hours rose from 2.1% to 25.6%.

This contrasts with simple requests for code snippets or text summaries. Users are delegating work that requires hours of human execution.

These complex tasks often appear at the start of a session. A user’s first prompt is twice as likely as the fourth to request a task lasting over an hour, suggesting that users open sessions by delegating the entire job.

This makes prompt volume a poor metric for AI usage; a single request can trigger hours of background work.

4. Advanced Users Manage Agents in Parallel

Most external users still run Codex sequentially. Approximately 67% of organizational users and 64% of individual users did not use concurrent sessions.

OpenAI staff work differently: only 10.7% ran a single workflow at a time, while 28.6% managed five or more concurrent agents during the study week.

Concurrency is a clear behavioral sign that a user has shifted from asking questions to managing processes. Coordinating multiple active agents resembles running a production queue rather than chatting with a model.

The runtime data matches this pattern. The median OpenAI employee ran Codex tasks for 2.5 hours on June 11, 2026. At the 99th percentile, active usage reached 71 hours of runtime per day. Because a day has only 24 hours, that volume is possible only when users run multiple workflows in parallel.
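The arithmetic behind that inference can be sketched briefly. This assumes the 71-hour figure is total task runtime summed within a single 24-hour day, which is how the summary reads; the function name is hypothetical.

```python
import math

HOURS_PER_DAY = 24


def min_average_concurrency(runtime_hours: float) -> float:
    """Lower bound on the average number of agents running at once over a full day."""
    return runtime_hours / HOURS_PER_DAY


avg = min_average_concurrency(71)  # 71 / 24 is roughly 2.96
# Concurrency is a whole number, so the peak must be at least the ceiling of the average.
at_least = math.ceil(avg)
print(f"average of {avg:.2f} agents, so at least {at_least} ran at the same time")
```

This is only a lower bound. A user who was active for fewer than 24 hours would have needed more overlapping agents to accumulate the same runtime.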

5. Skills Turn Procedures Into Infrastructure

Skill usage is a significant indicator of workflow integration. The share of tasks leveraging saved skills rose from 5.4% in March 2026 to 26.6% by June 2026.

During the study week, 25.7% of active individuals and 30.4% of organizations invoked at least one skill. Among OpenAI workers, skill usage reached 96.2%.

The study defines skills as saved procedural context, such as style guides, reporting templates, and team workflows. These can be packaged alongside integrations, MCP servers, and local assets.

This shifts where organizational knowledge resides. Procedures previously documented in static wikis or shared in Slack threads become saved context that an agent can act on directly.

If organizations treat skills merely as long prompts, they will miss the efficiency of structured workflow infrastructure.

Why It Happens

This shift is driven by three factors: agent capability, contextual integration, and workflow reorganization.

First, agentic systems manipulate workspaces rather than just returning text. When a model can inspect files, edit code, and run tests, users can hand off complete tasks.

Second, delegation requires context. An agent is more effective when granted access to code repositories, issue trackers, and team conventions. This context is easier to provision and standardize in organizational environments than on personal accounts.

Third, adoption requires a behavioral shift. Users must stop treating AI as a search interface and start managing it as an execution layer. This demands clear assignments, parallel execution, output review, and procedural library building.

The paper compares this to historical general-purpose technologies. Electrification did not yield its full productivity gains when factories merely replaced steam engines with electric motors. The major gains occurred when managers redesigned factory layouts around electrical power. Similarly, agents provide minimal value when forced into legacy workflows; they require processes designed around delegation.

What This Means for Builders

Agentic products require different designs and metrics than chat applications.

Builders must support the full delegation lifecycle:

Contextual assignment: Provide tools to feed files, repos, and guidelines to the agent.
Extended runtimes: Allow agents to run for hours without timing out.
State visibility: Display intermediate steps and progress in real time.
Review interfaces: Make generated code and documents easy to inspect and approve.
Concurrency control: Let users run and monitor multiple agents in parallel.
Skill management: Provide tools to version, edit, and share reusable workflows.
Auditability: Maintain clear logs and support full state rollback.

The user interface will look less like a blank chat input and more like a console for managing active jobs. The chat interface remains useful, but it is no longer the central element.

If skills are difficult to write, debug, or inspect, they devolve into prompt sprawl. If they are managed as versioned workflow assets, their value compounds. Organizations that systematically translate their procedures into agent-readable skills will extract more value from standard models.
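One way to picture a skill as a versioned asset rather than a loose prompt is the sketch below. It is not Codex's skill format, which this summary does not describe; the `Skill` and `SkillRegistry` classes, their fields, and the sample skill are all hypothetical.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Skill:
    """A hypothetical skill record: saved procedural context plus ownership and a version."""
    name: str
    version: str
    owner: str
    instructions: str
    assets: tuple[str, ...] = ()

    def fingerprint(self) -> str:
        """Short content hash so reviewers can tell whether two versions really differ."""
        payload = "\n".join([self.name, self.version, self.instructions, *self.assets])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class SkillRegistry:
    """In-memory registry that refuses to overwrite a published version."""

    def __init__(self) -> None:
        self._skills: dict[tuple[str, str], Skill] = {}

    def publish(self, skill: Skill) -> None:
        key = (skill.name, skill.version)
        if key in self._skills:
            raise ValueError(f"{skill.name} {skill.version} already published; bump the version")
        self._skills[key] = skill

    def history(self, name: str) -> list[Skill]:
        """Return every published version of a skill, in publication order."""
        return [s for (n, _), s in self._skills.items() if n == name]


registry = SkillRegistry()
registry.publish(Skill("weekly-report", "1.0.0", "ops", "Use the team reporting template."))
registry.publish(Skill("weekly-report", "1.1.0", "ops", "Use the team reporting template. Cite data sources."))
print([s.version for s in registry.history("weekly-report")])
```

The design choice that matters is immutability: a published version cannot be silently edited, so a change to a procedure shows up as a new version that can be reviewed, compared, and rolled back.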

Builders must also change their performance indicators. Monthly active users and message counts are insufficient. Useful metrics include run duration, completed artifacts, code review pass rates, rollback frequency, concurrency levels, and end-to-end task cycle time.
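A few of these indicators can be computed from basic run logs. The sketch below assumes a hypothetical `Run` record with start and end times plus review and rollback flags; the field names and sample data are illustrative and do not come from the paper or from any Codex API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One agent run (hypothetical schema). Times are hours since midnight."""
    start_h: float
    end_h: float
    review_passed: bool
    rolled_back: bool


def peak_concurrency(runs: list[Run]) -> int:
    """Largest number of runs active at the same moment, via a sweep over start/end events."""
    events = []
    for r in runs:
        events.append((r.start_h, 1))
        events.append((r.end_h, -1))
    # At equal timestamps, ends (-1) sort before starts (+1), so back-to-back runs do not overlap.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def summarize(runs: list[Run]) -> dict[str, float]:
    """Workflow-level metrics for a non-empty batch of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    durations = [r.end_h - r.start_h for r in runs]
    return {
        "median_run_hours": median(durations),
        "total_run_hours": sum(durations),
        "review_pass_rate": sum(r.review_passed for r in runs) / len(runs),
        "rollback_rate": sum(r.rolled_back for r in runs) / len(runs),
        "peak_concurrency": peak_concurrency(runs),
    }


runs = [
    Run(9.0, 11.0, review_passed=True, rolled_back=False),
    Run(10.0, 13.0, review_passed=True, rolled_back=False),
    Run(10.5, 11.0, review_passed=False, rolled_back=True),
    Run(13.0, 14.0, review_passed=True, rolled_back=False),
]
print(summarize(runs))
```

For this sample the summary reports a median run of 1.5 hours, 6.5 total run hours, a 75% review pass rate, a 25% rollback rate, and a peak of 3 concurrent runs. None of these values would be visible in a message count, which would record four prompts.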

The paper implies a clear warning: product teams optimizing solely for conversational loops will miss the value created when users manage agents as a system of work.

What This Means for Buyers and Operators

For buyers and operators, adoption is not the same as workflow transformation.

Purchasing agent licenses is straightforward. The challenge lies in defining which tasks are ready for delegation, what artifacts the agent must output, who performs the review, what access permissions are required, and which workflows should be turned into saved skills.

OpenAI’s data shows what is possible when operational friction is low. However, typical organizations operate under different constraints: legacy systems, strict security policies, varying technical literacy, complex access permissions, and management cultures that track visible effort.

For most teams, the practical implementation path is specific:

Target inspectable outputs: Start with workflows that produce clear, testable results.
Focus on structured domains: Choose areas with existing review gates, automated tests, or checklists.
Train for delegation: Teach employees how to write project requirements rather than simple text prompts.
Standardize common skills: Build a shared library of approved team workflows instead of allowing ad-hoc prompt engineering.
Track output metrics: Measure completed tasks and approved changes rather than raw system usage.

Operators should expect roles to change before organizational charts do. High-performing employees will transition into managers of parallel agents. Their time will shift toward scoping tasks, selecting context, reviewing outputs, and maintaining reusable procedures. This is a management skill, even for individual contributors.

This introduces a new bottleneck: when agents execute more tasks, the constraint shifts to scoping, output review, and system integration. Organizations that fail to build these verification capabilities will generate higher activity without corresponding output value.

What to Watch Next

A key question is whether external organizations can achieve OpenAI’s usage density by investing in training, context integration, and workflow redesign.

Several indicators will track this transition.

First, concurrency rates in standard enterprises. If typical users start running multiple agents in parallel, it will signal that the delegation model has moved beyond early adopters.

Second, skill governance. If saved skills become structured, version-controlled assets, they will function as a new type of internal software. If they remain undocumented prompt snippets, the workflow layer will become fragile.

Third, verification interfaces. Inspecting code diffs, data sources, and agent decisions at scale requires new review tooling. The more work agents perform, the more important the review interface becomes.

Fourth, economic metrics. As agents run longer tasks, traditional consumption metrics lose utility. The industry needs new measures focused on completed work, output quality, human verification time, and organizational efficiency.

Finally, job design. The most effective users will not be those who write the most prompts. They will be the operators who can decompose projects, coordinate parallel runs, and convert tacit team processes into reusable agent context.

Limitations and Caveats

While this study provides empirical data, it is not a direct measure of universal productivity gains.

First, Codex is a specialized tool tailored for software engineering. Its usage patterns may not translate to other types of knowledge work or different agent designs.

Second, OpenAI’s internal staff is an unrepresentative sample. Employees have deep familiarity with the models, minimal infrastructure friction, and direct access to product developers.

Third, the study is observational. It describes how users interact with Codex but does not measure net productivity changes or task quality.

Fourth, token volume and task runtimes are proxy metrics. They signal intensity and complexity but do not directly measure quality, business value, or actual human hours saved.

Fifth, the automated classifiers used to categorize tasks can misclassify them. The broad patterns are likely robust, but individual percentage splits should be treated as estimates.

The core takeaway is not that agents have already restructured the economy, but that agentic work has distinct behavioral signatures: multi-hour runtimes, concurrent execution, reusable skills, and a shift from chat metrics to workflow metrics.

Source

Drew Johnston, David Holtz, Alex Martin Richmond, Christopher Ong, Prasanna Tambe, and Aaron Chatterji. (2026). The Shift to Agentic AI: Evidence from Codex. OpenAI. Available at: https://cdn.openai.com/pdf/5d1e1489-21c0-43e4-9d42-f87efdbf0082/the-shift-to-agentic-ai-evidence-from-codex.pdf