The paper’s practical point: useful AI systems are increasingly defined by the machinery around the model, especially the machinery that decides what information the model gets to see.
Source note: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu. “A Survey of Context Engineering for Large Language Models.” arXiv:2507.13334, 2025-07-21. https://arxiv.org/abs/2507.13334
Why This Paper Matters
For a while, “prompt engineering” carried too much weight.
It had to describe a lot of different work: writing better instructions, choosing examples, adding retrieved documents, structuring tool calls, preserving memory, routing between agents, managing long context windows, and keeping the model from losing track of the task.
That label is now too small.
This survey argues for a broader term: context engineering. The phrase is useful because it moves attention away from the sentence-level craft of prompting and toward the system-level problem underneath it. The important question is not only “What words should we put in the prompt?” It is “What information should reach the model, in what structure, at what moment, and under what cost, latency, privacy, and reliability constraints?”
That is the real design surface for modern AI products.
An agent does not just receive a user message. It receives system instructions, tool definitions, retrieved documents, prior conversation, task state, memory, examples, intermediate reasoning artifacts, policy constraints, structured data, and sometimes messages from other agents. All of that is context. All of it competes for attention. All of it can help or hurt.
This paper matters because it names the shift clearly. The next phase of AI systems will not be won only by bigger models or prettier prompts. It will be won by teams that can manage context as a production resource.
The Idea in Plain English
The simplest way to read the paper is this: context is the runtime environment of the model.
The model is the engine, but the engine only sees the world through the context you assemble for it. If the context is stale, noisy, incomplete, badly ordered, too long, too short, or missing the right tool definition, the model will behave badly even if the underlying model is strong.
Prompt engineering treats the input as a mostly static string. Context engineering treats the input as a dynamically assembled package.
That package can include instructions, retrieved knowledge, memory, tool schemas, user state, world state, examples, plans, scratchpads, citations, and the user’s immediate request. The job is to select and arrange those pieces so the model has the highest chance of producing the right output without wasting tokens or introducing risk.
In product terms, context engineering is the work between the user interface and the model call.
In infrastructure terms, it is retrieval, ranking, compression, memory, cache management, tool exposure, state tracking, orchestration, and evaluation.
In operating terms, it is the discipline of deciding what the model needs to know right now, and what should stay outside the context window until it is actually needed.
What the Researchers Tested
This is a survey paper, not a new model or benchmark.
The authors review more than 1,400 research papers and organize the field into a taxonomy. Their main contribution is a map of the territory.
The taxonomy has two levels.
The first level is foundational components:
Context retrieval and generation covers how systems create or fetch useful material. This includes prompt engineering, retrieval-augmented generation, external knowledge retrieval, and dynamic context assembly.
Context processing covers what happens once information is available. This includes long-context processing, self-refinement, multimodal context, structured context, and relational data such as knowledge graphs.
Context management covers the constraints of the context window itself. This includes memory hierarchies, storage architectures, context compression, KV-cache management, and ways to prevent useful information from being lost or drowned out.
The second level is system implementations:
RAG systems combine retrieval and generation.
Memory systems preserve useful information across interactions.
Tool-integrated reasoning gives models structured access to external actions and environments.
Multi-agent systems coordinate context across multiple model-driven actors.
The paper’s useful move is that it treats those areas as related parts of one discipline rather than separate research islands.
What They Found
Context is no longer one string
The paper formalizes context as a structured set of components rather than a single prompt.
That sounds academic, but the distinction matters. A real AI system assembles context from many places: the user query, system rules, retrieved documents, memory, tools, task state, and external data. Each component has a source, a format, a cost, a reliability profile, and a reason for being included.
Once you see context this way, prompt writing becomes only one part of the job. The harder work is building the functions that retrieve, select, compress, format, and order those components.
That is where the system either becomes robust or starts to rot.
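To make the retrieve, select, format, and order steps concrete, here is a minimal sketch of context assembly under a token budget. Everything in it is an illustrative assumption rather than something the paper prescribes: the `Component` type, the priority convention, and the whitespace-based `count_tokens` stand-in for a real tokenizer are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    kind: str      # e.g. "instruction", "evidence", "memory" (hypothetical labels)
    text: str
    priority: int  # lower number = more important (assumed convention)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def assemble(components: list[Component], budget: int) -> str:
    """Pick components in priority order until the budget is spent, then format them."""
    chosen = []
    used = 0
    for comp in sorted(components, key=lambda c: c.priority):
        cost = count_tokens(comp.text)
        if used + cost > budget:
            continue  # skip what does not fit; a real system might compress instead
        chosen.append(comp)
        used += cost
    # One labeled section per component, most important first.
    return "\n\n".join(f"[{c.kind}]\n{c.text}" for c in chosen)


components = [
    Component("instruction", "Answer using only the evidence provided.", 0),
    Component("user_request", "What is the refund window?", 1),
    Component("evidence", "Refunds are accepted within 30 days of purchase.", 2),
    Component("memory", "User previously asked about shipping times to Canada last week.", 3),
]

prompt = assemble(components, budget=20)
print(prompt)
```

With a budget of 20 word-tokens, the first three components fit and the low-priority memory item is left out. A production version would likely swap the greedy loop for something smarter, but the shape is the same: typed inputs, an explicit budget, and a deterministic formatting step.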
Longer context is not automatically better
The paper is blunt about the limits of long context.
Longer context windows are useful, but they do not erase the core problem. More context can mean more latency, more cost, more KV-cache pressure, and more noise. Standard transformer self-attention also scales quadratically with sequence length, which makes very long inputs expensive to process.
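A quick piece of arithmetic shows why length is expensive. Standard self-attention computes a score for every pair of tokens, so the number of score entries grows with the square of the input length. The sketch below only counts those entries; it ignores heads, layers, and any efficient-attention variants, and the sequence lengths are arbitrary examples.

```python
def attention_score_entries(seq_len: int) -> int:
    # Full self-attention compares every token with every token: n * n scores.
    return seq_len * seq_len


baseline = attention_score_entries(4_000)
for n in (4_000, 32_000, 128_000):
    ratio = attention_score_entries(n) / baseline
    print(f"{n:>7} tokens -> {ratio:,.0f}x the score entries of a 4,000-token input")
```

An input that is 32 times longer produces 1,024 times as many pairwise scores, which is why "just send everything" stops being a neutral choice.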
There is also a behavioral problem. Models can struggle to use information placed in the middle of long contexts, the familiar lost-in-the-middle effect. The paper also discusses context overflow, where useful information falls out of the window, and context collapse, where accumulated memory makes the model less able to distinguish which conversation, task, or state actually matters.
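One common mitigation for the lost-in-the-middle effect is to reorder retrieved items so the strongest ones sit at the start and end of the context and the weakest land in the middle. The sketch below is a heuristic under that assumption, not a method taken from the paper, and whether it helps depends on the model and task. The function name `reorder_for_edges` is hypothetical.

```python
def reorder_for_edges(ranked: list[str]) -> list[str]:
    """Given items ranked best-first, alternate them between the front and the
    back of the list so the weakest items end up in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item is the very last one.
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant, d5 the least
print(reorder_for_edges(docs))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The best document opens the context, the second best closes it, and the least relevant one sits in the position the model is most likely to underuse.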
This is the mistake many teams make with AI systems: they treat the context window like storage.
It is not storage. It is working memory. Working memory needs curation.
Memory is infrastructure, not chat history
The survey puts memory systems at the center of context engineering.
That is the right place for them. Memory is not just “append the previous conversation.” Good memory systems decide what to store, what to summarize, what to retrieve, what to forget, and what to keep outside the hot path.
The paper discusses OS-inspired memory hierarchies, external storage, paging-like approaches, forgetting mechanisms, memory strength, and retrieval-oriented architectures. The details vary, but the pattern is consistent: the model’s native context window is too limited and too expensive to serve as the whole memory system.
For builders, this changes the architecture. A serious agent needs a memory layer with policies, not a bigger chat transcript.
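A memory layer with policies can be sketched in a few dozen lines. The example below is a toy under stated assumptions: memory strength decays with a fixed half-life, recall reinforces an item, and anything that falls below a threshold is forgotten. The class names, the half-life, the threshold, and the keyword match are all hypothetical; a real system would likely use embeddings and a persistent store.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    strength: float   # grows each time the item is recalled
    last_used: float  # timestamp in hours (assumed unit)


class MemoryStore:
    def __init__(self, half_life_hours: float = 24.0, forget_below: float = 0.2):
        self.items: list[MemoryItem] = []
        self.half_life = half_life_hours
        self.forget_below = forget_below

    def store(self, text: str, now: float) -> None:
        self.items.append(MemoryItem(text, strength=1.0, last_used=now))

    def score(self, item: MemoryItem, now: float) -> float:
        # Exponential decay: the score halves every `half_life` hours of disuse.
        age = now - item.last_used
        return item.strength * 0.5 ** (age / self.half_life)

    def recall(self, keyword: str, now: float, k: int = 3) -> list[str]:
        hits = [m for m in self.items if keyword.lower() in m.text.lower()]
        hits.sort(key=lambda m: self.score(m, now), reverse=True)
        for m in hits[:k]:
            m.strength += 1.0  # reinforcement on use
            m.last_used = now
        return [m.text for m in hits[:k]]

    def forget(self, now: float) -> int:
        """Drop items whose decayed score is below the threshold; return how many."""
        before = len(self.items)
        self.items = [m for m in self.items if self.score(m, now) >= self.forget_below]
        return before - len(self.items)


memory = MemoryStore()
memory.store("User prefers metric units", now=0.0)
memory.store("Ticket 123 was about login errors", now=0.0)  # made-up example data

recalled = memory.recall("metric", now=48.0)
dropped = memory.forget(now=72.0)
print(recalled, dropped)
```

The item that was recalled at hour 48 survives the cleanup at hour 72, while the untouched one has decayed to 0.125 and is dropped. The point is not the specific formula but that storing, retrieving, reinforcing, and forgetting are explicit policies rather than side effects of transcript length.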
Tools and agents are context problems too
Tool use is often described as an action problem. The paper frames it as a context problem as well.
A model can only call a tool correctly if the relevant tool definition, schema, constraints, examples, and state are present in the right form. Too few tool details and the model cannot use the tool. Too many tool details and the model’s attention gets polluted.
The same applies to multi-agent systems. Communication protocols, coordination state, shared task history, local agent state, and handoff artifacts all become context. A multi-agent system fails when the wrong information moves between agents, or when too much information moves without structure.
That makes orchestration less mystical. A lot of agent engineering is context routing.
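The tool-exposure trade-off can be sketched as a routing step that decides which tool definitions enter the context for a given request. The tool names, tags, and keyword-overlap scoring below are hypothetical placeholders; a real router might use a classifier or embeddings instead.

```python
# Hypothetical tool catalog: names, descriptions, and tags are illustrative only.
TOOLS = {
    "search_orders": {
        "description": "Look up a customer's orders by email.",
        "tags": {"orders", "refund"},
    },
    "issue_refund": {
        "description": "Refund a specific order.",
        "tags": {"refund"},
    },
    "get_weather": {
        "description": "Return the forecast for a city.",
        "tags": {"weather"},
    },
}


def select_tools(request: str, tools: dict, max_tools: int = 2) -> list[str]:
    """Expose only tools whose tags overlap with words in the request."""
    words = set(request.lower().split())
    scored = [(len(spec["tags"] & words), name) for name, spec in tools.items()]
    relevant = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for determinism.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in relevant[:max_tools]]


exposed = select_tools("I want a refund for my last order", TOOLS)
print(exposed)  # ['issue_refund', 'search_orders']
```

Only the two refund-related definitions reach the model, and the weather tool stays out of the window. The same pattern applies to agent handoffs: decide what the receiver needs, then pass that and nothing else.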
Evaluation has not caught up
The paper argues that traditional evaluation is weak for context-engineered systems.
That is because these systems are not single-turn text generators. They combine retrieval, tools, memory, compression, state, planning, and sometimes multiple agents. They adapt over time. They fail through interaction effects.
A benchmark that checks whether the final answer is correct may miss the real failure: the retrieval step fetched the wrong document, the memory layer stored a stale fact, the tool schema was ambiguous, the model ignored a middle-of-context constraint, or one agent passed an incomplete handoff to another.
The implication is uncomfortable but true. Context engineering needs component-level tests and system-level tests. You need to test retrieval, ranking, compression, memory recall, tool selection, state transitions, and end-to-end outcomes.
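Component-level tests can be small. The sketch below shows two checks under illustrative assumptions: recall@k for a retrieval step, and a check that a compression step preserved the facts it must keep. The document IDs, the sample text, and the helper names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & relevant) / len(relevant)


def keeps_required(compressed: str, required: list[str]) -> bool:
    """True if every required fact survives compression (case-insensitive match)."""
    text = compressed.lower()
    return all(fact.lower() in text for fact in required)


# Retrieval check: two documents are known to be relevant for this query.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4"}
print(recall_at_k(retrieved, relevant, k=2))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))  # 1.0

# Compression check: the summary must still state the refund window.
summary = "Refund window: 30 days. Contact support by email."
print(keeps_required(summary, ["30 days"]))                        # True
print(keeps_required(summary, ["30 days", "original packaging"]))  # False
```

Checks like these sit alongside end-to-end tests rather than replacing them. They make it possible to say which stage failed when the final answer is wrong.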
Why It Happens
Context engineering is becoming important because LLM products are moving from isolated responses to operating loops.
A chatbot can survive with a good instruction prompt and a short conversation history. An agent cannot.
An agent has to remember prior work, inspect files, call tools, revise plans, use external sources, follow policies, coordinate with other agents, recover from partial failures, and explain what happened. Each of those capabilities adds context. Each new context source introduces a selection problem.
The default response is to include more.
More documents. More examples. More memory. More tool descriptions. More rules. More previous messages. More chain-of-thought-like scaffolding. More agent handoff notes.
At small scale, that can help. At production scale, it creates a different failure mode: the model is surrounded by plausible but competing material.
The paper’s core mechanism is information logistics. The model’s output depends on the input package. If the package contains the wrong information, or the right information in the wrong structure, the model’s latent capability is not enough.
That is why context engineering is a systems layer. It is the discipline that decides what gets into the package.
What This Means for Builders
Builders should stop treating context as a blob.
A context window should be assembled from typed components: instruction, policy, user request, retrieved evidence, memory, tool interface, current state, examples, and output constraints. Each component should have a budget, owner, loading rule, and test.
The most practical questions are boring:
What always loads?
What loads only when routed?
What can be summarized?
What must remain verbatim?
What can be retrieved lazily?
What must be excluded for privacy or safety?
What is the maximum token budget for each component?
What test proves that the system still works after a context change?
That is the engineering work.
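Several of these questions can be encoded directly as data. The sketch below assumes a small registry where each component declares an owner, a token budget, and a loading rule, plus two helpers: one that plans what loads for a given route and one that flags components over budget. All names, owners, routes, and limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSpec:
    name: str
    owner: str        # team responsible for this component (hypothetical)
    max_tokens: int   # per-component budget
    load: str         # "always" or "routed"
    routes: frozenset = frozenset()  # routes that trigger a "routed" component


SPECS = [
    ComponentSpec("policy", "compliance", 200, "always"),
    ComponentSpec("user_request", "product", 500, "always"),
    ComponentSpec("billing_tools", "payments", 300, "routed", frozenset({"billing"})),
    ComponentSpec("kb_evidence", "support", 1500, "routed", frozenset({"billing", "howto"})),
]


def plan(route: str, specs=SPECS) -> list[str]:
    """Names of the components that should load for this route."""
    return [s.name for s in specs if s.load == "always" or route in s.routes]


def over_budget(token_counts: dict[str, int], specs=SPECS) -> list[str]:
    """Components whose measured size exceeds their budget.
    Unregistered components have a limit of 0, so they are always flagged."""
    limits = {s.name: s.max_tokens for s in specs}
    return [name for name, n in token_counts.items() if n > limits.get(name, 0)]


print(plan("howto"))    # ['policy', 'user_request', 'kb_evidence']
print(plan("billing"))  # ['policy', 'user_request', 'billing_tools', 'kb_evidence']
print(over_budget({"policy": 150, "kb_evidence": 1800}))  # ['kb_evidence']
```

Because the rules are plain data, a context change becomes a reviewable diff, and the two helper functions double as the regression tests that prove the system still loads what it should.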
The paper also implies that context assembly should become more automated. Today, many systems rely on fixed heuristics: top-k retrieval, static memory summaries, hand-written tool lists, and manual prompt templates. The future version will be more dynamic. It will learn which context components improve which tasks, under which constraints.
But teams should be careful. “Intelligent context assembly” can become opaque quickly. The more dynamic the system becomes, the more it needs observability: what was included, what was excluded, why, and how that affected the result.
The useful builder posture is simple: make context inspectable.
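Inspectable context can start as a trace record written next to every model call. The sketch below assumes a simple in-memory structure serialized to JSON; the field names, the request ID, and the reasons are hypothetical, and a real system would send this to whatever logging or tracing backend it already uses.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContextTrace:
    request_id: str
    included: list = field(default_factory=list)
    excluded: list = field(default_factory=list)

    def include(self, name: str, tokens: int, reason: str) -> None:
        self.included.append({"name": name, "tokens": tokens, "reason": reason})

    def exclude(self, name: str, reason: str) -> None:
        self.excluded.append({"name": name, "reason": reason})

    def total_tokens(self) -> int:
        return sum(entry["tokens"] for entry in self.included)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


trace = ContextTrace(request_id="req-001")  # made-up identifier
trace.include("policy", tokens=180, reason="always loads")
trace.include("kb_evidence", tokens=950, reason="top-3 retrieval hits")
trace.exclude("chat_history", reason="over budget")

print(trace.total_tokens())  # 1130
print(trace.to_json())
```

A record like this answers the four questions in the paragraph above for any single call: what was included, what was excluded, why, and at what token cost. Linking it to the model output is what makes context-diff debugging possible.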
What This Means for Buyers and Operators
Buyers should ask vendors about context, not just models.
The model matters, but a weak context layer can waste a strong model. A strong context layer can also make a smaller or cheaper model good enough for many workflows.
The questions are concrete:
How does the system decide which documents, memories, tools, and policies enter the context?
Can an operator inspect the context used for a specific answer or action?
Does the product separate long-term memory from short-term task state?
How are stale memories retired?
How does the system handle prompt injection in retrieved content?
What happens when the context window fills up?
Does the system measure cost and latency by context component?
Can the organization set policies for what cannot be loaded into a model call?
How are tool definitions versioned and tested?
For operators, the paper is a warning against treating AI implementation as “connect model to data.”
The connection is the easy part. The hard part is deciding which data belongs in which interaction. Dumping a wiki, CRM, Slack history, support tickets, and tool catalog into an agent’s reach does not create intelligence. It creates a context-governance problem.
Good AI operations will look more like information architecture, observability, and workflow design than prompt tinkering.
What to Watch Next
The field should watch for context engineering to become explicit product infrastructure.
The useful signs will be practical: context budgets, context traces, retrieval audits, memory lifecycle controls, tool-schema tests, prompt-injection filters, context-diff debugging, and per-component cost reporting.
The field should also watch whether context assembly moves from fixed recipes to learned policies. If a system can learn that certain tasks need memory but not broad retrieval, or need a tool schema but not examples, it can become both cheaper and more reliable.
Multi-agent systems are another pressure point. As agents coordinate, the central question becomes what state each agent should know. Too little state and coordination fails. Too much shared state and the system becomes slow, expensive, and confused.
Finally, watch the comprehension-generation gap the authors highlight. The paper argues that current systems are increasingly good at understanding complex contexts but still struggle to generate equally sophisticated long-form outputs. That gap matters for research agents, report generation, legal analysis, and enterprise workflows where the answer is not a fact but a structured artifact.
Limitations and Caveats
This is a broad survey. Its value is the map, not a new experimental result.
Because it covers more than 1,400 papers, it necessarily compresses many debates. Readers should not treat every subfield summary as the final word. RAG, memory, tool use, multi-agent orchestration, long-context models, and evaluation each have their own unresolved arguments.
The term “context engineering” is also at risk of becoming vague. If it just means “anything around the model,” it will become another inflated label. The useful version is narrower: explicit control over the information payload passed to the model, including retrieval, selection, formatting, memory, state, tools, compression, and evaluation.
There is also a practical gap between taxonomy and operating practice. The paper explains what belongs in the field, but teams still need concrete implementation patterns: schemas, budgets, tests, observability, ownership, and incident review when context failures cause bad outputs.
The central conclusion still holds. As AI products become agents and workflows, the important work shifts from writing one better prompt to managing a whole information supply chain. Context engineering is a good name for that work, provided teams make it operational.
Source
Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu. (2025). A Survey of Context Engineering for Large Language Models. arXiv preprint arXiv:2507.13334. Available at: https://arxiv.org/abs/2507.13334