The paper’s practical point: agent skills are becoming real software artifacts, and many of them are already bloated enough to waste money, context, and model attention.

Source note: Yudong Gao, Zongjie Li, Yuanyuan Yuan, Zimo Ji, Pingchuan Ma, Shuai Wang. “SkillReducer: Optimizing LLM Agent Skills for Token Efficiency.” arXiv:2603.29919, 2026-03-31. https://arxiv.org/abs/2603.29919

Why This Paper Matters

Agent skills are supposed to make agents more efficient.

Instead of explaining the same workflow every time, the operator gives the agent a reusable skill: when to use it, what steps to follow, which files to read, which commands to run, and which examples matter. In theory, this should compress experience into a durable procedural artifact.

This paper argues that many skills do the opposite.

They are too verbose. They mix rules, background, examples, templates, and reference material in one always-loaded file. They sometimes lack a usable routing description, so the agent cannot even know when to invoke them. They can inject thousands of tokens into the context window before the task has earned that much attention.

That matters because context is not free. It costs money. It consumes scarce input budget. More importantly, it changes what the model pays attention to. A skill that loads too much material can make the agent worse, not better.

The paper gives this problem a useful name: skill bloat.

That is the right framing. Skills are no longer just prompt snippets. They are authored, shared, versioned, installed, and invoked like software packages. If they become software artifacts, they inherit software problems: bloat, dead weight, weak interfaces, implicit dependencies, stale components, and the need for build-time optimization.

The Idea in Plain English

The simplest version is this: a skill should be a sharp operating procedure, not a knowledge dump.

Most agent skill systems have two important layers.

The first is the routing description. This is the short text the runtime uses to decide whether a skill applies to the user’s request. If the description is missing, vague, or stuffed with irrelevant trigger phrases, the agent may fail to load the right skill or waste routing attention on the wrong one.

The second is the body. This is the instruction document that gets loaded when the skill is selected. The body should contain the rules the agent needs to act correctly. In practice, it often contains everything the author had lying around: rationale, tutorials, examples, boilerplate, templates, checklists, edge cases, reference docs, and sometimes repeated material.

SkillReducer treats this as a separation-of-concerns problem.

Keep the routing description compact and discriminative. Keep the always-loaded body focused on actionable core rules. Move examples, background, and templates into on-demand reference modules that the agent can read when the task actually needs them.

That sounds mundane, but it is exactly the kind of mundane discipline that turns a prompt pile into an operating system.

What the Researchers Tested

The paper has two parts.

First, the authors study the current skill ecosystem. They analyze 55,315 public GitHub skills, 100 curated SkillHub skills, and 620 community-shared skills. They look at descriptions, bodies, references, scripts, token counts, and the mix of content inside skill files.

Second, they build and evaluate SkillReducer, a two-stage optimization framework.

Stage 1 optimizes routing descriptions. If a description is missing or too short, the system generates one from the body. If it is verbose, the system compresses it. The interesting trick is that the authors use delta debugging: split the description into semantic clauses, remove pieces, and test whether the smaller version still routes correctly. The routing test includes similar distractor skills and an adversarial “shadow” skill designed to be confusable.
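The clause-removal idea can be sketched in a few lines of Python. This is a simplified delta-debugging loop, not the authors' implementation: `minimize_clauses` and `toy_router` are hypothetical names, and the keyword-matching oracle is a stand-in for the paper's routing test with distractor and shadow skills.

```python
from typing import Callable, List


def minimize_clauses(
    clauses: List[str],
    still_routes: Callable[[List[str]], bool],
) -> List[str]:
    """Simplified ddmin: drop chunks of clauses while routing still succeeds."""
    if not still_routes(clauses):
        raise ValueError("the full description must route correctly to begin with")
    n = 2  # how many chunks to split the clause list into
    while len(clauses) >= 2:
        chunk = max(1, len(clauses) // n)
        reduced = False
        for start in range(0, len(clauses), chunk):
            # Candidate description with one chunk removed.
            candidate = clauses[:start] + clauses[start + chunk:]
            if candidate and still_routes(candidate):
                clauses = candidate
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break  # no single clause can be removed any more
            n = min(n * 2, len(clauses))  # retry with finer-grained chunks
    return clauses


def toy_router(clauses: List[str]) -> bool:
    """Hypothetical oracle: the skill is selected only if both cues survive."""
    text = " ".join(clauses).lower()
    return "pdf" in text and "invoice" in text


description_clauses = [
    "Parses PDF documents",
    "supports many layouts",
    "fast and reliable",
    "use for invoice totals",
    "maintained by the data team",
]
minimal = minimize_clauses(description_clauses, toy_router)
print(minimal)
```

With this toy oracle the loop keeps only the two clauses that carry the routing signal. In a real system the oracle would be the expensive part, since each call means asking a router to choose among competing skills.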

Stage 2 optimizes the body. The system classifies chunks of the skill into categories: core rule, background, example, template, or redundant. Core rules stay in the always-loaded body. Background, examples, and templates are moved into on-demand modules with metadata explaining when to load them. The system also checks faithfulness, evaluates tasks, and can promote material back into the core if compression breaks behavior.
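A minimal sketch of the Stage 2 split, assuming the category labels already exist. In the paper the classification is done by the system itself; here `Chunk`, `CompiledSkill`, and `split_skill` are hypothetical names, and dropping redundant chunks outright is an assumption of this sketch rather than something the summary confirms.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five categories named in the summary above.
CATEGORIES = {"core_rule", "background", "example", "template", "redundant"}


@dataclass
class Chunk:
    text: str
    category: str  # assumed to come from an upstream classifier


@dataclass
class CompiledSkill:
    core: List[str] = field(default_factory=list)  # always-loaded body
    modules: Dict[str, List[str]] = field(default_factory=dict)  # on-demand


def split_skill(chunks: List[Chunk]) -> CompiledSkill:
    """Keep core rules in the body and route everything else to modules."""
    compiled = CompiledSkill()
    for chunk in chunks:
        if chunk.category not in CATEGORIES:
            raise ValueError(f"unknown category: {chunk.category}")
        if chunk.category == "core_rule":
            compiled.core.append(chunk.text)
        elif chunk.category == "redundant":
            continue  # assumption: redundant material is simply dropped
        else:
            compiled.modules.setdefault(chunk.category, []).append(chunk.text)
    return compiled


skill = split_skill([
    Chunk("Always run the linter before committing.", "core_rule"),
    Chunk("The linter was adopted after a 2023 outage.", "background"),
    Chunk("Example: `make lint && git commit`.", "example"),
    Chunk("Always run the linter before committing.", "redundant"),
])
print(skill.core)
print(sorted(skill.modules))
```

The point of the structure is that `core` is the only part loaded on every invocation. Each module would also need metadata describing when to load it, which this sketch leaves out.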

The main evaluation covers 600 skills: 87 official SkillHub skills, 464 community skills, and 49 wild GitHub skills. The authors compare three conditions: no skill, original skill, and compressed skill. They also evaluate on SkillsBench, across five different models, and with OpenCode as an independent agent framework.

What They Found

Many skills cannot route cleanly

The description layer is weaker than it should be.

Among 55,315 wild skills, 26.4% have no description at all. A broader 44.1% have a description that is either missing or under 20 tokens. That means the agent’s routing layer is looking at artifacts that are too thin to explain when they apply.

The opposite failure also appears. Many descriptions that do exist are padded with feature lists, redundant trigger phrases, or usage examples that do not help the router distinguish one skill from another.

This is a familiar interface problem. Too little metadata and the component cannot be selected. Too much generic metadata and the component is noisy. A skill description is not a miniature README. It is an invocation boundary.

Most skill bodies are not core instructions

The body layer has the bigger token problem.

The authors classify 15,107 paragraph-level items from 90 skills. Only 38.5% are actionable core rules. The rest is mostly background, examples, and templates: 40.7% background, 12.9% examples, 7.6% templates, and a small amount of redundant content.

That does not mean background and examples are useless. It means they should not be loaded by default.

This is one of the paper’s strongest product lessons. Skills need progressive disclosure. The agent should start with the procedure and fetch supporting material only when the task needs it.

References can quietly dominate the context budget

Most wild skills are single-file artifacts, but the skills with references can be expensive.

The paper reports that 14.8% of wild skills include reference files, and the 100 SkillHub skills collectively include 505 files totaling 1.67 million tokens. A single invocation can therefore pull in large documentation packs even when only a small part is relevant.

This is the same failure mode that hurt early RAG systems: “available” becomes “included,” and included becomes “attention tax.”
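A quick back-of-the-envelope calculation puts the reference figures in perspective. It uses only the numbers quoted above; because "1.67 million" is already rounded, the results are approximate, and averages say nothing about how unevenly the files are distributed across skills.

```python
# Figures quoted in the text for the 100 SkillHub skills.
TOTAL_REFERENCE_TOKENS = 1_670_000  # "1.67 million", rounded in the source
REFERENCE_FILES = 505
SKILLS = 100

tokens_per_file = TOTAL_REFERENCE_TOKENS / REFERENCE_FILES
tokens_per_skill = TOTAL_REFERENCE_TOKENS / SKILLS
files_per_skill = REFERENCE_FILES / SKILLS

print(f"average tokens per reference file: {tokens_per_file:.0f}")
print(f"average reference tokens per skill: {tokens_per_skill:.0f}")
print(f"average reference files per skill: {files_per_skill:.2f}")
```

That works out to roughly 3,300 tokens per file and about 16,700 reference tokens per skill on average, which is why loading references eagerly is so costly.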

Compression helped more often than it hurt

On 600 evaluated skills, SkillReducer reduces descriptions by 48% on average and bodies by 39% on average. The body reduction saves roughly 1,000 tokens per skill.

The quality result is more interesting. Compressed skills preserve or improve task score for 86.0% of skills. Scores improve for 25.3% of skills and regress for 14.0%. Average score rises from 0.722 with original skills to 0.742 with compressed skills.

The authors call this a less-is-more effect. Removing non-essential content can reduce distraction and make the agent follow the actual operating procedure more reliably.
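The reported percentages can be unpacked with simple arithmetic. The sketch below derives the "unchanged" share and approximate head counts from the figures quoted above. The counts are rounded estimates of this summary, not numbers taken from the paper.

```python
# Percentages quoted above for the 600 evaluated skills.
TOTAL_SKILLS = 600
preserved_or_improved = 86.0
improved = 25.3
regressed = 14.0

# Skills whose score stayed the same: preserved-or-improved minus improved.
unchanged = round(preserved_or_improved - improved, 1)

# Approximate head counts, rounded to whole skills.
counts = {
    "improved": round(TOTAL_SKILLS * improved / 100),
    "unchanged": round(TOTAL_SKILLS * unchanged / 100),
    "regressed": round(TOTAL_SKILLS * regressed / 100),
}

# Absolute gain in average task score, original versus compressed.
score_gain = round(0.742 - 0.722, 3)

print(unchanged, counts, score_gain)
```

Read this way, about 60.7% of skills score the same after compression, and improvements (roughly 152 skills) outnumber regressions (roughly 84). The average gain of 0.02 is small, so the stronger claim is that compression mostly does no harm while cutting tokens.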

They also argue that many apparent regressions are not true compression failures. Some skills are obsolete because the model performs well without them. Some differences are evaluation noise. The authors estimate that only 4.7% of all skills are true compression failures.

That estimate should be treated carefully, but the direction is credible: bloat is not just a cost problem. It can become a performance problem.

Structure-aware compression beats naive compression

The paper compares SkillReducer against LLMLingua, direct LLM compression, truncation, and random sentence removal under the same token budget.

SkillReducer wins.

The reason is intuitive. Naive compression treats the skill like a flat text sequence. It may remove linguistically predictable material that is operationally important. A terse checklist item can matter more than a paragraph of explanation, even if the paragraph looks richer to a language model.

SkillReducer’s advantage comes from preserving the distinction between rules, examples, background, templates, and references. It is not simply shorter. It is shorter in the right places.

Why It Happens

Skill bloat happens because skill authors are solving two jobs with one file.

One job is runtime control. The agent needs a compact procedure: when to use the skill, what to do, what not to do, what to verify, and when to stop.

The other job is human documentation. The author wants to explain the rationale, show examples, preserve templates, document edge cases, and make the skill understandable to future maintainers.

Both jobs are valid. The mistake is loading both into the model by default.

Software has had this separation for decades. Runtime code, tests, examples, comments, docs, fixtures, and generated artifacts do not all ship into the same hot path. Agent skills need the same distinction.

The paper’s taxonomy is useful because it gives skill systems a build step:

Core rules go into the hot path.

Examples become optional evidence.

Templates become fetchable artifacts.

Background becomes documentation.

References become indexed or routed modules.

That structure also explains the main failure mode. Sometimes an example is not merely an example. It is the only place the author encoded the real rule. The paper calls this out indirectly through its regression analysis: example-as-specification can break compression because the classifier moves an example out of core even though the agent needed it to infer behavior.

That is not only a classifier problem. It is an authoring problem. If a rule matters, it should be stated as a rule.

What This Means for Builders

The first takeaway is to treat skills as compiled artifacts.

The authoring format can be generous. Let people write examples, notes, background, and references. But the runtime format should be optimized. Before a skill is installed or published, it should pass through checks that produce a compact routing description, a minimal core, and on-demand supporting modules.

The second takeaway is that skill quality is not just about “good instructions.” It is about interfaces.

A skill needs a routing interface, a runtime core, dependency boundaries, reference loading rules, test cases, and retirement criteria. That is software engineering, not prompt polishing.

The third takeaway is to test compressed skills by behavior, not vibe.

The paper’s strongest safeguard is not the LLM rewriting step. It is the feedback loop: run tasks with the compressed skill, compare against the original, and promote missing material back into core when behavior drops. That is the right pattern for real systems. Compression without evaluation is just hopeful deletion.
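The feedback loop can be sketched as a small promotion routine. This is an illustration under stated assumptions, not the paper's algorithm: `promote_until_stable` is a hypothetical name, the greedy "promote whichever module helps most" policy is an assumption, and `toy_score` stands in for running real tasks against the skill.

```python
from typing import Callable, Dict, List, Tuple


def promote_until_stable(
    core: List[str],
    modules: Dict[str, str],
    score: Callable[[List[str]], float],
    baseline: float,
    tolerance: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Move on-demand modules back into core until the score recovers."""
    core = list(core)
    remaining = dict(modules)
    promoted: List[str] = []
    while remaining and score(core) < baseline - tolerance:
        # Greedy choice (an assumption): promote the module that helps most.
        best = max(remaining, key=lambda name: score(core + [remaining[name]]))
        core.append(remaining.pop(best))
        promoted.append(best)
    return core, promoted


REQUIRED = ["use ISO dates", "retry twice"]


def toy_score(core: List[str]) -> float:
    """Hypothetical evaluator: share of required behaviors present in core."""
    text = " ".join(core)
    return sum(phrase in text for phrase in REQUIRED) / len(REQUIRED)


compressed_core = ["Rule: use ISO dates"]
on_demand = {
    "background": "History of the tool",
    "example": "Example: on failure, retry twice",
}
final_core, promoted = promote_until_stable(
    compressed_core, on_demand, toy_score, baseline=1.0
)
print(promoted)
```

Here the example module gets promoted because it was the only place the retry rule lived, while the background module stays on demand. The baseline would normally be the original skill's measured score on the same tasks.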

Finally, builders should be cautious with generic prompt-compression tools. They can be useful, but skills are structured operational documents. Flattening them into tokens loses the part that matters: what is required, what is optional, and what should only be loaded on demand.

What This Means for Buyers and Operators

Buyers should ask how an agent platform manages skill bloat.

The obvious questions are not fancy:

How are skills routed?

How are descriptions reviewed?

Does the platform distinguish core instructions from examples and background?

Can references be loaded on demand?

Are skills tested after edits?

Can obsolete skills be detected and retired?

Does the system measure token cost per skill invocation?

These questions matter because skill libraries tend to grow. Every successful workflow becomes a candidate skill. Every team wants its own variation. Every product area accumulates procedures. Without lifecycle management, the library becomes a pile of plausible instructions competing for attention.

There is also a procurement lesson for smaller models. The paper’s cross-model tests suggest compressed skills can transfer across model families, with mean retention of 0.965. It also notes that weaker models may benefit more from skill content than stronger ones. That implies a good skill layer can be part of model-cost strategy.

But only if the skills are sharp.

A bad skill library can erase the savings. It burns context, creates routing ambiguity, and makes the agent carry organizational clutter into every task.

What to Watch Next

The field should watch whether skill systems develop a real build pipeline.

The useful version will look boring: lint skill metadata, test routing, split core from references, deduplicate files, measure invocation cost, run behavioral checks, and flag obsolete skills. That is how skills become infrastructure rather than artisanal Markdown.
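A first lint pass along these lines might look like the following sketch. Everything here is illustrative: `Skill`, `lint_skill`, and the finding names are hypothetical, the 20-token threshold is borrowed from the paper's "under 20 tokens" statistic, the body budget is an arbitrary placeholder, and whitespace splitting is only a rough proxy for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DESCRIPTION_TOKENS = 20  # borrowed from the "under 20 tokens" statistic
MAX_BODY_TOKENS = 1500       # arbitrary illustrative budget, not from the paper


def count_tokens(text: str) -> int:
    """Rough proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())


@dataclass
class Skill:
    name: str
    description: Optional[str]
    body: str


def lint_skill(skill: Skill) -> List[str]:
    """Return a list of finding codes; an empty list means no issues found."""
    findings: List[str] = []
    if not skill.description or not skill.description.strip():
        findings.append("missing-description")
    elif count_tokens(skill.description) < MIN_DESCRIPTION_TOKENS:
        findings.append("short-description")
    if count_tokens(skill.body) > MAX_BODY_TOKENS:
        findings.append("oversized-body")
    # Flag verbatim repeated paragraphs as a cheap deduplication check.
    paragraphs = [p.strip() for p in skill.body.split("\n\n") if p.strip()]
    if len(set(paragraphs)) < len(paragraphs):
        findings.append("duplicate-paragraphs")
    return findings


report = lint_skill(Skill("deploy", None, "Run the tests.\n\nRun the tests."))
print(report)
```

Checks like these are cheap and deterministic, so they can run on every edit. Routing tests and behavioral checks are more expensive and would sit later in the same pipeline.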

The field should also watch for independent replication. The paper’s main evaluation is large, but the task-based check the paper calls Gate 2 is both part of the optimization process and part of the evaluation story. External benchmarks, third-party skill libraries, and platform-native tests in Cursor, Windsurf, Claude Code, OpenCode, and OpenClaw would make the result stronger.

Finally, watch for authoring conventions that prevent example-as-specification. Skill systems should encourage authors to extract implicit rules from examples before the build step runs. If an example defines behavior, the rule should live in the core.

Limitations and Caveats

This is an arXiv preprint, not a settled standard.

The main evaluation uses Anthropic’s skill protocol. The paper includes OpenCode validation, but Cursor, Windsurf, and other agent environments are not directly tested.

Some quality claims depend on generated tasks and LLM-based evaluation. The authors mitigate this with separate models, deterministic SkillsBench checks, and judge validation against code-execution verifiers, but independent replication would still matter.

SkillsBench also has a ceiling effect in this paper: 87 of 87 tasks pass with original and compressed skills, but 86 of 87 pass even with no skill. That makes it useful as a regression check, not as strong evidence that skills improve performance.

The “only 4.7% true compression failures” claim is plausible but should be read as the authors’ decomposition, not a universal rate. Real organizations may have messier skills, more hidden dependencies, and more examples that encode unstated rules.

The central conclusion still holds: as skills become a reusable layer in agent systems, unmanaged skill bloat becomes an operating cost. The fix is not to stop writing skills. It is to compile, test, and retire them like real software artifacts.

Source

Yudong Gao, Zongjie Li, Yuanyuan Yuan, Zimo Ji, Pingchuan Ma, Shuai Wang. (2026). SkillReducer: Optimizing LLM Agent Skills for Token Efficiency. arXiv preprint arXiv:2603.29919. Available at: https://arxiv.org/abs/2603.29919