The paper’s practical point: agent skills are not just better prompts. They are reusable procedural software, and they need the same discipline as software supply chains.
Source note: Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. “SoK: Agentic Skills – Beyond Tool Use in LLM Agents.” arXiv:2602.20867, submitted February 24, 2026. https://arxiv.org/abs/2602.20867
Why This Paper Matters
The first wave of agent talk was about tools.
Give the model a browser. Give it a shell. Give it a database connector. Give it a calendar API. If the model can choose tools and chain them together, the agent starts to look useful.
This paper argues that tool use is not the mature abstraction. The next layer is the skill.
A tool is an atomic capability: search, write a file, run a query, call an API. A skill is a reusable way of doing work: debug this kind of failure, process this kind of report, inspect this kind of repository, fill this kind of form, run this kind of audit.
That distinction sounds small until the agent operates repeatedly. Without skills, every task is reconstructed from scratch inside a context window. With skills, procedural knowledge becomes reusable. The system can carry forward not just facts, but ways of acting.
The paper’s warning is that this turns skills into infrastructure. Once a skill can execute code, invoke tools, route work, or influence the agent’s reasoning, it becomes part of the operational attack surface. A bad skill is not a bad answer. It is a compromised procedure.
The Idea in Plain English
The authors define an agentic skill as a reusable callable module with four parts:
- An applicability condition: when should this skill run?
- An executable policy: what should it actually do?
- A termination condition: when is it done?
- A callable interface: how does the agent or another skill invoke it?
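The four parts map onto a small data structure. The sketch below is one hypothetical encoding, not the paper’s formalism: the `Skill` class, its field names, and the `max_steps` guard are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # assumption: whatever the agent tracks about the task


@dataclass(frozen=True)
class Skill:
    name: str                         # callable interface: how others refer to it
    applies: Callable[[State], bool]  # applicability condition
    step: Callable[[State], State]    # executable policy, one step at a time
    done: Callable[[State], bool]     # termination condition

    def __call__(self, state: State, max_steps: int = 20) -> State:
        """Run the policy until the termination condition holds."""
        if not self.applies(state):
            raise ValueError(f"skill {self.name!r} does not apply to this state")
        for _ in range(max_steps):
            if self.done(state):
                return state
            state = self.step(state)
        if self.done(state):
            return state
        raise RuntimeError(f"skill {self.name!r} did not finish in {max_steps} steps")


# Toy skill: count attempts until three have been made.
retry = Skill(
    name="retry_three_times",
    applies=lambda s: "attempts" in s,
    step=lambda s: {**s, "attempts": s["attempts"] + 1},
    done=lambda s: s["attempts"] >= 3,
)

result = retry({"attempts": 0})  # {"attempts": 3}
```

A bare tool would have only the interface. A prompt template would have only the policy. Keeping all four fields together is what lets the skill refuse to run, and know when to stop.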
That is the useful move in the paper. It separates skills from nearby ideas.
A tool has an interface, but no multi-step procedure. A plan decomposes one task, but usually disappears after the session. Memory stores what happened, but does not execute. A prompt template can shape behavior, but it has no clear completion boundary.
A skill combines reuse, execution, and governance.
In software terms, the paper treats skills less like “tips for the model” and more like library functions or operational playbooks. They package know-how. They can be selected, composed, tested, versioned, and revoked. They can also drift, conflict, over-trigger, or become malicious.
That is why the paper’s title matters: skills go beyond tool use because they are the procedural layer of agent systems.
What the Researchers Tested
This is a Systematization of Knowledge (SoK) paper, not a new model paper.
The authors review agent systems, skill libraries, tool-use frameworks, robotics skill work, software engineering agents, marketplace patterns, and recent benchmarks. They organize the space around three linked questions:
- What is a skill?
- How do real systems package and run skills?
- How should skills be governed and evaluated?
The paper contributes three main structures.
First, it proposes a lifecycle: discovery, practice, distillation, storage, retrieval and composition, execution, then evaluation and update.
Second, it offers seven design patterns for skill systems: metadata-driven disclosure, code-as-skill, workflow enforcement, self-evolving skill libraries, hybrid natural-language plus code macros, meta-skills, and plugin or marketplace distribution.
Third, it maps risks and evaluations: how skills can be attacked, how trust tiers might work, and how benchmarks can measure whether skills help.
What They Found
Skills sit between prompts and agents
The paper’s cleanest contribution is conceptual.
Agent systems have been using skill-like things under different names: playbooks, scripts, plugins, memories, policies, macros, workflows, MCP servers, and tool wrappers. The authors argue that these should be seen as variants of one layer.
That layer is procedural memory for agents.
This matters because procedural memory is different from context. A model may know a debugging pattern in the abstract, but a skill can carry the exact steps, checks, files, commands, and stop conditions that make the pattern repeatable in one environment.
For builders, the implication is direct: do not ask the model to rediscover an operating procedure every time. Turn proven procedures into skills with explicit entry and exit conditions.
Metadata-driven skills scale context, but create retrieval risk
One pattern is metadata-driven disclosure.
The agent sees compact metadata first: name, description, triggers, and maybe a short summary. The full instructions load only when the skill is selected. This keeps the context window small while allowing the system to know about many possible skills.
The tradeoff is that retrieval becomes a control point.
If the metadata is vague, misleading, or adversarial, the wrong skill can be selected. A malicious skill does not need to defeat the whole system at first. It only needs to look relevant enough to be loaded.
That makes skill descriptions part of the security boundary, not just documentation.
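A minimal sketch of the pattern, under stated assumptions: `SkillCard`, `SkillRegistry`, and the substring-based `select` are hypothetical names and a deliberately naive retrieval rule, not an API from the paper or any real framework. The point is only that metadata is always visible while full instructions load on selection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillCard:
    """Compact metadata the agent always sees (hypothetical schema)."""
    name: str
    description: str
    triggers: List[str]


class SkillRegistry:
    def __init__(self) -> None:
        self._cards: Dict[str, SkillCard] = {}
        self._loaders: Dict[str, Callable[[], str]] = {}
        self.loaded: List[str] = []  # audit trail of full loads

    def register(self, card: SkillCard, loader: Callable[[], str]) -> None:
        self._cards[card.name] = card
        self._loaders[card.name] = loader

    def catalog(self) -> List[SkillCard]:
        """What sits in the context window by default: metadata only."""
        return list(self._cards.values())

    def select(self, task: str) -> List[str]:
        """Naive retrieval: a skill matches if any trigger appears in the task."""
        text = task.lower()
        return [c.name for c in self._cards.values()
                if any(t.lower() in text for t in c.triggers)]

    def load(self, name: str) -> str:
        """Fetch full instructions only after selection, and record the load."""
        self.loaded.append(name)
        return self._loaders[name]()


registry = SkillRegistry()
registry.register(
    SkillCard("triage-flaky-test", "Diagnose intermittent test failures",
              ["flaky", "intermittent"]),
    loader=lambda: "1. Re-run the test 20 times. 2. Record the failure rate.",
)
registry.register(
    SkillCard("rotate-api-key", "Rotate a leaked credential",
              ["leaked key", "rotate"]),
    loader=lambda: "1. Revoke the old key. 2. Issue a new key.",
)

selected = registry.select("CI has a flaky integration test")
instructions = [registry.load(name) for name in selected]
```

Selection here depends entirely on the `triggers` field. A card that listed a very common word as a trigger would be loaded for unrelated tasks, which is the retrieval risk in miniature.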
Code skills are testable, but dangerous
Code-as-skill is the most software-like pattern.
The skill is an executable program: Python, shell, JavaScript, a notebook, a robot-control script, or some other callable artifact. This is attractive because code can be tested, sandboxed, linted, versioned, and composed.
It is also dangerous because code skills can touch real systems.
The paper’s governance argument is pragmatic: code skills need software controls. Static analysis, dependency scanning, sandboxing, version pinning, and permission boundaries matter because the agent may run the skill with access to files, credentials, browsers, databases, or production tools.
The more useful the skill, the more seriously its execution rights should be treated.
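Version pinning is the easiest of these controls to illustrate. The sketch below assumes a hypothetical `pins` mapping from skill name to the SHA-256 digest recorded at review time, and refuses to run anything that has changed since. The function and exception names are illustrative, and a real platform would combine this with sandboxing and permission checks.

```python
import hashlib
from typing import Dict


class PinMismatch(Exception):
    """Raised when a skill's content does not match its reviewed digest."""


def digest(source: str) -> str:
    """SHA-256 of the skill source, as a hex string."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def verify_pinned(name: str, source: str, pins: Dict[str, str]) -> None:
    """Refuse skills that are unpinned or that changed after review."""
    expected = pins.get(name)
    if expected is None:
        raise PinMismatch(f"{name}: no pinned digest, refusing to run")
    if digest(source) != expected:
        raise PinMismatch(f"{name}: content changed since review")


reviewed = "print('ok')\n"
pins = {"say-ok": digest(reviewed)}   # assumption: recorded when the skill was reviewed

verify_pinned("say-ok", reviewed, pins)  # passes silently

tampered = reviewed + "import os\n"
try:
    verify_pinned("say-ok", tampered, pins)
    blocked = False
except PinMismatch:
    blocked = True
```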
Workflow skills constrain the agent before it improvises
Workflow enforcement is a different pattern.
Instead of giving the agent a bag of reusable tricks, a workflow skill forces a sequence: reproduce the bug before fixing it, write tests before patching, run validation before declaring success, inspect logs before guessing root cause.
This pattern is less glamorous than self-improving agents, but it may be more dependable.
The reason is simple: many agent failures are not failures of intelligence. They are failures of process. The model skips diagnosis, trusts its first hypothesis, forgets to verify, or stops when the answer sounds plausible.
A workflow skill reduces the space for that kind of shortcut. It makes the agent behave more like a disciplined operator.
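One way to picture workflow enforcement is a small guard that rejects out-of-order steps. This is a sketch, not the paper’s mechanism: the `Workflow` class and the step names are hypothetical, and a real system would attach evidence to each step rather than trust a label.

```python
from typing import List


class WorkflowViolation(Exception):
    """Raised when the agent tries to skip or reorder a required step."""


class Workflow:
    def __init__(self, required_order: List[str]) -> None:
        self.required_order = required_order
        self.completed: List[str] = []

    def record(self, step: str) -> None:
        """Accept a step only if it is the next one in the required order."""
        index = len(self.completed)
        if index >= len(self.required_order):
            raise WorkflowViolation(f"workflow already complete; unexpected {step!r}")
        expected = self.required_order[index]
        if step != expected:
            raise WorkflowViolation(f"expected {expected!r} before {step!r}")
        self.completed.append(step)

    @property
    def finished(self) -> bool:
        return self.completed == self.required_order


bugfix = Workflow(["reproduce_bug", "write_failing_test", "apply_patch", "run_validation"])
bugfix.record("reproduce_bug")

try:
    bugfix.record("apply_patch")  # skips the failing test
    skipped = False
except WorkflowViolation:
    skipped = True
```

The agent can still choose how to reproduce the bug. It cannot choose whether to.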
Self-evolving skill libraries are promising, but still fragile
The most ambitious pattern is the self-evolving library.
After a successful task, the agent distills what worked into a new skill or updates an existing one. Over time, the library should become better. In constrained environments, such as games or robotics setups with clear success signals, this can work.
The paper is cautious about open-ended settings.
It cites benchmark evidence suggesting self-generated skills can hurt performance when they are admitted without strong verification. The problem is not that agents cannot generate useful procedures. The problem is that they also generate brittle, overbroad, redundant, or misleading procedures.
That creates skill debt.
Like technical debt, skill debt accumulates when artifacts are easy to create and hard to retire. A bloated skill library can slow retrieval, trigger the wrong procedure, conflict with newer practices, or make the agent worse than it was without skills.
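Skill debt is easier to manage when retirement is routine. The sketch below flags candidates from usage statistics. The `SkillStats` record and both thresholds (`min_uses`, `min_success_rate`) are assumptions chosen for illustration, not values from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillStats:
    name: str
    uses: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.uses if self.uses else 0.0


def retirement_candidates(stats: List[SkillStats],
                          min_uses: int = 5,
                          min_success_rate: float = 0.6) -> List[str]:
    """Flag skills that are never used, or that fail often once enough data exists."""
    flagged = []
    for s in stats:
        if s.uses == 0:
            flagged.append(s.name)  # never retrieved: dead weight in the index
        elif s.uses >= min_uses and s.success_rate < min_success_rate:
            flagged.append(s.name)  # used, but mostly fails
    return flagged


library = [
    SkillStats("parse-invoice", uses=40, successes=36),
    SkillStats("fix-any-bug", uses=12, successes=3),   # overbroad and unreliable
    SkillStats("legacy-deploy", uses=0, successes=0),  # never selected
    SkillStats("new-skill", uses=2, successes=0),      # too little data to judge
]

to_review = retirement_candidates(library)  # ["fix-any-bug", "legacy-deploy"]
```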
Marketplace distribution makes skills a supply-chain problem
The paper’s sharpest security section is about marketplace-distributed skills.
A marketplace can grow a skill ecosystem quickly. It also creates the classic package-registry problem: anyone can publish something that looks useful, users install it, and the code or instructions run inside an environment with valuable access.
The authors use the reported ClawHavoc campaign against OpenClaw’s ClawHub registry as a case study. The paper says researchers identified nearly 1,200 malicious skills, with attacks ranging from poisoned metadata and name-squatting to credential theft, reverse shells, and prompt-injection payloads embedded in skill documentation.
The exact case should be read as the paper’s reported example unless independently verified. The broader point does not depend on one incident: once skills are distributable packages, agent security inherits software supply-chain security and adds new language-model-specific failure modes.
A malicious package can attack through code. A malicious skill can attack through code, metadata, natural-language instructions, applicability triggers, and the agent’s own tendency to treat loaded instructions as relevant context.
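Name-squatting is one of the few attacks in that list with a cheap first-pass check. The sketch below uses the standard library’s `difflib.get_close_matches` to flag a submitted name that is suspiciously close to a trusted one. The function name `possible_squats` and the 0.85 cutoff are assumptions, and string similarity is a heuristic that does nothing against poisoned metadata or embedded prompt injection.

```python
import difflib
from typing import List


def possible_squats(candidate: str, trusted_names: List[str],
                    cutoff: float = 0.85) -> List[str]:
    """Return trusted names that the candidate closely resembles without matching."""
    if candidate in trusted_names:
        return []  # exact match is the trusted skill itself
    return difflib.get_close_matches(candidate, trusted_names, n=3, cutoff=cutoff)


trusted = ["git-helper", "pdf-reader"]

suspicious = possible_squats("git-helpr", trusted)   # one letter dropped
unrelated = possible_squats("csv-export", trusted)   # nothing similar
```

A non-empty result is a reason to route the submission to human review, not proof of malice.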
Why It Happens
Skills are powerful because they compress experience.
That compression is also the failure mode.
A good skill takes a messy sequence of decisions and turns it into a reusable artifact. But the artifact has to decide when it applies, what it can access, how long it should run, what success looks like, and how it composes with other skills.
Each of those is a governance question.
If applicability is too broad, the skill activates when it should not. If the policy is underspecified, the agent improvises inside the procedure. If termination is weak, the skill stops too early or loops too long. If the interface is misleading, other skills or agents call it incorrectly.
The paper’s formal definition is useful because it gives operators a checklist. Do not only ask, “What does this skill say?” Ask:
- Can it self-select safely?
- Can it execute predictably?
- Can it stop correctly?
- Can it be invoked without ambiguity?
That is the difference between a helpful instruction file and a governed procedural component.
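The checklist can be turned into a rough pre-admission review. The sketch below assumes a hypothetical skill manifest with `triggers`, `steps`, `stop_when`, and `inputs` keys. Those names, and the crude "trigger shorter than four characters is too broad" rule, are illustrative assumptions, not a schema from the paper.

```python
from typing import Any, Dict, List


def review_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return one finding per checklist question the manifest fails."""
    findings = []
    triggers = manifest.get("triggers", [])
    if not triggers or any(len(t.strip()) < 4 for t in triggers):
        findings.append("applicability: missing or overly broad triggers")
    if not manifest.get("steps"):
        findings.append("policy: no explicit steps")
    if not manifest.get("stop_when"):
        findings.append("termination: no stop condition")
    if "inputs" not in manifest:
        findings.append("interface: inputs are undeclared")
    return findings


vague = {"triggers": ["fix"], "steps": ["look around and fix it"]}

governed = {
    "triggers": ["flaky test", "intermittent failure"],
    "steps": ["re-run the test", "record the failure rate", "bisect recent commits"],
    "stop_when": "failing commit identified or 20 runs pass",
    "inputs": ["test_id"],
}

vague_findings = review_manifest(vague)
governed_findings = review_manifest(governed)
```

A check like this catches only structural gaps. Whether the steps are actually correct still needs tests and review.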
What This Means for Builders
The strongest practical takeaway is to treat skill creation as software engineering.
Skills need review, tests, versioning, ownership, and retirement. They also need permission design. A skill that reads docs should not inherit the same access as a skill that runs shell commands. A skill loaded for instruction should not automatically get execution rights.
The paper proposes trust tiers that are useful for product design:
- Metadata only: the agent can see the skill name and description.
- Instruction access: the agent can read the skill’s guidance.
- Supervised execution: the skill can act, but with approval or sandboxing.
- Autonomous execution: the skill can run within predefined boundaries.
This should shape real agent platforms. Trust should not be binary. Loading a skill, reading a skill, and executing a skill are different events. They deserve different permissions.
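The tiers translate directly into an ordered permission check. In the sketch below, the tier names follow the list above, while the `REQUIRED_TIER` action map and the `authorize` function are hypothetical. Supervised execution is modeled as requiring explicit approval only, which simplifies the "approval or sandboxing" option.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    METADATA_ONLY = 0
    INSTRUCTION_ACCESS = 1
    SUPERVISED_EXECUTION = 2
    AUTONOMOUS_EXECUTION = 3


# Assumption: three platform actions, each with a minimum tier.
REQUIRED_TIER = {
    "list": TrustTier.METADATA_ONLY,
    "read": TrustTier.INSTRUCTION_ACCESS,
    "execute": TrustTier.SUPERVISED_EXECUTION,
}


def authorize(action: str, tier: TrustTier, approved: bool = False) -> bool:
    """Allow an action only if the skill's tier covers it."""
    if tier < REQUIRED_TIER[action]:
        return False
    if action == "execute" and tier == TrustTier.SUPERVISED_EXECUTION:
        return approved  # supervised skills act only with explicit approval
    return True


can_read = authorize("read", TrustTier.INSTRUCTION_ACCESS)            # True
can_run_unread = authorize("execute", TrustTier.INSTRUCTION_ACCESS)   # False
needs_approval = authorize("execute", TrustTier.SUPERVISED_EXECUTION) # False
with_approval = authorize("execute", TrustTier.SUPERVISED_EXECUTION, approved=True)  # True
```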
The other practical point is evaluation. The paper emphasizes deterministic harnesses: do not grade a skill by whether its reasoning sounds nice. Grade it by whether the environment ended up in the right state.
For code skills, that can be tests. For web skills, it can be state verification. For workflow skills, it can be evidence that required steps were completed. For security-sensitive skills, it should include adversarial cases and permission checks.
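A state-based grader can be very small. The sketch below sets up a temporary directory, runs a skill, and checks the resulting files while ignoring whatever the skill says about itself. The two toy skills and the `grade` function are hypothetical examples, not part of any benchmark the paper cites.

```python
import tempfile
from pathlib import Path
from typing import Callable


def rename_config_skill(workdir: Path) -> str:
    """Toy skill under test: rename settings.txt to config.txt."""
    (workdir / "settings.txt").rename(workdir / "config.txt")
    return "Renamed the file."


def talkative_noop(workdir: Path) -> str:
    """Toy skill that reports success without changing anything."""
    return "All done, everything looks great."


def grade(skill: Callable[[Path], str]) -> bool:
    """Pass only if the environment ends in the expected state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "settings.txt").write_text("debug=true\n")
        skill(workdir)  # the returned narrative is deliberately ignored
        target = workdir / "config.txt"
        return (
            target.exists()
            and not (workdir / "settings.txt").exists()
            and target.read_text() == "debug=true\n"
        )


passed = grade(rename_config_skill)
failed = grade(talkative_noop)
```

The second skill sounds more confident than the first. The grader does not care.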
What This Means for Buyers and Operators
Buyers should stop asking only whether an agent has tools.
The better questions are about the skill layer:
- Which procedures are reusable?
- Who authored them?
- How are they reviewed?
- What can they access?
- How are they tested?
- How are they updated or revoked?
- What happens when a skill conflicts with the current task?
This matters because skills can make smaller or cheaper models more useful. The paper summarizes SkillsBench evidence where curated skills raised average pass rates by 16.2 percentage points, from 24.3% to 40.6%. It also reports that smaller models with curated skills can sometimes match or beat larger models without skills.
That is the optimistic story: skills convert human procedural knowledge into reusable leverage.
The pessimistic story is that weak skills make agents worse. The same benchmark summary says self-generated skills averaged 1.3 percentage points below the no-skill baseline. Some tasks degraded sharply when irrelevant or conflicting skills activated.
The buyer lesson is blunt: “supports skills” is not enough. The value comes from curated, scoped, verified skills. An ungoverned skill marketplace may be worse than no marketplace at all.
What to Watch Next
The field should watch whether agent platforms build skill governance before skill ecosystems explode.
The obvious signs are provenance signing, review workflows, sandboxing, dependency scans, version pinning, trust tiers, and usage monitoring. Those are boring controls, which is exactly why they matter.
The field should also watch verification-gated self-generation. The dream is not an agent that writes endless skills. The dream is an agent that can propose a skill, test it against held-out tasks, prove it does not expand permissions, and only then admit it to the library.
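A verification gate of that kind might look like the sketch below. Everything here is an assumption for illustration: the `Proposal` record, the `admit` function, the 0.9 pass-rate threshold, and the idea that permissions are declared as a set of strings. The permission step is a comparison of declared permissions, not a proof about what the code can actually do.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Proposal:
    name: str
    run: Callable[[str], str]
    permissions: FrozenSet[str]  # declared by the proposing agent


def admit(proposal: Proposal,
          held_out: List[Tuple[str, str]],
          allowed_permissions: FrozenSet[str],
          min_pass_rate: float = 0.9) -> Tuple[bool, str]:
    """Admit a proposed skill only if it stays in scope and passes held-out tasks."""
    extra = proposal.permissions - allowed_permissions
    if extra:
        return False, f"requests new permissions: {sorted(extra)}"
    if not held_out:
        return False, "no held-out tasks to verify against"
    passed = sum(1 for given, expected in held_out if proposal.run(given) == expected)
    rate = passed / len(held_out)
    if rate < min_pass_rate:
        return False, f"held-out pass rate {rate:.0%} is below the threshold"
    return True, "admitted"


def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")


held_out = [("Hello World", "hello-world"), ("  Skill Debt ", "skill-debt")]
allowed = frozenset({"read_files"})

ok = admit(Proposal("slugify", slugify, frozenset()), held_out, allowed)
greedy = admit(Proposal("slugify", slugify, frozenset({"network"})), held_out, allowed)
broken = admit(Proposal("upper", str.upper, frozenset()), held_out, allowed)
```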
Finally, watch whether skill libraries stay small and sharp. The paper’s SkillsBench discussion suggests focused skills can help more than comprehensive documentation. That is an important product lesson. A skill is not a knowledge dump. It is a compressed procedure.
Limitations and Caveats
This paper organizes a young field. Its taxonomy is useful, but it will not be final.
Many of the systems it discusses are recent, and some evidence comes from benchmarks or incidents that still need independent replication. The authors treat the SkillsBench numbers as strong evidence, not settled law. The ClawHavoc case study should also be read as a reported security example from the paper, not as a claim independently verified here.
There is also a boundary problem. In real products, the line between a prompt, a workflow, a tool wrapper, an MCP server, and a skill can blur. The paper’s definition helps, but platform implementations will still vary.
The main conclusion survives those caveats: if agents are going to act repeatedly in real environments, reusable procedural knowledge will become a first-class layer. Once it does, skill quality and skill governance become product requirements, not optional polish.
Source
Jiang, Y., Li, D., Deng, H., Ma, B., Wang, X., Wang, Q., and Yu, G. (2026). SoK: Agentic Skills – Beyond Tool Use in LLM Agents. arXiv preprint arXiv:2602.20867. Available at: https://arxiv.org/abs/2602.20867