A new paper argues that the instructions around an AI agent should not be treated as a one-time prompt. They can be trained and shipped like a small operating asset.
Source note: Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, and Chong Luo. “SkillOpt: Executive Strategy for Self-Evolving Agent Skills.” arXiv:2605.23904v2, revised May 25, 2026. https://arxiv.org/abs/2605.23904
Why This Paper Matters
Most teams still treat agent skills as written instructions. Someone writes a SKILL.md, a system prompt, or a tool-use guide. Maybe another model generates a first draft. Maybe the team edits it after a failure. But the process is usually informal. The skill improves because a person notices something, not because the skill itself is being trained against evidence.
That is a problem if skills become part of how serious agent systems work. A skill might tell a coding agent how to use a repo, tell a spreadsheet agent how to inspect workbooks, or tell a research agent how to cite sources. If that skill is wrong, vague, bloated, or stale, the agent inherits the weakness every time it runs.
SkillOpt matters because it treats the skill as the trainable state around a frozen model. The model weights do not change. The deployment harness does not change. The thing that changes is the compact text artifact that tells the agent how to behave in a domain.
That sounds modest, but it points to a practical middle layer between prompt engineering and fine-tuning. Instead of asking whether every improvement requires a new model, the paper asks whether some improvements can live in reusable, validated skills.
The Idea in Plain English
The easiest analogy is training a field manual.
Imagine a support team that writes a short manual for handling a class of tickets. The first version is generic. After seeing enough real tickets, the team notices recurring mistakes: missing account checks, wrong escalation paths, weak final summaries. A good manager would not rewrite the whole manual every day. They would make small edits, test whether those edits improve outcomes, keep the ones that work, and reject the ones that make things worse.
SkillOpt applies that logic to AI-agent skills.
The target agent runs tasks with the current skill. The system collects scored rollouts: what the agent did, where it succeeded, where it failed, and what feedback the benchmark produced. A separate optimizer model reads those rollouts and proposes bounded edits to the skill document. Those edits can add, delete, or replace text. The candidate skill then gets tested on a held-out validation split. It is accepted only if it improves the validation score.
The deployed result is not a new model and not a larger runtime agent loop. It is just a better best_skill.md.
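The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Edit`, `apply_edits`, `optimize_skill`, `toy_propose`, and `toy_score` are hypothetical names. The toy scorer stands in for running the target agent on a held-out validation split, and the toy proposer stands in for an optimizer model that reads scored rollouts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One bounded edit to the skill text (hypothetical schema)."""
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing line to delete or replace
    text: Optional[str] = None    # new line for add or replace


def apply_edits(skill: List[str], edits: List[Edit]) -> List[str]:
    """Return a new skill with the edits applied; the input is not mutated."""
    lines = list(skill)
    for e in edits:
        if e.op == "add" and e.text is not None:
            lines.append(e.text)
        elif e.op == "delete" and e.target in lines:
            lines.remove(e.target)
        elif e.op == "replace" and e.target in lines and e.text is not None:
            lines[lines.index(e.target)] = e.text
    return lines


def optimize_skill(skill, propose, score, steps):
    """Propose-test-reject loop: keep a candidate only if validation improves."""
    best, best_score = list(skill), score(skill)
    for step in range(steps):
        candidate = apply_edits(best, propose(best, step))
        candidate_score = score(candidate)
        if candidate_score > best_score:  # strict comparison, so ties are rejected
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins (assumptions): the score is the fraction of required habits
# that the skill mentions; a real system would score agent rollouts instead.
REQUIRED = {"inspect the workbook", "check output cells"}


def toy_score(skill):
    return sum(1 for line in skill if line in REQUIRED) / len(REQUIRED)


def toy_propose(skill, step):
    proposals = [
        [Edit("add", text="inspect the workbook")],
        [Edit("add", text="be thorough")],  # sounds plausible, no measured gain
        [Edit("replace", target="be careful", text="check output cells")],
    ]
    return proposals[step % len(proposals)]


best, best_score = optimize_skill(["be careful"], toy_propose, toy_score, steps=3)
print("\n".join(best), best_score)
```

In this toy run the second proposal is rejected because it leaves the score unchanged, while the first and third are accepted. The list that remains is what would be written out as the deployed skill file.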
What the Researchers Tested
The researchers tested SkillOpt across six benchmark families:
- SearchQA for question answering.
- SpreadsheetBench for spreadsheet reasoning and code-assisted workbook edits.
- OfficeQA for local-document and tool-use tasks.
- DocVQA for visual document question answering.
- LiveMathematicianBench for math multiple-choice reasoning.
- ALFWorld for embodied sequential decision making.
They tested seven target models, from frontier-scale GPT systems to smaller Qwen models. They also tested three execution modes: direct chat, Codex, and Claude Code. That harness coverage matters because a skill that only works in a clean chat prompt is less useful than one that survives tool-use environments.
The comparison set was broad. SkillOpt was tested against no-skill runs, human-written skills, one-shot LLM-generated skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. The paper asks a sharper question than whether a skill helps: does a validation-gated skill optimizer beat several plausible ways of creating or improving instructions?
What They Found
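The headline result is strong. Across six benchmarks, seven target models, and three harnesses, SkillOpt was best or tied for best in all 52 evaluated model-benchmark-harness cells. Note that 52 is a subset of the 126 combinations those three dimensions allow, so not every pairing was run.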
On GPT-5.5, the reported average improvement over no skill was 23.5 percentage points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code. The paper also reports that SkillOpt beat the strongest per-cell competitor among human skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill.
The biggest gains came from procedural work
SkillOpt helped most where the agent needed repeatable operating discipline more than background knowledge. SpreadsheetBench and OfficeQA are good examples. These tasks reward habits like inspecting the actual workbook, preserving formatting, filling the complete target range, checking output cells, and respecting answer formats.
That is exactly the kind of thing a skill can encode. It is also the kind of thing a strong model may fail to do consistently without a domain-specific operating procedure.
The skill stayed portable
The paper reports transfer experiments across model scales, across Codex and Claude Code harnesses, and from one math benchmark to another nearby math benchmark. The transferred skills still helped.
That does not mean every skill will transfer everywhere. It does suggest the optimized artifact was doing more than memorizing one benchmark’s examples. In the strongest interpretation, the skill captured reusable procedure: how to inspect a workbook, how to bind answers to evidence, how to avoid loops, or how to preserve output constraints.
The deployed artifact stayed small
One useful detail is that the optimized skills remained compact. The training process may examine many rollouts and reject many candidate edits, but the deployed artifact is still a readable text file. The paper describes the final skills as small enough for a domain practitioner to inspect and audit.
That is the operational appeal. The cost and complexity happen during training. Deployment gets a compact instruction asset with no extra optimizer calls at inference time.
Why It Happens
SkillOpt works because it borrows the discipline of training without requiring model-weight updates.
First, it uses a held-out validation gate. The optimizer model is not allowed to revise the skill just because its explanation sounds plausible. The revised skill has to improve the validation score. Ties are rejected. That keeps the skill from drifting through well-written but unhelpful edits.
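As a rough sketch of what such a gate involves, the snippet below separates tasks into a training pool and a held-out validation pool, then accepts a candidate only on a strict improvement in mean validation score. The function names, the 30 percent split, and the use of a simple mean are assumptions for illustration; the paper's exact split sizes and scoring are not reproduced here.

```python
import random
from statistics import mean


def split_tasks(tasks, val_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and hold out a validation slice."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def passes_gate(current_scores, candidate_scores) -> bool:
    """Accept only a strict improvement on the held-out tasks; ties fail."""
    return mean(candidate_scores) > mean(current_scores)


train, validation = split_tasks(range(10))
print(len(train), len(validation))
print(passes_gate([1, 0, 1], [1, 0, 1]))  # identical scores: rejected
print(passes_gate([1, 0, 1], [1, 1, 1]))  # higher mean: accepted
```

Keeping the validation tasks out of the pool the optimizer learns from is what makes the comparison meaningful: an edit that only helps on tasks the optimizer has already seen does not pass.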
Second, it uses a textual learning-rate budget. Instead of letting the optimizer rewrite the whole skill, SkillOpt limits how many add, delete, or replace edits can be applied at each step. The budget can decay over time, so early training can make larger moves and later training can consolidate.
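One way to picture the budget is as a cap on the number of edits per step that shrinks over time. The sketch below assumes an exponential decay with a floor of one edit; the schedule shape, the starting value of 8, and the names `edit_budget` and `clip_to_budget` are illustrative assumptions rather than the paper's settings.

```python
def edit_budget(step: int, initial: int = 8, decay: float = 0.5, floor: int = 1) -> int:
    """Maximum number of add/delete/replace edits allowed at this step.

    Assumed schedule: exponential decay from `initial`, never below `floor`.
    """
    return max(floor, int(initial * decay ** step))


def clip_to_budget(edits: list, budget: int) -> list:
    """Keep only the first `budget` proposed edits and drop the rest."""
    return edits[:budget]


schedule = [edit_budget(step) for step in range(5)]
print(schedule)

proposed = ["edit-a", "edit-b", "edit-c", "edit-d"]
print(clip_to_budget(proposed, edit_budget(2)))
```

With these assumed values the cap goes 8, 4, 2, 1, 1, so early steps can restructure the skill while later steps can change only a line at a time.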
Third, it keeps rejected edits as negative feedback. A bad edit is not silently forgotten. The optimizer can use that rejection history to avoid repeating the same mistake in the next round.
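A simple way to carry that history forward is to keep a short log of rejected edits and render it into the optimizer's next prompt. The class below is a hypothetical sketch: the name `RejectionLog`, the bounded window, and the prompt wording are assumptions, not the paper's format.

```python
from collections import deque


class RejectionLog:
    """Bounded memory of rejected edits, rendered as text for the optimizer."""

    def __init__(self, max_items: int = 5):
        # A deque with maxlen silently drops the oldest entry when full.
        self._items = deque(maxlen=max_items)

    def record(self, edit_summary: str, score_before: float, score_after: float) -> None:
        self._items.append((edit_summary, score_before, score_after))

    def as_prompt_section(self) -> str:
        if not self._items:
            return "No rejected edits so far."
        lines = ["Previously rejected edits (do not repeat):"]
        for summary, before, after in self._items:
            lines.append(f"- {summary} (validation {before:.2f} -> {after:.2f})")
        return "\n".join(lines)


log = RejectionLog(max_items=2)
log.record("add: always answer in bullet points", 0.50, 0.45)
log.record("delete: check output cells", 0.50, 0.30)
log.record("replace: shorten final summary", 0.50, 0.50)
print(log.as_prompt_section())
```

Bounding the log keeps the optimizer prompt from growing without limit, at the cost of forgetting older rejections.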
Fourth, it uses a slower epoch-level meta update. The fast loop learns from recent rollouts. The slower loop tries to preserve patterns that hold across longer stretches of training. The paper separates this optimizer-side memory from the deployed skill, which keeps the final artifact smaller and cleaner.
The practical point is that SkillOpt goes beyond “ask an LLM to improve the prompt.” It is a propose-test-reject loop for a text artifact.
What This Means for Builders
For builders, the paper suggests that skills should be managed like product infrastructure, not like loose prompt snippets.
The first implication is that skill changes need evaluation. If a skill controls agent behavior in coding, research, finance, operations, or customer workflows, a better-sounding instruction is not enough. Teams need a way to replay representative tasks and check whether the skill improves outcomes.
The second implication is that skills should be versioned and audited. A skill that changes over time is closer to a model artifact than a static README. Teams should know which examples drove the change, which edits were accepted, which edits were rejected, and what validation set guarded the release.
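As an illustration of what such an audit trail could record, the sketch below bundles a skill release with its validation set identifier, score, and edit history, and adds a content hash so a deployed file can be matched to its record. The `SkillRelease` schema and all field names are hypothetical; nothing here is prescribed by the paper.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SkillRelease:
    """Audit record for one released skill version (hypothetical schema)."""
    version: str
    skill_text: str
    validation_set_id: str
    validation_score: float
    accepted_edits: List[str] = field(default_factory=list)
    rejected_edits: List[str] = field(default_factory=list)

    def content_hash(self) -> str:
        """SHA-256 of the skill text, used to tie a file to this record."""
        return hashlib.sha256(self.skill_text.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        record = asdict(self)
        record["sha256"] = self.content_hash()
        return json.dumps(record, sort_keys=True, indent=2)


release = SkillRelease(
    version="1.3.0",
    skill_text="inspect the workbook\ncheck output cells\n",
    validation_set_id="spreadsheet-val-v2",  # placeholder identifier
    validation_score=0.82,                   # placeholder value
    accepted_edits=["add: check output cells"],
    rejected_edits=["add: be thorough"],
)
print(release.to_json())
```

Storing records like this next to the skill file in version control gives reviewers the edit history and the guarding validation set in one place.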
The third implication is that skill libraries may become a real operating surface. If optimized skills can transfer across models and harnesses, then a company can build domain-specific skills that outlive a single model release. The model may change, but the audited procedure for handling spreadsheets, source citations, deployment checks, or customer escalations can remain useful.
The fourth implication is cost placement. SkillOpt spends extra computation during training, but the deployed skill does not require extra inference-time optimizer calls. That matters in production systems where latency, cost, and predictability are not footnotes.
What This Means for Buyers and Operators
For buyers and operators, the paper gives a better set of questions to ask vendors.
Do not stop at asking whether the agent has “skills” or “memory.” Ask how those skills are evaluated. Are they hand-written? Generated once? Revised after failures? Tested against held-out tasks? Versioned? Can a buyer or operator inspect them?
The distinction matters because a self-improving skill system can fail quietly if it has weak gates. If the validation set is too narrow, the skill may overfit. If the scorer is superficial, the skill may learn to optimize the wrong behavior. If humans cannot inspect the final artifact, operators may not know what rules the agent has learned.
SkillOpt’s strongest operator message is that improvement should be evidence-backed and reversible. A skill should not become more trusted simply because it has been edited many times. It should become more trusted because accepted edits repeatedly improved held-out performance while leaving the artifact readable.
What to Watch Next
The field should watch whether this style of skill optimization moves from benchmarks into messy business workflows.
The hard cases will not look like clean academic tasks. They will involve subjective quality, partial credit, conflicting policies, sparse feedback, and human review. A sales research skill, a support triage skill, or a security investigation skill may not have a simple exact-match scorer. Skill optimization in those domains will need stronger evaluation design, including human judgment, model-graded rubrics, and rollback procedures.
Skill-library design is the other thing to watch. This paper optimizes one compact skill for one target domain. Real systems may need many skills with boundaries, dependencies, ownership, and release gates. That turns skill management into a governance problem as much as an optimization problem.
Finally, researchers should test longer-horizon transfer. It is promising that optimized skills transferred across model scales and between Codex and Claude Code in the paper. The next question is whether skills remain useful after models, tools, APIs, and business processes change.
Limitations and Caveats
The results are benchmark-driven. The paper uses held-out test splits and a broad set of tasks, but the method still depends on having reliable scored trajectories. Domains with subjective outcomes, safety constraints, or long-delayed feedback will be harder.
The validation gate is only as good as the validation design. A poor selection split can accept the wrong edits. A narrow benchmark can reward brittle heuristics. A misleading scorer can teach the skill the wrong lesson.
The method also adds training-time cost. The deployed skill is compact, but getting there requires rollouts, optimizer calls, validation checks, and management of rejected edits. That is likely worth it for reusable, high-value skills. It is probably overkill for one-off tasks.
The right takeaway is not that agent skills can safely rewrite themselves in production. The better takeaway is that skills can be improved through a controlled offline loop, with bounded edits, held-out validation, auditability, and human review for workflows that can affect real users or systems.
Source
Yang, Y., Gong, Z., Huang, W., Yang, Q., Zhou, Z., Huang, Z., Li, Y., Gao, X., Dai, Q., Liu, B., Qiu, K., Yang, Y., Chen, D., Yang, X., & Luo, C. (2026). SkillOpt: Executive Strategy for Self-Evolving Agent Skills. arXiv preprint arXiv:2605.23904. Available at: https://arxiv.org/abs/2605.23904