1. The Orchestration Tax — Addy Osmani
- Why read: A mental model for the hidden costs of running multiple AI agents.
- Summary: Spinning up agents is cheap. Reviewing their output is expensive. This "orchestration tax" works like Amdahl's Law in concurrent programming: your human judgment is the serial bottleneck limiting total throughput. Running 20 agents yields nowhere near 20x the output if a human must verify everything. To scale agents, build systems that reduce the cognitive load of review rather than blindly increasing parallel generation.
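The Amdahl's Law comparison in item 1 can be made concrete with a rough calculation. This is a minimal sketch, not a figure from the article: `REVIEW_FRACTION` is a hypothetical value that assumes 30% of the total work is serial human review that cannot be parallelized.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup = 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Assumption for illustration only: 30% of the work is human review.
REVIEW_FRACTION = 0.30

if __name__ == "__main__":
    for agents in (1, 5, 20, 100):
        speedup = amdahl_speedup(REVIEW_FRACTION, agents)
        print(f"{agents:>3} agents -> {speedup:.2f}x")
```

Under that assumption, 20 agents give roughly a 3x speedup, and no number of agents can exceed 1 / 0.30, or about 3.33x. Shrinking the review fraction raises the ceiling, which is the article's argument for investing in review tooling.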
2. Claude Code's Dynamic Workflows: the model writes its own harness — Necø
- Why read: Shows how AI models now write and run their own multi-agent coordination scripts.
- Summary: Anthropic's Dynamic Workflows feature lets Claude write JavaScript on the fly to coordinate large fleets of subagents. Instead of using static, human-written harnesses, the model plans a workflow, runs parallel stages, validates typed JSON outputs, and reports back. This converts large tasks like bug hunts and refactors from slow serial operations into parallel problems with independent verification. AI coding is moving from generating raw logic to building the systems that orchestrate other agents.
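The plan, fan out, validate, report loop described in item 2 might look something like the sketch below. The article says Claude writes JavaScript; Python is used here only for illustration. `run_subagent`, `REQUIRED_FIELDS`, and `bug_hunt` are hypothetical names, and the subagent is a stub that returns canned JSON instead of calling a model.

```python
import asyncio
import json

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"file": str, "bug_found": bool, "note": str}


async def run_subagent(path: str) -> str:
    """Stand-in for a real subagent call; returns a JSON string."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return json.dumps(
        {"file": path, "bug_found": path.endswith("_legacy.py"), "note": "stub result"}
    )


def validate(raw: str) -> dict:
    """Parse a subagent reply and check every required field and type."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("subagent output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


async def bug_hunt(paths: list[str]) -> list[dict]:
    """Run one subagent per file in parallel, then validate each reply."""
    raw_results = await asyncio.gather(*(run_subagent(p) for p in paths))
    return [validate(raw) for raw in raw_results]


if __name__ == "__main__":
    for result in asyncio.run(bug_hunt(["auth.py", "billing_legacy.py"])):
        print(result)
```

Validating each reply against a fixed schema before aggregating is what lets the orchestrating script trust results it never reads in full.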
3. Dynamic Workflows vs Skills vs Subagents: What are the differences? — AlphaSignal AI
- Why read: Breaks down three core agent development patterns and when to use them.
- Summary: Subagents, skills, and dynamic workflows solve different problems in complex tasks. Subagents let the main model decide the next step but consume context tokens with every result. Skills provide reusable instructions for specific tasks while staying in the main loop. Dynamic workflows push orchestration into code, running stateful, parallel operations outside the chat context and returning final results. Use workflows for heavy, repeatable work. Save skills and subagents for tasks that fit comfortably in the model's context window.
4. I've been using state-of-the-art models to teach small models running... — Tomasz Tunguz
- Why read: A framework for using frontier models to train small, local models on personal workflows.
- Summary: Skill distillation uses frontier models to write atomic skill files and evaluations. Smaller, local models like Qwen or Gemma then execute those procedures. The knowledge transfers via markdown instructions instead of weight updates, meaning the small model only needs to follow steps instead of understanding the domain. This builds institutional memory to power fast, private personal agents.
5. Building self-improving tax agents with Codex | OpenAI — OpenAI
- Why read: A case study on building agents that use production traces to improve themselves.
- Summary: OpenAI engineers built a tax-preparation agent that refines itself using real-world signals. Instead of manual engineering updates, the system uses Codex to inspect edge cases and adjust prompts based on expert feedback. This loop pushed the percentage of returns hitting the 75% accuracy threshold from 25% at launch to 86% in six weeks. Pairing practitioner feedback with structured production traces creates viable self-improving agents for enterprise work.
6. Managing context in long-run agentic applications — Dominic Marks
- Why read: Solutions for maintaining coherence in multi-agent systems across hundreds of API calls.
- Summary: Long-running agents hit context limits and degrade over time. Slack engineering solved this with three context channels: a Director's Journal for working memory, a Critic's Review for annotated findings, and a Critic's Timeline for chronology. Tailoring the state view for each specialist prevents bloat while keeping agents anchored to team goals. This design avoids two failure modes: agents working in incoherent isolation, and the confirmation bias caused by oversharing context.
7. Production traffic from frontier models is a golden data asset — Viv
- Why read: How to use expensive frontier models to generate training data for cheaper models.
- Summary: Frontier models in production generate high-quality execution traces. Teams can mine and filter these traces to apply Supervised Fine Tuning and Knowledge Distillation to smaller, cheaper models. This creates a learning loop where new traces feed datasets and produce better student models. Online evaluators and bulk-trace agents identify exactly which data pushes small models to frontier-level performance on specific tasks.
8. Verifying Agentic Development at Scale — Ido Pesok
- Why read: How to implement end-to-end cloud testing for autonomous verification of AI code changes.
- Summary: Agents need to verify their own work before opening pull requests. Cognition runs fleets of Devin agents in parallel cloud VMs to click through applications and validate features like human testers. Early versions got lost in setup or over-tested unrelated areas, making guardrails necessary. Automating the verification loop prevents a backlog of unverified agent PRs and enables asynchronous engineering.
9. Building workflows for agents with Skills + Interpreters — Hunter Lovell
- Why read: A pattern for extending AI agent skills with embedded TypeScript interpreters.
- Summary: Agents excel at writing code to solve problems, but many tasks require following a strict procedure instead of inventing logic on the fly. "Interpreter skills" solve this by attaching an executable TypeScript module to a markdown skill description. The agent determines relevance and supplies inputs. A sandboxed runtime handles execution. This pairs LLM decision-making with deterministic software execution.
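The split in item 9 between a description the agent reads and a procedure the runtime executes can be sketched as follows. The article attaches TypeScript modules; Python is used here for illustration only. `InterpreterSkill`, `prorate_refund`, and `execute` are hypothetical names, and this sketch does not implement a sandbox.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class InterpreterSkill:
    name: str
    description: str              # markdown the agent reads to judge relevance
    run: Callable[[dict], dict]   # deterministic procedure the runtime executes


def prorate_refund(inputs: dict) -> dict:
    """Fixed business rule: refund the unused share of a billing period."""
    refund = inputs["price_cents"] * inputs["days_left"] // inputs["days_in_period"]
    return {"refund_cents": refund}


SKILLS = {
    "prorate_refund": InterpreterSkill(
        name="prorate_refund",
        description="Use when a customer cancels partway through a billing period.",
        run=prorate_refund,
    )
}


def execute(skill_name: str, inputs: dict) -> dict:
    """Runtime side: the agent picks the skill and inputs, the code does the rest."""
    return SKILLS[skill_name].run(inputs)


if __name__ == "__main__":
    print(execute("prorate_refund",
                  {"price_cents": 3000, "days_left": 10, "days_in_period": 30}))
```

The model only chooses `skill_name` and fills in `inputs`. The arithmetic itself never passes through the model, so the same inputs always give the same result.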
10. The Operator's Case against Founder Mode — Gokul Rajaram
- Why read: An argument against the "founder-genius" archetype in favor of operational velocity and scale.
- Summary: Dick Costolo's turnaround of Twitter shows that scaling organizations need a different playbook than "Founder Mode." Centralized decision-making creates bottlenecks. Leaders must push decisions down to Directly Responsible Individuals. Building a default-to-yes culture and cutting veto points masquerading as process restores velocity. The leader's job shifts from making all the calls to ensuring the entire org understands the strategic context.
11. The cardinal sin of platform building; the FDE wars intensify — Matt Slotnick
- Why read: A warning for incumbents rushing to declare themselves AI platforms.
- Summary: Exposing an API or a CLI for agents does not turn a point solution into a platform. True platforms provide more value to the ecosystem than they capture. Companies fail when they build platform abstractions before securing a killer app, optimizing for nonexistent usage patterns. Winning the AI transition means solving hard last-mile problems, and it forces SaaS incumbents to operate more like hyperscalers.
12. Hermes Harness Architecture — Aparna Dhinakaran
- Why read: Architecture breakdown of an open-source agent harness and its context management.
- Summary: The Hermes harness normalizes tool-call formats across model providers. It manages context via compression: an auxiliary model summarizes older turns while protecting the head and tail segments. Rather than endlessly rewriting one transcript, Hermes builds a parent-child lineage of sessions, keeping long conversations traceable. Separating tool registration from tool exposure lets developers manage safety and token costs on a per-run basis.
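The head-and-tail-protecting compression described in item 12 can be illustrated with a short sketch. This is not Hermes code: `compress_history`, its `head` and `tail` defaults, and `fake_summarizer` are assumptions, and the summarizer is a stub where Hermes would call an auxiliary model.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    head: int = 2,
    tail: int = 4,
) -> list[str]:
    """Replace the middle of a transcript with one summary turn.

    The first `head` and last `tail` turns are kept verbatim.
    """
    if len(turns) <= head + tail:
        return list(turns)  # nothing worth compressing yet
    split = len(turns) - tail
    middle = turns[head:split]
    return turns[:head] + [summarize(middle)] + turns[split:]


def fake_summarizer(middle: list[str]) -> str:
    """Stub for the auxiliary model; only reports how much was folded away."""
    return f"[summary of {len(middle)} earlier turns]"


if __name__ == "__main__":
    history = [f"turn {i}" for i in range(12)]
    for line in compress_history(history, fake_summarizer):
        print(line)
```

Keeping the head preserves the original instructions and goals, and keeping the tail preserves the most recent working state. Only the middle, which is least likely to be needed verbatim, is summarized.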
13. I read StoryScope, a paper on AI fiction and narrative... — Muratcan Koylan
- Why read: Outlines the statistical signatures that distinguish AI fiction from human writing.
- Summary: A study of 70,000 stories mapped the feature vectors separating human and AI narrative. AI models over-explain, default to tidy plots, and lean on embodied emotion. Humans use ambiguity and external references. Prompting for "human voice" achieves only surface-level imitation. Authentic AI personas require tuning decision-level structural preferences, like trade-off priorities and failure-mode awareness.
14. Measure Less to Learn More: Using Fewer, Higher-quality Metrics to Capture What Matters — Jake Mainwaring
- Why read: Discord's data team explains why reducing default metrics improves experiment outcomes.
- Summary: Scaling organizations bloat their metric lists from fear of missing insights. Tracking too many metrics triggers the multiple comparisons problem. This forces strict statistical corrections that reduce the odds of detecting true positive changes. Methods like the Benjamini-Hochberg correction control false discoveries but sacrifice recall. The fix is tracking fewer, high-quality metrics to maintain statistical power and cut noise.
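The power cost described in item 14 shows up directly in the Benjamini-Hochberg procedure. The sketch below is a plain-Python version of the standard step-up procedure. The p-values are made up for illustration and are not data from the Discord article.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, per input p-value, whether it is declared a discovery.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= alpha * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical experiment: one real effect with p = 0.02.
    print(benjamini_hochberg([0.02]))
    # Same effect, tracked alongside nine noisy metrics with p = 0.5.
    print(benjamini_hochberg([0.02] + [0.5] * 9))
```

With one metric, p = 0.02 clears the 0.05 threshold. With ten metrics, the smallest p-value must clear 0.05 × 1 / 10 = 0.005, so the same effect is no longer detected. That is the loss of recall the article attributes to long default metric lists.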
15. 📢 New Preprint — Farima Fatahi
- Why read: A method for using LLMs to diagnose and fix their own prompt failures.
- Summary: Manual prompt engineering is slow and scales poorly. Reflective Prompt Tuning uses an optimizer LLM to evaluate a target model's failures, cluster error modes, and generate diagnostic reports. The optimizer then applies targeted patches, such as adding verification steps for math errors or adjusting relation handling for multi-hop questions. This framework treats confidence calibration as an optimization signal alongside task accuracy.
Themes from yesterday
- Agent Orchestration and Scaling: The shift from optimizing single LLM calls to building dynamic workflows, managing the orchestration tax, and maintaining context across agent fleets.
- Model Distillation and Training: Using frontier models to train smaller, local models through procedural skills and production traces.
- Organizational Velocity: Reevaluating operational practices, like cutting default metrics and pushing decisions down to Directly Responsible Individuals, to maintain speed as systems scale.