Lessons from Charlie O’Neill

Charlie O’Neill is an AI researcher, Head of Model Training at Baseten, and a John Monash Scholar at Oxford. His work on LLM optimization and the STILL model pulls academic theory into the reality of production machine learning. This profile covers his approach to scaling compute and the open-source infrastructure required to actually deploy models.

Part 1: The Core of Inference Engineering

  1. On inference as the frontier: "Inference is becoming just as important as the actual research itself because it is so tightly coupled with it. A model is only as good as our ability to run it." — Source: [Baseten Podcast]
  2. On the invisible bottleneck: "The bottleneck in modern AI isn't finding data; it is serving the weights quickly enough to make the interactions feel human." — Source: [Towards Data Science]
  3. On specialization: "You don't have to have a PhD to become an inference engineer, but you do need an obsessive focus on hardware utilization." — Source: [Baseten Podcast]
  4. On serving architecture: "Designing a great model architecture is half the battle. The other half is figuring out how to serve it without bankrupting the company." — Source: [Charles O'Neill Personal Site]
  5. On compute constraints: "We are shifting from compute-bound training to memory-bound inference, which changes the entire economics of deploying AI." — Source: [NeurIPS Presentation]
  6. On latency vs. throughput: "Optimizing for throughput makes the spreadsheet look good, but optimizing for latency makes the product feel good. You have to balance both." — Source: [Towards Data Science]
  7. On the democratization of speed: "Fast inference shouldn't be locked behind closed APIs. The open-source community needs tools that match proprietary speeds." — Source: [Baseten Blog]
  8. On the hardware lottery: "We often build models that happen to fit well on NVIDIA architectures, which means our research is implicitly constrained by hardware roadmaps." — Source: [Charles O'Neill Personal Site]
  9. On decoupling systems: "We must decouple the science of making models smart from the engineering of making them fast." — Source: [Baseten Podcast]
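
The "compute-bound training to memory-bound inference" point in quote 5 can be illustrated with a roofline-style calculation. The sketch below is a simplification, not a description of any Baseten system: it counts only weight traffic (KV cache and activation reads are ignored), and the accelerator figures (1 PFLOP/s peak, 3 TB/s bandwidth) and the 7B-parameter model are hypothetical numbers chosen for illustration.

```python
# Roofline-style check: is a decode step limited by memory bandwidth or by compute?
# All hardware numbers below are hypothetical, chosen only for illustration.

def arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one decode step.

    Each weight is read once per step and used in one multiply-add
    (2 FLOPs) for every sequence in the batch. KV cache and activation
    traffic are ignored to keep the sketch small.
    """
    return 2.0 * batch_size / bytes_per_param


def is_memory_bound(batch_size: int, peak_flops: float, mem_bandwidth: float,
                    bytes_per_param: float = 2.0) -> bool:
    """True when the step's intensity is below the hardware ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    return arithmetic_intensity(batch_size, bytes_per_param) < ridge_point


def min_step_seconds(num_params: float, mem_bandwidth: float,
                     bytes_per_param: float = 2.0) -> float:
    """Lower bound on one decode step: time to stream every weight once."""
    return num_params * bytes_per_param / mem_bandwidth


PEAK_FLOPS = 1.0e15      # hypothetical accelerator: 1 PFLOP/s
MEM_BANDWIDTH = 3.0e12   # hypothetical accelerator: 3 TB/s

if __name__ == "__main__":
    for batch in (1, 64, 512):
        print(batch, "memory-bound:", is_memory_bound(batch, PEAK_FLOPS, MEM_BANDWIDTH))
    print("min step (s) for a 7B fp16 model:", min_step_seconds(7e9, MEM_BANDWIDTH))
```

Under these assumptions the ridge point is about 333 FLOPs per byte, so small decode batches sit well inside the memory-bound region, and only large batches reach the compute-bound side. That is one way to read the latency-versus-throughput tension in quote 6.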

Part 2: The KV Cache and Memory Optimization

  1. On long-context models: "Infinite context is a marketing term until you solve the memory footprint of the KV cache." — Source: [NeurIPS Presentation]
  2. On state management: "Language models aren't stateless if you want them to be fast. Managing that state efficiently is the hardest problem in serving." — Source: [Towards Data Science]
  3. On the STILL model: "Compressing the KV cache by 8x isn't just an optimization; it fundamentally changes what developers can build with long-running agents." — Source: [The Neuron AI Newsletter]
  4. On memory walls: "We hit the memory wall much faster than the compute wall. Memory bandwidth is the silent killer of inference speed." — Source: [Baseten Blog]
  5. On attention mechanisms: "Standard attention scales quadratically, but the memory to store those attention keys scales linearly and becomes unmanageable." — Source: [Towards Data Science]
  6. On compaction strategies: "Not every token in the prompt is equally important. By dropping the least informative tokens from the cache, we save memory without losing accuracy." — Source: [Charles O'Neill Personal Site]
  7. On batching requests: "Continuous batching is a requirement, not a feature, if you want to run high-throughput LLM applications." — Source: [Baseten Blog]
  8. On caching trade-offs: "Every time you trade compute for memory, you have to measure the impact on the tail latency of the end-user." — Source: [Towards Data Science]
  9. On context limits: "The actual limit on context isn't the model's ability to reason, it is the GPU's VRAM screaming for mercy." — Source: [NeurIPS Presentation]
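
The memory claims in this part can be made concrete with the usual KV cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, fp16) that does not come from the source. The `keep_top_tokens` helper is a generic illustration of score-based token dropping as described in quote 6, and the importance scores are made up. It is not the STILL method, which the quotes do not describe.

```python
# KV cache sizing and a toy eviction policy.
# The model shape and the importance scores are hypothetical.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for every cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def keep_top_tokens(scores, budget):
    """Return the positions of the `budget` highest-scoring tokens.

    Positions are returned in their original order so the surviving
    cache entries stay in sequence order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    GIB = 2 ** 30
    full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=131072)
    print("full cache (GiB):", full / GIB)
    print("after 8x compression (GiB):", full / 8 / GIB)
    print("kept positions:", keep_top_tokens([0.1, 0.9, 0.05, 0.7, 0.3], budget=3))
```

With this configuration each token costs 128 KiB, so a single 131,072-token sequence needs 16 GiB of cache before any model weights are counted. An 8x reduction brings that to 2 GiB, which is the difference between one long-running session per GPU and several.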

Part 3: Deploying AI to Production

  1. On the reality of production: "What works in a Jupyter notebook rarely survives first contact with production traffic." — Source: [Baseten Blog]
  2. On reliability: "A language model that is correct 99 percent of the time is still a liability if it fails catastrophically the other 1 percent." — Source: [Harvey AI Engineering Blog]
  3. On infrastructure: "AI infrastructure is just regular infrastructure that is exponentially more expensive and less forgiving." — Source: [Charles O'Neill Personal Site]
  4. On testing generative systems: "You can't write a unit test for vibes. You have to test distributions, latencies, and edge cases continuously." — Source: [Towards Data Science]
  5. On scaling challenges: "Scaling care beats scaling compute. If the user experience degrades under load, the model's intelligence doesn't matter." — Source: [Tuckwell Talks Podcast]
  6. On vendor lock-in: "Relying purely on proprietary APIs means outsourcing your core product latency to a third party." — Source: [Baseten Podcast]
  7. On monitoring: "Observability in AI is still primitive. We track token generation, but we struggle to track reasoning degradation over time." — Source: [Towards Data Science]
  8. On cold starts: "A 10-second cold start on a GPU instance is an eternity in software engineering." — Source: [Baseten Blog]
  9. On continuous deployment: "Deploying a new model weight is easy. Deploying a new model weight without regressing on specific user intents is incredibly hard." — Source: [Charles O'Neill Personal Site]
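
Quote 4's advice to test latencies as distributions rather than single values can be sketched as a percentile check against a latency budget. This is a minimal example, not a tool named in the source: the function names, the nearest-rank percentile method, and the budget values are all assumptions made for illustration.

```python
# Tail-latency check using nearest-rank percentiles.
# Function names and budgets are hypothetical.

def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def latency_report(samples_ms, p50_budget_ms: float, p99_budget_ms: float) -> dict:
    """Summarize median and tail latency and flag budget violations."""
    p50 = percentile(samples_ms, 50)
    p99 = percentile(samples_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "within_budget": p50 <= p50_budget_ms and p99 <= p99_budget_ms,
    }


if __name__ == "__main__":
    # Synthetic latencies: 1 ms through 100 ms, one sample each.
    samples = list(range(1, 101))
    print(latency_report(samples, p50_budget_ms=60, p99_budget_ms=95))
```

A check like this can run on every deployment candidate. The median can pass while the 99th percentile fails, which is the kind of regression a single average would hide.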

Part 4: Open Source vs. Closed Ecosystems

  1. On Big Token: "The dominance of 'Big Token' providers is a centralization risk. Open-source inference strategies are the only counterbalance." — Source: [The Neuron AI Newsletter]
  2. On Llama: "Open weights like Llama force the entire industry to compete on infrastructure and efficiency rather than just data moats." — Source: [Baseten Blog]
  3. On community momentum: "The open-source community will always win on efficiency because a thousand hobbyists optimizing CUDA kernels will outpace any single R&D team." — Source: [Towards Data Science]
  4. On transparency: "If you are going to use a model for high-stakes decisions, you must be able to inspect its weights and its training lineage." — Source: [Charles O'Neill Personal Site]
  5. On innovation: "The best inference optimizations—flash attention, PagedAttention—all came from open collaboration." — Source: [NeurIPS Presentation]
  6. On cost barriers: "Open source lowers the barrier to entry, but inference costs keep it artificially high. We have to solve the hosting side." — Source: [Baseten Podcast]
  7. On fine-tuning: "Owning your weights means owning your edge cases. You can't adequately fine-tune a model if you don't control the hosting." — Source: [Towards Data Science]
  8. On fragmentation: "The downside of open source is tooling fragmentation. We need unified standards for running models across different hardware." — Source: [Charles O'Neill Personal Site]
  9. On proprietary advantage: "Closed labs have a compute advantage, but the half-life of that advantage is shrinking every month." — Source: [Baseten Blog]
  10. On the ecosystem: "A healthy AI ecosystem requires a spectrum from massive closed frontier models to highly optimized open-source edge models." — Source: [Tuckwell Talks Podcast]

Part 5: The Economics of Compute

  1. On token economics: "We are moving from paying for software licenses to paying for tokens. The unit economics of the internet are shifting." — Source: [Towards Data Science]
  2. On GPU utilization: "An idle GPU is the most expensive mistake a startup can make." — Source: [Baseten Blog]
  3. On margin compression: "If your entire business is wrapping a prompt around an API, your margins belong to the API provider." — Source: [Charles O'Neill Personal Site]
  4. On the ROI of AI: "Companies are eager to deploy LLMs, but very few have calculated the actual return on investment per token." — Source: [Harvey AI Engineering Blog]
  5. On edge computing: "Pushing inference to the edge isn't just about latency. It is about shifting the compute cost from the provider to the user." — Source: [NeurIPS Presentation]
  6. On optimization value: "A 20 percent speedup in inference can be the difference between a profitable product and a discontinued experiment." — Source: [Baseten Podcast]
  7. On scaling laws and cost: "Scaling laws apply to compute costs as much as they apply to intelligence. At a certain point, the cost curve becomes unsustainable." — Source: [Towards Data Science]
  8. On hardware bottlenecks: "The entire industry is constrained by TSMC yields and NVIDIA pricing. That is a dangerous bottleneck for global innovation." — Source: [Charles O'Neill Personal Site]
  9. On pricing models: "Per-token pricing will eventually give way to per-task pricing. Users don't care about tokens; they care about solved problems." — Source: [The Neuron AI Newsletter]
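
The remarks on idle GPUs (quote 2) and on the value of a 20 percent speedup (quote 6) come down to simple unit economics. The sketch below assumes a hypothetical GPU price of $3.60 per hour and a hypothetical throughput of 1,000 tokens per second. Neither figure comes from the source, and real serving costs include items this ignores, such as networking, storage, and engineering time.

```python
# Cost per million generated tokens as a function of price, throughput,
# and utilization. All input figures are hypothetical.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """USD per one million tokens for a single GPU.

    `utilization` is the fraction of each hour spent serving requests,
    so an idle GPU still costs money but produces no tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    baseline = cost_per_million_tokens(3.60, 1000)
    half_idle = cost_per_million_tokens(3.60, 1000, utilization=0.5)
    faster = cost_per_million_tokens(3.60, 1200)  # 20 percent more throughput
    print(f"baseline:  ${baseline:.2f} per 1M tokens")
    print(f"50% idle:  ${half_idle:.2f} per 1M tokens")
    print(f"20% faster: ${faster:.2f} per 1M tokens")
```

In this model, halving utilization doubles the cost per token, and a 20 percent throughput gain cuts it by about 17 percent. On thin margins, either change can decide whether a product is viable.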

Part 6: Cross-Domain Applications

  1. On legal AI: "In legal tech, hallucination isn't an annoyance; it is a direct failure. The constraints shape the engineering." — Source: [Harvey AI Engineering Blog]
  2. On medical diagnostics: "Applying AI to ophthalmology requires immense rigor. You aren't just categorizing pixels; you are impacting patient outcomes." — Source: [ARVO Conference Proceedings]
  3. On domain expertise: "An LLM is a reasoning engine, but it needs to be grounded in deep, highly specific domain expertise to be useful." — Source: [Towards Data Science]
  4. On AstroLLaMA: "Specialized models for fields like astronomy prove that a smaller, highly focused model can outperform a generalized behemoth." — Source: [NeurIPS Presentation]
  5. On retrieval systems: "RAG is a bridge. It allows generalized models to act as domain experts by giving them an open-book test." — Source: [Baseten Blog]
  6. On interdisciplinary research: "The best AI applications happen at the intersection of machine learning and a seemingly unrelated field." — Source: [Charles O'Neill Personal Site]
  7. On custom infrastructure: "Different domains require different inference profiles. A coding assistant needs low latency; a legal analyzer needs high context." — Source: [Harvey AI Engineering Blog]
  8. On evaluating models: "General benchmarks are becoming useless for specific domains. We need evaluations crafted by doctors and lawyers." — Source: [Towards Data Science]
  9. On safety in medicine: "AI in healthcare isn't about replacing the doctor. It is about providing the doctor with a perfectly attentive secondary reasoning system." — Source: [ARVO Conference Proceedings]
  10. On professional trust: "Professionals will only use models they can trust. Trust is built through consistent, interpretable results." — Source: [Charles O'Neill Personal Site]

Part 7: The Research Mindset and Academia

  1. On shifting focus: "Pivoting from economics and math to AI wasn't a rejection of my past study, but a realization of where the leverage was." — Source: [Tuckwell Talks Podcast]
  2. On academic pacing: "Academia operates on a timeline of years; the AI industry operates on a timeline of weeks. Bridging that gap is challenging." — Source: [Charles O'Neill Personal Site]
  3. On picking problems: "A researcher's most valuable asset is their taste in problems. Work on the bottleneck, not the distraction." — Source: [Towards Data Science]
  4. On inherited knowledge: "There is a strong case for inheriting nothing and building everything yourself to truly understand the underlying systems." — Source: [Charles O'Neill Personal Site]
  5. On publishing: "A paper is just a marketing document for an idea. The real test is whether the community adopts the code." — Source: [NeurIPS Presentation]
  6. On learning: "The field moves too fast for textbooks. You have to learn by reading preprints and reading source code." — Source: [Baseten Podcast]
  7. On the Tuckwell experience: "Being in a refreshingly different community forces you out of your specific academic silo." — Source: [Tuckwell Talks Podcast]
  8. On mentorship: "Good mentors don't give you answers; they point out the flaws in your questions." — Source: [Charles O'Neill Personal Site]
  9. On Oxford: "Studying at Oxford provides a historical grounding that acts as an anchor when you are working in a fast-moving, hype-driven field." — Source: [Tuckwell Talks Podcast]

Part 8: The Human Element in Artificial Intelligence

  1. On AI intent: "LLMs are really good at k-order thinking, but you still need to tell a language model you want to cure cancer before it can help you cure cancer." — Source: [Charles O'Neill Personal Site]
  2. On developer experience: "If the API is painful to use, no one will use it. Inference is ultimately a UX problem." — Source: [Baseten Blog]
  3. On user psychology: "Users are incredibly sensitive to latency. A 200ms delay changes the interaction from a conversation to a transaction." — Source: [Towards Data Science]
  4. On agency: "Models don't have agency; they have capabilities. The agency comes entirely from the infrastructure we wrap around them." — Source: [Charles O'Neill Personal Site]
  5. On societal impact: "Optimizing inference isn't just about saving money. It is about making intelligence cheap enough to be accessible to everyone." — Source: [NeurIPS Presentation]
  6. On AI safety: "Safety isn't an abstract philosophical problem. It is a measurable engineering constraint that must be built into the serving layer." — Source: [Harvey AI Engineering Blog]
  7. On bias: "The biases of a model are a direct reflection of its training data, but how those biases are exposed depends entirely on the deployment system." — Source: [Towards Data Science]
  8. On alignment: "Alignment is easier when models are transparent. Obfuscating the weights behind an API makes true alignment research nearly impossible." — Source: [Charles O'Neill Personal Site]
  9. On building tools: "We aren't building synthetic humans; we are building cognitive tools. Treating them as tools demystifies the hype." — Source: [Baseten Podcast]
  10. On the future: "The true impact of AI will be felt when the inference is so fast and so cheap that we stop noticing it is there." — Source: [The Neuron AI Newsletter]