Lessons from Chip Huyen

Chip Huyen is an engineer, author, and educator known for mapping out the practical realities of deploying machine learning systems into production. She wrote the widely read book Designing Machine Learning Systems and has become a leading voice in treating AI not as magic, but as a rigorous software engineering discipline. This collection organizes her insights across MLOps, system complexity, large language models, and technical hiring.

Part 1: The Engineering First Mindset

On the Engineering Premium: "If you have to choose between engineering and ML, choose engineering. It's easier for great engineers to pick up ML knowledge, but it's a lot harder for ML experts to become great engineers." — Source: [huyenchip.com]
On Building Tools: "If you become an engineer who builds great tools for ML, I'd forever be in your debt." — Source: [huyenchip.com]
On Data Science vs. ML Engineering: "The goal of data science is to generate business insights, whereas the goal of ML engineering is to turn data into products." — Source: [huyenchip.com]
On the New Stack: "I think for anyone who wants to build solutions to real-world problems, it's very likely that that person would need both traditional ML engineering and AI engineering." — Source: [AI Engineering]
On End-to-End Ownership: "Doesn't matter what you build, as long as you do it end to end: starting from an idea and deploying it so that a friend can use it." — Source: [huyenchip.com]
On Coding in ML: "In traditional software engineering, coding is the hard part, whereas in ML, coding is a small part of the battle." — Source: [Designing Machine Learning Systems]
On Simplicity: "Simplicity serves three purposes. First, simpler models are easier to deploy... Second, starting with something simple and adding more complex components step-by-step makes it easier to understand your model and debug it." — Source: [Designing Machine Learning Systems]
On Baselines: "The simplest model serves as a baseline to which you can compare your more complex models." — Source: [Designing Machine Learning Systems]

Part 2: The Reality of Production and Deployment

On the True Beginning: "Deployment changes the problem, it does not end it." — Source: [Designing Machine Learning Systems]
On Environment Shifts: "I used to think that an ML project is done after the model is deployed... Moving the model from the development environment to the production environment creates a whole new host of problems." — Source: [huyenchip.com]
On Research vs. Production: "In research, the priority is fast training and high throughput, while in production the priority is fast inference and low latency." — Source: [CS 329S Lecture Notes]
On Operationalizing: "Ops in MLOps comes from DevOps, short for Developments and Operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it." — Source: [huyenchip.com]
On Monitoring: "Automated monitoring and retraining are as critical as the initial development of an ML model." — Source: [Substack]
On Deployment Frequency: "Because data can change quickly, ML applications need faster development and deployment cycles. In many cases, you might have to deploy a new model every night." — Source: [huyenchip.com]
On System Degradation: "Without an intentional design to hold all the components together, a system will become a technical liability, prone to errors and quick to fall apart." — Source: [Designing Machine Learning Systems]
On Tooling Evolution: "The tutorial approach has been tremendously successful in getting models off the ground. However, the resulting systems tend to go outdated quickly because the tooling space is being innovated, business requirements change, and data distributions constantly shift." — Source: [CS 329S Lecture Notes]
On ML as a Magic Bullet: "Machine learning is not a magic tool that can solve all problems. Even for problems that ML can solve, ML solutions might not be the optimal solutions." — Source: [Designing Machine Learning Systems]

Part 3: Data Quality Over Algorithms

On Data Dominance: "For ML, applications developed with the most/best data win. Instead of focusing on improving deep learning algorithms, most companies will focus on improving their data." — Source: [huyenchip.com]
On Off-the-Shelf Models: "Most companies won't focus on developing ML models but will use an off-the-shelf model... applications developed with the most/best data win." — Source: [huyenchip.com]
On Model Assumptions: "It's important to think about what assumptions your model makes and whether our data satisfies those assumptions." — Source: [Designing Machine Learning Systems]
On Data Staring: "Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning." — Source: [huyenchip.com]
On the Value of 15 Minutes: "In every project I’ve worked on, staring at data for just 15 minutes usually gives me some insight that could save me hours of headaches." — Source: [huyenchip.com]
On The Data Catch-22: "Deep learning needs data, and to gather data, you might first need users. To avoid the catch-22, you might want to launch your product without deep learning to gather user data to train your system." — Source: [huyenchip.com]
On Data Dependency: "ML systems are unique because they are data dependent, and data varies wildly from one use case to the next." — Source: [Designing Machine Learning Systems]
On Problem Formulation: "As machine learning is driven more by data than by algorithms, for every formulation of the problem that you propose, you should also tell your interviewer what kind of data and how much data you need." — Source: [Introduction to Machine Learning Interviews]
On Distributed Training: "The most common parallelization method is data parallelism: you split your data on multiple machines, train your model on all of them, and accumulate gradients." — Source: [Designing Machine Learning Systems]
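
The data-parallel recipe in the last quote (split the data, compute gradients on every shard, accumulate) can be sketched in a few lines. This is a minimal illustration, not code from the book: the workers are simulated sequentially in one process, the model is a one-parameter linear regression, and the names shard, gradient_sum, and data_parallel_step are hypothetical.

```python
# Simulated data parallelism for a 1-D linear model y ~ w * x.
# Assumption: "workers" run one after another here; in a real system each
# shard would sit on its own machine and the sums would be exchanged over a network.

def shard(data, num_workers):
    """Split data into num_workers roughly equal contiguous shards."""
    size = (len(data) + num_workers - 1) // num_workers
    return [data[i:i + size] for i in range(0, len(data), size)]

def gradient_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2."""
    return sum(2 * (w * x - y) * x for x, y in batch)

def data_parallel_step(w, data, num_workers, lr):
    """One gradient-descent step with gradients accumulated across shards."""
    shards = shard(data, num_workers)
    # Every worker uses the same weights but sees only its own shard.
    partial_sums = [gradient_sum(w, s) for s in shards]
    # Accumulate: add the partial sums, then divide by the total example count.
    grad = sum(partial_sums) / len(data)
    return w - lr * grad

# Toy data generated from y = 3x, so the true weight is 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
print(round(w, 6))
```

Because the partial sums are added before dividing by the total number of examples, the accumulated gradient matches what a single machine would compute on the full dataset, which is why the number of workers does not change the result here.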

Part 4: System Complexity and Design

On System Uniqueness: "ML systems are both complex and unique. They are complex because they consist of many different components... and involve many different stakeholders." — Source: [Designing Machine Learning Systems]
On Systems Thinking: "Success in ML requires systems thinking, not just model building." — Source: [Designing Machine Learning Systems]
On the "It Depends" Answer: "My short answer to all these questions about ML design is always: 'It depends.' My long answers often involve hours of discussion to understand where the questioner comes from, what they're actually trying to achieve." — Source: [CS 329S Lecture Notes]
On Defining System Design: "Machine learning systems design is the process of defining the software architecture, infrastructure, algorithms, and data for a machine learning system to satisfy specified requirements." — Source: [CS 329S Lecture Notes]
On Using Heuristics: "Before wielding complex neural networks, try one of the many popular non-neural network approaches... If a simple heuristic can predict the next app accurately 70% of the time, any model you build has to outperform it significantly to justify the added complexity." — Source: [huyenchip.com]
On Starting Before ML: "Before ML: Use heuristics and rule-based systems." — Source: [Designing Machine Learning Systems]
On the Cost of Deep Learning: "Most real world problems might not even need deep learning... Deep learning models are often expensive to train and hard to explain. Most of the time, in production, they are only useful if their performance is unquestionably superior." — Source: [huyenchip.com]
On Finding the Fit: "There are many possible solutions to any given problem. The goal of system design is to find the one that best fits the requirements." — Source: [Designing Machine Learning Systems]
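
The heuristics-first advice in this part (see "On Using Heuristics") can be made concrete with a small sketch. It assumes a "most frequently used app" heuristic and a made-up usage log; the function names and the min_gain threshold are hypothetical, chosen only to show how a baseline sets the bar a model has to clear.

```python
from collections import Counter

def most_frequent_baseline(history):
    """Heuristic: always predict the app that appears most often in the history."""
    return Counter(history).most_common(1)[0][0]

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual next app."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

def justifies_complexity(model_acc, baseline_acc, min_gain=0.05):
    """Assumed rule of thumb: a model must beat the baseline by at least min_gain."""
    return model_acc - baseline_acc >= min_gain

# Made-up usage logs for illustration only.
train = ["mail", "chat", "mail", "maps", "mail", "chat", "mail"]
test = ["mail", "mail", "chat", "mail", "maps",
        "mail", "mail", "chat", "mail", "mail"]

guess = most_frequent_baseline(train)
baseline_acc = accuracy([guess] * len(test), test)
print(guess, baseline_acc)
```

With this toy log the heuristic is right 7 times out of 10, so a candidate model scoring 0.72 would not clear the assumed 0.05 margin, while one scoring 0.85 would.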

Part 5: Navigating LLMs and AI Agents

On the Demo-to-Production Gap: "It’s easy to make something cool with LLMs, but very hard to make something production-ready with them." — Source: [huyenchip.com]
On Prompt Rigor: "LLM limitations are exacerbated by a lack of engineering rigor in prompt engineering." — Source: [huyenchip.com]
On Shifting Constraints: "If you can describe a software, then AI can build it for you. The constraint moves upstream. The critical question no longer asks how to build, but what to build." — Source: [Shift Mag]
On Overfitting in LLMs: "Overfitting is still a thing in the LLM world... If our changes are based on evaluation results, there’s a risk of over-optimizing for our current set." — Source: [AI Engineering]
On Foundational Change: "The capability to build is commoditized, but the vision of what to build remains the bottleneck." — Source: [huyenchip.com]
On the Future of Motivation: "In an environment where execution becomes cheap, intrinsic motivation gains weight. If replication becomes trivial, the advantage may belong to those who decide what deserves to exist." — Source: [huyenchip.com]
On Managing Prompts: Prompts should be versioned and managed with the same strict controls as code, rather than treated as casual text inputs. — Source: [AI Engineering]
On Building LLM Apps: Developing applications with foundation models demands a new stack—one that bridges natural language ambiguity with programmatic determinism. — Source: [AI Engineering]
On AI Hype: "I believe that the AI hype is real and at some point, it has to calm down... However, I don't believe that ML will disappear. There might be fewer companies that can afford to do ML research, but there will be no shortage of companies that need tooling to bring ML into their production." — Source: [huyenchip.com]
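
The point under "On Managing Prompts" about versioning prompts like code can be illustrated with a small in-memory registry. This is one possible sketch, not a method described in AI Engineering: PromptVersion and PromptRegistry are hypothetical names, and the version ID is assumed to be a truncated SHA-256 hash of the template text.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str

    @property
    def version_id(self) -> str:
        """Content-derived ID: the same template text always yields the same ID."""
        digest = hashlib.sha256(self.template.encode("utf-8")).hexdigest()
        return digest[:12]

    def render(self, **values) -> str:
        """Fill the template's {placeholders} with the given values."""
        return self.template.format(**values)

class PromptRegistry:
    """Keeps every registered version so an older prompt can be retrieved by ID."""

    def __init__(self):
        self._versions = {}  # (name, version_id) -> PromptVersion
        self._latest = {}    # name -> version_id of the most recent registration

    def register(self, prompt: PromptVersion) -> str:
        self._versions[(prompt.name, prompt.version_id)] = prompt
        self._latest[prompt.name] = prompt.version_id
        return prompt.version_id

    def get(self, name: str, version_id: str = None) -> PromptVersion:
        return self._versions[(name, version_id or self._latest[name])]

registry = PromptRegistry()
v1 = registry.register(PromptVersion("summarize", "Summarize: {text}"))
v2 = registry.register(PromptVersion("summarize", "Summarize in one sentence: {text}"))
print(registry.get("summarize").render(text="MLOps notes"))
```

Logging the version ID next to each model response makes it possible to trace an output back to the exact prompt text that produced it, and to return to an earlier version if a change hurts quality.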

Part 6: Evaluation and Metrics

On LLM Evaluation: "LLM evaluation is very, very hard." — Source: [huyenchip.com]
On Vibe Checks: "Vibing your way to production is the fastest path to architectural debt and user churn." — Source: [AI Engineering]
On Measurement: "You can't improve what you can't measure, and you can't measure what you don't understand." — Source: [huyenchip.com]
On Business vs. ML Metrics: "Most businesses don't care about ML metrics unless they can move business metrics. Therefore, if an ML system is built for a business, it must be motivated by business objectives, which need to be translated into ML objectives." — Source: [Designing Machine Learning Systems]
On AI Judges: "A common pitfall is forgoing human evaluation to rely entirely on AI judges." — Source: [AI Engineering]
On Evaluating the Evaluator: "AI judges must be evaluated and iterated over time, just like all other AI applications." — Source: [AI Engineering]
On Human Insight: "The teams with the best products I’ve seen all have human evaluation to supplement their automated evaluation." — Source: [AI Engineering]
On Performance vs. Business: "Model performance is not the same as business performance. Be clear about the problem you're trying to solve." — Source: [huyenchip.com]
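
The advice under "On Evaluating the Evaluator" suggests checking an AI judge against human labels before trusting it. A minimal sketch of that check follows, assuming pass/fail verdicts and using raw agreement plus Cohen's kappa (a standard chance-corrected agreement statistic, swapped in here for illustration and not named in the source). The labels are made up.

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of items where the AI judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)

def cohens_kappa(judge, human):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up verdicts on the same eight responses.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

print(agreement_rate(judge, human), round(cohens_kappa(judge, human), 3))
```

In this toy case the judge agrees with the human on 6 of 8 items, but the kappa is noticeably lower than the raw rate because much of that agreement could occur by chance. Rerunning such a check whenever the judge prompt or model changes is one way to "iterate over time".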

Part 7: Real-Time Machine Learning

On the Shift to Real-Time: "Machine learning is going real-time. There are two levels: Level 1: Your ML system makes predictions in real-time (online predictions). Level 2: Your system can incorporate new data and update your model in real-time (continual learning)." — Source: [huyenchip.com]
On Batch vs. Streaming: "Switching from batch processing to stream processing requires a mental shift. With batch processing, you know when a job is done. With stream processing, it's never done." — Source: [huyenchip.com]
On Online Evaluation: "Online training demands online evaluation, but serving a model that hasn't been tested to users sounds like a recipe for disaster. Many companies do it anyway." — Source: [huyenchip.com]
On Evaluating Online: "Online evaluation matters as well as the offline evaluation when the goal is real-world performance." — Source: [Designing Machine Learning Systems]
On Latency Priorities: Real-time ML requires architectural decisions that prioritize the speed of inference over the throughput of training. — Source: [Designing Machine Learning Systems]
On Handling State: Managing stateful features in a streaming context introduces complex challenges around time-window aggregation and late-arriving data. — Source: [Designing Machine Learning Systems]
On Freshness: The value of a prediction often decays rapidly over time, making data freshness a critical metric for real-time systems. — Source: [Designing Machine Learning Systems]
On Feedback Loops: Real-time continuous learning systems must be designed to avoid feedback loops where the model's predictions inadvertently train future versions of the model to reinforce its own biases. — Source: [Designing Machine Learning Systems]
On Infrastructure Costs: The leap from batch to real-time inference is not just an algorithm change; it requires a fundamental and often costly overhaul of data infrastructure. — Source: [Designing Machine Learning Systems]
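
The challenge named under "On Handling State" (time-window aggregation with late-arriving data) can be sketched with a tumbling-window counter. This is an illustrative toy, not an implementation from the book: TumblingWindowCounter is a hypothetical class, event times are assumed to be non-negative integers, and lateness is handled with a simple watermark equal to the largest event time seen minus an allowed lateness.

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events per fixed-size window, tolerating bounded lateness."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.dropped = 0                      # events that arrived too late
        self.max_event_time = 0

    def add(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        # Watermark: we assume no event older than this will still be accepted.
        watermark = self.max_event_time - self.allowed_lateness
        start = (event_time // self.window_size) * self.window_size
        if start + self.window_size <= watermark:
            self.dropped += 1  # the event's window has already been finalized
        else:
            self.open_windows[start] += 1
        # Finalize every open window whose end is at or before the watermark.
        for s in sorted(self.open_windows):
            if s + self.window_size <= watermark:
                self.closed[s] = self.open_windows.pop(s)

counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
# Arrival order differs from event-time order: 8 and 3 arrive late.
for t in [1, 4, 12, 8, 17, 3, 26]:
    counter.add(t)
print(counter.closed, dict(counter.open_windows), counter.dropped)
```

Here the event at time 8 arrives after time 12 but is still counted because its window has not been finalized, while the event at time 3 arrives after its window closed and is dropped. Choosing the allowed lateness is a trade-off between result completeness and how long results are delayed.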

Part 8: Careers, Interviews, and Open Source

On Learning by Teaching: "I taught TensorFlow not because I was an expert, but because I wanted to become one. Nothing shows you what you don't know faster than teaching it." — Source: [huyenchip.com]
On Passion: "Don't think of passion as something you find. I think passion is something you cultivate." — Source: [huyenchip.com]
On Valuing Time: "No matter how well your job pays, you're still selling your time for less than what it's worth. Preserve your time to work for yourself." — Source: [huyenchip.com]
On Making Connections: "Get to know people for who they are, not just their jobs. Make friends, not connections." — Source: [huyenchip.com]
On Bad Interviews: "The majority of questions are bad. It means the company hasn't spent a lot of time thinking about their hiring pipeline, which leads to poor hiring decisions, which in turn can ruin the company." — Source: [Introduction to Machine Learning Interviews]
On Interview Strategy: "For hiring managers, it's crucial to assign each interviewer a set of skills to evaluate, so that different interviewers ask different questions and that collectively, they get a holistic picture." — Source: [Introduction to Machine Learning Interviews]
On Clarifying Questions: When faced with an ambiguous question, candidates should ask: "To better answer your question, is it to evaluate my understanding of [X]?" — Source: [Introduction to Machine Learning Interviews]
On the Reality of Open Source: "OSS means neither non-profit nor free. OSS maintenance is time-consuming and expensive." — Source: [huyenchip.com]
On Open Source Ethics: "One reason for OSS is transparency, collaboration, flexibility, and it just seems like the moral thing to do. Clients might not want to use a new tool without being able to see its source code." — Source: [huyenchip.com]
