Lessons from Adrian Cockcroft

Adrian Cockcroft shaped modern cloud infrastructure at Sun Microsystems, Netflix, and AWS by proving microservices and chaos engineering could work at massive scale. This collection gathers his practical observations on building fast, resilient systems that account for human behavior.

Part 1: Performance and System Health

  1. On Performance Management: "Manage Performance, or It Will Manage You!" — Source: [UAKOM]
  2. On Wait States: "I don't bother with waiting for I/O. It's too misleading on multiprocessor systems." — Source: [UAKOM]
  3. On Idle Processing: "usr + sys + idle = 100. Idle time is merely a redundant remainder; focus on the work being done." — Source: [UAKOM]
  4. On Diagnosis Automation: "If the system runs out of swap space, output: +++OUT OF CHEESE ERROR, PLEASE RELOAD UNIVERSE AND RESTART+++" — Source: [Medium]
  5. On Disk Constraints: "Monitor disks based on service time rather than just utilization. Over 30ms is a bottleneck; under 10ms is excellent." — Source: [UAKOM]
  6. On Memory Famine: "Watch the page scan rate. If the kernel is scanning for free pages, you have a memory shortage, regardless of reported free RAM." — Source: [UAKOM]
  7. On Correctness: "If it doesn't work, it doesn't matter how fast it doesn't work." — Source: [UAKOM]
  8. On Premature Tuning: "The first rule of performance tuning is: don't do it. The second rule (for experts only) is: don't do it yet." — Source: [UAKOM]
  9. On Network Overload: "If collisions are more than 5% of output packets, your network is overloaded." — Source: [UAKOM]
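
The numeric rules of thumb quoted in this part (disk service time, page scan rate, collision ratio) can be written as simple checks. The following is a minimal sketch, not Cockcroft's own tooling: the function names are hypothetical, the "acceptable" label for the 10-30ms band is an assumption (the quote only names the two extremes), and the inputs are assumed to come from whatever monitoring source is available.

```python
# Illustrative checks based on the thresholds quoted in Part 1.
# All function names and the "acceptable" label are assumptions for this sketch.


def classify_disk(service_time_ms: float) -> str:
    """Classify a disk by service time rather than utilization."""
    if service_time_ms > 30.0:
        return "bottleneck"   # quoted: over 30ms is a bottleneck
    if service_time_ms < 10.0:
        return "excellent"    # quoted: under 10ms is excellent
    return "acceptable"       # assumed label for the unnamed middle band


def memory_shortage(page_scan_rate: float) -> bool:
    """Any page scanning is treated as a shortage, regardless of free RAM."""
    return page_scan_rate > 0


def network_overloaded(collisions: int, output_packets: int) -> bool:
    """Flag overload when collisions exceed 5% of output packets."""
    if output_packets <= 0:
        return False          # no traffic, so no basis for a ratio
    return collisions / output_packets > 0.05


if __name__ == "__main__":
    print(classify_disk(35.0))
    print(memory_shortage(120.0))
    print(network_overloaded(collisions=60, output_packets=1000))
```

Keeping each rule as its own small function makes the thresholds easy to adjust, since the quoted values reflect the hardware of their time and may need revisiting on modern systems.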

Part 2: The Evolution of Cloud Architecture

  1. On Legacy Constraints: "Incremental change may be good theory, but in practice you have to have a big enough stick to hit everybody with to make everything move at once." — Source: [a16z]
  2. On Existence Proofs: "With technology, there are lots of ideas running around your head, and it's very rare that someone can capture the thought process; working prototypes provide that existence proof." — Source: [The Serverless Edge]
  3. On Component Selection: "Wardley Mapping helps identify the better bits of Lego you should use to build systems rather than reinventing low-level infrastructure." — Source: [Planview]
  4. On Technological Commoditization: "You can finish building your app in less time than another team can decide on how to configure Kubernetes." — Source: [Flow Framework]
  5. On Abstraction: "Life is complicated… but we use simple abstractions to deal with it." — Source: [Planview]
  6. On Situational Awareness: "Understand where your components sit on the evolution axis so you don't waste time polishing the brass on a sinking ship." — Source: [Planview]
  7. On Undifferentiated Lifting: "Hardware provisioning is undifferentiated heavy lifting – replace it with IaaS." — Source: [InfoQ]
  8. On Building Value: "Spend your engineering hours only on the code that makes your product unique to your customers." — Source: [InfoQ]
  9. On Leveraging Scale: "Go and get your open source from the web scale companies that are investing and have the best people working for them solving distributed system problems." — Source: [InfoQ]

Part 3: Microservices and Decentralization

  1. On Definition: "A loosely coupled service-oriented architecture with bounded contexts." — Source: [F5]
  2. On False Microservices: "If you have a bunch of small, specialized services but still have to update them together, they're not microservices." — Source: [F5]
  3. On Loose Coupling: "Loosely coupled means that you can update the services independently; updating one service doesn't require changing any other services." — Source: [F5]
  4. On Emergent Design: "We didn't design the architecture top-down; we discovered and formalized the architecture that emerged as engineers built what they needed." — Source: [Medium]
  5. On Monolith Decay: "Most monoliths are tangled balls of mud... as you get a large team of developers on a monolithic app, it gets harder and harder to build." — Source: [Medium]
  6. On Business Priorities: "Microservices are arguably more about business concerns than about architectural best practices." — Source: [Medium]
  7. On Continuous Delivery: "When you look at what it takes to do continuous delivery, you just end up with something that looks like microservices because you have to be able to break things into small chunks." — Source: [Medium]
  8. On API Exclusivity: "We don't have configuration, it's all program, it's all API driven." — Source: [YouTube]
  9. On Security Zoning: "When you break things into microservices, you've got the ability to have some parts of your system be low-security risk and other parts be high-security risk." — Source: [a16z]
  10. On Progression: "The natural evolution of architecture goes from monoliths to microservices to functions." — Source: [Amazon]

Part 4: Speed and Open Source Strategy

  1. On the Primary Metric: "Speed wins." — Source: [The New Stack]
  2. On Executive Desires: "There isn't any executive that wants his company to be slower at product development." — Source: [The New Stack]
  3. On Synchronization: "If you look at how you really speed things up, you have to take the hand-offs out of the process." — Source: [AWS]
  4. On Technical Debt: "Project-based teams accrue technical debt; product or platform-based teams pay it down." — Source: [Not a Factory Anymore]
  5. On Proprietary vs Open: "What's become apparent in the last few years is that the open source products are the most scalable and dependable products." — Source: [The New Stack]
  6. On Malleability: "The open source products were much more malleable. Netflix put people on the project and changed it until it did what we wanted." — Source: [The New Stack]
  7. On Community Value: "At Netflix, we found that contributing to and using open source projects allowed us to innovate faster and build on the collective knowledge of the community." — Source: [Medium]
  8. On Developer Geography: "The best things you can get now are free. And they are built by communities and the best engineers don't work at those enterprise companies. They work for end users." — Source: [InfoQ]
  9. On Recruitment Strategy: "I was actually spending more and more time outbound at Netflix promoting the open source projects... helping people externally creates a virtuous cycle for hiring." — Source: [InfoQ]
  10. On Ecosystem Strategy: "If you need a feature, build it into the open source project so the community maintains it for you." — Source: [InfoQ]

Part 5: Chaos Engineering and Resilience

  1. On the Core Concept: "Chaos Engineering is an experiment to ensure that the impact of failures is mitigated." — Source: [Spreaker]
  2. On Failure Assumptions: "Failures are going to happen all the time. They do happen all the time. It's when your mitigation doesn't work that you have an outage." — Source: [YouTube]
  3. On Preventative Thinking: "You can't legislate against failure; focus on fast detection and response." — Source: [Gremlin]
  4. On the Switch Principle: "If you're trying to switch between two things, the switch itself needs to be an order of magnitude more reliable than the things it's switching between." — Source: [YouTube]
  5. On Redundancy Risks: "Adding redundancy can actually decrease reliability if the failover mechanism is fragile." — Source: [YouTube]
  6. On Testing Pipelines: "Move away from annual disaster recovery tests toward continuously tested resilience integrated into the delivery pipeline." — Source: [Medium]
  7. On Human Chaos: "Chaos engineering is not limited to infrastructure or software but can also be applied to humans during Game Days." — Source: [Spreaker]
  8. On Breaking Systems: "The goal of chaos engineering is not to break things, but to build confidence in how your systems actually work." — Source: [Antstack]
  9. On Root Causes: "The last strand that breaks is not the cause of a failure!" — Source: [AWS]
  10. On Recovery Cascades: "Work amplification and hysteresis can cause a system to get stuck in a bad state where it won't recover even after the initial trigger is removed." — Source: [Gremlin]
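
The switch principle and the redundancy risk quoted above can be illustrated with a little availability arithmetic. This is a simplified sketch under stated assumptions, not a model taken from the source: two identical replicas fail independently, the system is up if at least one replica is up, and the switch is a series dependency that must also be up. The function name and the sample availabilities are hypothetical.

```python
# Simplified availability model (assumption: independent failures,
# switch in series with a pair of redundant replicas).


def failover_availability(replica: float, switch: float) -> float:
    """Availability of two redundant replicas behind a failover switch."""
    both_replicas_down = (1.0 - replica) ** 2
    return switch * (1.0 - both_replicas_down)


if __name__ == "__main__":
    single = 0.99  # one replica on its own, no switch

    # Switch only as reliable as the replicas: redundancy makes things worse.
    fragile = failover_availability(replica=0.99, switch=0.99)
    print(fragile < single)

    # Switch with ten times less downtime than a replica: redundancy pays off.
    robust = failover_availability(replica=0.99, switch=0.999)
    print(robust > single)
```

Under these assumptions, the combined availability can never exceed that of the switch itself, which is one way to see why the switch has to be far more reliable than the things it switches between.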

Part 6: Developer Responsibility and NoOps

  1. On Operations Redesign: "DevOps is a reorg." — Source: [AWS Static]
  2. On Platform Automation: "There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done." — Source: [PCWorld]
  3. On Terminology: "I think that's different to the way most DevOps places run... so it needs its own name, NoOps." — Source: [PCWorld]
  4. On Production Ownership: "You build it, you run it." — Source: [USENIX]
  5. On Training Focus: "We taught our developers to operate rather than teaching our operators to develop." — Source: [YouTube]
  6. On Friction Removal: "The goal is to remove the meetings between developers and operations, allowing for extreme speed and agility." — Source: [IAM On Demand]
  7. On Developer Gatekeeping: "Get out of the way of your developers or lose them to someone who will." — Source: [UAKOM]
  8. On Observability: "Observability is the foundation of resilience; you cannot run a chaos experiment if you cannot see the steady state of your system." — Source: [Gremlin]
  9. On Self-Service: "Instead of filing tickets, developers must use automated tools to deploy code and manage capacity in seconds." — Source: [USENIX]

Part 7: Organizational Culture and Trust

  1. On Incentives: "You get the culture you pay for." — Source: [Driftboat Dave]
  2. On Hiring vs Environment: "We hired them from you, and got out of their way." — Source: [Not a Factory Anymore]
  3. On Default Culture: "If you are building a startup, be extremely intentional about the culture... because culture will just happen to you." — Source: [Wamda]
  4. On Organizational Design: "High trust, low process, no hand-offs." — Source: [The Burning Monk]
  5. On Team Boundaries: "High trust and high cohesion within the team, and low trust across the teams." — Source: [The Burning Monk]
  6. On Decision Making: "Don't move information to authority, move authority to the information." — Source: [Wamda]
  7. On Leadership Approach: "Inspire them with a purpose, and then get out of their way." — Source: [Wamda]
  8. On Data over Opinion: "The plural of anecdote is not data." — Source: [Amazon]
  9. On Availability Theater: "Availability Theater is the practice of building complex failover systems that are never actually tested." — Source: [UAKOM]

Part 8: Serverless and Sustainability

  1. On the Serverless Reset: "Serverless app development isn't an incremental change, it's a complete reset in terms of the speed and cost and scalability of what can be built." — Source: [Cloudfront]
  2. On the True Impact of Cloud: "If the way we all architect makes it easier to deploy less total capacity, then we are helping to reduce the carbon footprint of the computing industry." — Source: [Medium]
  3. On Optimization Fallacies: "Optimizing your own workloads by moving them around without reducing total capacity just increases the carbon footprint over-all for the computing industry." — Source: [Medium]
  4. On Supply Chain Accountability: "It's not enough to have accurate measurement of carbon emissions. Organizations must also ensure their supply chain adheres to the same standards." — Source: [Barry O'Reilly]
  5. On Hidden Carbon: "Just because the carbon emissions aren't charged to your account, it doesn't make them go away." — Source: [Medium]
  6. On Industry Cooperation: "Working together is the only way to fight the climate crisis." — Source: [Amazon]
  7. On Latency in Reporting: "The challenge is that accurate data isn't available immediately; cloud providers currently only provide monthly carbon data, which isn't useful for workload optimization." — Source: [Medium]
  8. On Efficient Defaults: "Minimize your total footprint where possible, and use the spot market price as a guide for when to run workloads." — Source: [Medium]
  9. On the Cost of Action: "The first thing to understand is that doing things consumes time and energy and has a carbon footprint of its own." — Source: [Medium]