Visual summary of operating lessons from Jonathan Ragan-Kelley.

Lessons from Jonathan Ragan-Kelley

Jonathan Ragan-Kelley is an MIT computer scientist who builds compilers and domain-specific languages for high-performance systems. He co-created Halide, a programming language that advanced computational photography by decoupling algorithms from hardware schedules. The insights below outline his approach to squeezing maximum speed from modern processors as hardware scaling stalls.

Part 1: The Twilight of Moore's Law

  1. On Hardware Divergence: "Modern computer hardware has increasingly diverged from the traditional programming models we rely on." — Source: [PLDI 2024 Keynote]
  2. On the End of Free Speed: "As Moore’s Law slows down, performance gains can no longer be passively inherited from faster clocks." — Source: [MIT CSAIL Research Profile]
  3. On Hardware Specialization: "The industry response to physical scaling limits is extreme specialization, requiring software to adapt to highly specific accelerator designs." — Source: [MIT News]
  4. On Programmer Burden: "Without new abstractions, the burden of mapping complex algorithms onto fragmented hardware falls entirely on the application programmer." — Source: [Exo Language Manifesto]
  5. On The Memory Wall: "Computation is often cheap; the true bottleneck in modern architectures is almost entirely dominated by data movement and memory bandwidth." — Source: [PLDI 2024 Keynote]
  6. On Giving Hardware What It Wants: "Fast code is fundamentally about understanding the specific hunger of the target hardware and feeding it data in exactly the right shape." — Source: [PLDI 2024 Keynote]
  7. On the Twilight Era: "We are operating in a transitional phase where the expectations set by decades of CPU scaling must be completely recalibrated for spatial and heterogeneous systems." — Source: [MIT EECS Faculty Page]
  8. On Algorithmic Stagnation: "Focusing solely on algorithmic complexity ignores the reality that execution efficiency on physical silicon dictates practical application limits." — Source: [Halide ACM Paper]
  9. On Tooling Deficits: "Our current compiler infrastructure was largely designed for an era of monolithic, predictable processors, leaving them ill-equipped for today's accelerators." — Source: [Exo Project Introduction]

Part 2: The Philosophy of Halide

  1. On Separation of Concerns: "The core idea of Halide is explicitly separating the definition of the algorithm from its execution schedule." — Source: [Halide Language Specification]
  2. On the Definition of a Schedule: "A schedule is the set of decisions about loop nesting, vectorization, parallelization, and memory allocation." — Source: [Halide Language Specification]
  3. On Cross-Platform Portability: "By decoupling the algorithm, you can write the math once and then write entirely different schedules to target a mobile CPU versus a desktop GPU." — Source: [Halide Project Details]
  4. On Avoiding Rewrites: "Before Halide, optimizing an image processing pipeline meant interleaving the math so deeply with hardware intrinsics that the original logic was unreadable and unportable." — Source: [SIGGRAPH Award Profile]
  5. On Domain-Specific Languages: "General-purpose languages lack the domain knowledge required to make aggressive, safe optimizations for grid-based visual computing." — Source: [Halide ACM Paper]
  6. On Pipeline Complexity: "Real-world photography pipelines contain dozens of stages; optimizing them by hand is a combinatorial nightmare without proper abstractions." — Source: [SIGGRAPH Award Profile]
  7. On Search Spaces: "Exposing the schedule as a distinct construct turns optimization into a searchable space that autoschedulers can navigate algorithmically." — Source: [Halide Language Specification]
  8. On Performance Scaling: "The order and granularity of execution and placement of data can easily change performance by an order of magnitude, even if the algorithm remains exactly the same." — Source: Systems Lunch
  9. On Modularity: "Composability is lost when performance optimizations are baked directly into the functions; separating the schedule restores modularity." — Source: [Halide Project Details]
  10. On Industry Adoption: "The success of Halide in systems like Android and Photoshop proves that developers want explicit control over hardware mapping if given the right tools." — Source: [Adobe Research Collaboration]

Part 3: Locality, Parallelism, and Redundancy

  1. On the Fundamental Tension: "There is an inherent trade-off in visual computing between maximizing parallelism, preserving data locality, and minimizing redundant computation." — Source: Systems Lunch
  2. On Recompute vs. Store: "Sometimes it is strictly faster to recalculate a value on the fly rather than fetching it from main memory, breaking traditional instruction-counting metrics." — Source: [Halide ACM Paper]
  3. On Tile Sizing: "Choosing the right tile size is a balancing act between keeping data in the fastest cache layer and providing enough work to saturate the vector units." — Source: Systems Lunch
  4. On Producer-Consumer Relationships: "Fusing loops can improve locality by keeping intermediate data in registers, but aggressive fusion often forces redundant work at the boundaries." — Source: [Halide Language Specification]
  5. On Granularity: "The granularity of parallel execution dictates whether the overhead of dispatching work outweighs the benefits of concurrent processing." — Source: Systems Lunch
  6. On Memory Hierarchy: "Data organization must account for the specific capacities of L1, L2, and L3 caches to prevent pipeline stalls." — Source: [MIT EECS Faculty Page]
  7. On Synchronization Costs: "Barrier synchronization across large core counts can easily erase the latency benefits of parallelization if memory access patterns are jagged." — Source: [Halide ACM Paper]
  8. On Data Layouts: "Whether data is stored as structure-of-arrays or array-of-structures dictates whether the hardware vector lanes will be fully utilized." — Source: [Halide Language Specification]
  9. On Bottleneck Shifting: "Optimizing computation frequently just shifts the bottleneck to memory; optimizing memory shifts it to control flow. The schedule must balance all three." — Source: Systems Lunch

Part 4: Exo and User-Schedulable Languages

  1. On Compiler Heuristics: "Relying on opaque compiler heuristics for high-performance kernels is a losing battle; the compiler simply cannot guess the optimal layout for every novel architecture." — Source: [Exo Project Introduction]
  2. On User-Schedulable Languages: "A User-Schedulable Language gives programmers safe, high-level handles to explicitly control how the program is optimized." — Source: [Exo Project Introduction]
  3. On Exocompilation: "Exocompilation externalizes target-specific code generation to user-level code rather than burying it inside the compiler." — Source: [Exo Language Manifesto]
  4. On Safe Rewriting: "Exo allows engineers to apply scheduling operations with the guarantee that the semantics of the original program will not be altered." — Source: [Exo Project Introduction]
  5. On Assembly Replacements: "Writing kernels in a USL aims to match or exceed the performance of hand-tuned assembly while retaining mathematical readability." — Source: [Exo Project Introduction]
  6. On Hardware Accelerators: "Accelerators require highly specific instruction sequences; Exo allows users to define custom scheduling operations as external libraries to map to these instructions." — Source: [Exo Language Manifesto]
  7. On Reusability: "By treating the schedule as a distinct, library-level construct, scheduling strategies can be shared and reused across different projects." — Source: [Exo Project Introduction]
  8. On Performance Engineering: "The future of performance engineering requires giving the developer direct, ergonomic access to the underlying machine model." — Source: [HYTRADBOI 2025 Presentation]
  9. On Reducing Boilerplate: "Exo 2 significantly reduces the amount of code required to achieve state-of-the-art performance by enabling customizable scheduling primitives." — Source: [Exo Project Introduction]

Part 5: Visual Computing and Graphics

  1. On Computational Photography: "The smartphone camera is no longer a physical lens and sensor alone; it is an incredibly deep software pipeline." — Source: [SIGGRAPH Award Profile]
  2. On Real-Time Constraints: "Visual computing imposes strict latency budgets; missing a frame deadline breaks the illusion of continuity." — Source: [MIT CSAIL Research Profile]
  3. On Image Dimensions: "Scaling algorithms from 1080p to 4K or 8K requires entirely different memory management strategies, not just faster processors." — Source: [Halide ACM Paper]
  4. On Stencil Computations: "Many image processing tasks are fundamentally stencil operations, where a pixel's value depends on a sliding window of its neighbors." — Source: [Halide Language Specification]
  5. On Rendering Pipelines: "Modern rendering relies on massive, fine-grained data parallelism that traditional CPU scheduling cannot exploit efficiently." — Source: [SIGGRAPH Award Profile]
  6. On Visual Quality: "Performance optimization in graphics is not just about speed; it directly enables higher visual fidelity and more complex light simulations within the same time budget." — Source: [SIGGRAPH Award Profile]
  7. On Multi-Stage Processing: "A photo taken on a modern phone runs through noise reduction, tone mapping, and edge enhancement before the user ever sees it." — Source: [SIGGRAPH Award Profile]
  8. On Edge Cases: "Handling boundary conditions in image tiles efficiently is a major source of complexity in hand-written C++ graphics code." — Source: [Halide Language Specification]
  9. On Domain Expertise: "Graphics engineers know their algorithms best, but are often forced to become hardware experts; languages should let them focus on the pixels." — Source: [MIT CSAIL Research Profile]

Part 6: Verification and Tensor Programs

  1. On Tensor Computation: "Tensors are the fundamental data structure of modern machine learning, and optimizing their flow is the primary challenge in AI systems." — Source: [MIT CSAIL Research Profile]
  2. On Formal Verification: "When aggressive optimizations are applied to tensor pipelines, we need formal, mechanized proofs to ensure we haven't broken the math." — Source: [ATL Project Summary]
  3. On Functional Languages: "A Tensor Language (ATL) utilizes functional paradigms to make equivalence checking of aggressive loop transformations mathematically tractable." — Source: [ATL Project Summary]
  4. On Silent Failures: "A bug in a hardware scheduler might not crash the program; it might just subtly degrade the accuracy of a neural network." — Source: [ATL Project Summary]
  5. On Automated Proofs: "By tying scheduling rewrites to formally verified rules, we can optimize deep learning kernels without risking silent data corruption." — Source: [ATL Project Summary]
  6. On Program Synthesis: "We can increasingly use synthesis to auto-generate verified schedules that map high-level tensor math to specific accelerator backends." — Source: [MIT CSAIL Research Profile]
  7. On Optimization Confidence: "Engineers hesitate to use complex loop transformations because they are hard to debug; guaranteed correctness removes that hesitation." — Source: [ATL Project Summary]
  8. On Matrix Multiplication: "Even an operation as simple as matrix multiplication requires incredibly complex tiling and layout transformations to hit peak hardware utilization." — Source: [MIT CSAIL Research Profile]
  9. On Scalable Verification: "The challenge of verification is scaling it up so that it can handle the massive dimensionality of modern transformer models." — Source: [ATL Project Summary]
  10. On Trusting Compilers: "If we want developers to adopt highly experimental accelerator architectures, the compilation stack must guarantee absolute fidelity to the source algorithm." — Source: [MIT CSAIL Research Profile]

Part 7: Sparsity and Spatial Architectures

  1. On Sparse Data: "Many physical simulations and machine learning models are fundamentally sparse; computing on empty space is a waste of energy." — Source: [Taichi Project Notes]
  2. On Irregular Computation: "Traditional hardware loves dense grids, but reality often requires irregular data structures that break predictable memory access patterns." — Source: [Taichi Project Notes]
  3. On Spatial Architectures: "Spatial architectures map logic directly to arrays of processing elements, requiring compilers to think about physical distance on the chip." — Source: [MIT CSAIL Research Profile]
  4. On Locality in Sparsity: "Optimizing sparse computations is primarily an exercise in grouping non-zero elements to restore some degree of spatial locality." — Source: [Taichi Project Notes]
  5. On the Taichi Project: "Systems like Taichi demonstrate that high-level abstractions can successfully orchestrate spatially sparse data structures on GPUs." — Source: [Taichi Project Notes]
  6. On Mesh Computing: "Scientific computing on physical meshes demands automated differentiation systems that understand how to traverse graph-like structures efficiently." — Source: [MIT CSAIL Research Profile]
  7. On Hardware Layouts: "In a spatial accelerator, moving data from one side of the chip to the other has a tangible latency cost that the compiler must model." — Source: [MIT CSAIL Research Profile]
  8. On Sparse Tensors: "Developing mechanized algebras for sparse tensors allows the compiler to mathematically prove the correctness of complex sparsity-preserving transformations." — Source: [Taichi Project Notes]
  9. On Compression Overheads: "The cost of decoding a sparse format must not exceed the compute savings gained by skipping the zeros." — Source: [Taichi Project Notes]

Part 8: The Future of High-Performance Systems

  1. On Organization as a First-Class Concern: "The organization of computation—where data lives and when it is processed—must be elevated to a first-class programmatic concern." — Source: Systems Lunch
  2. On the Role of the Programmer: "Programmers should focus on the 'what' of the algorithm and the high-level 'how' of the schedule, leaving the rote translation to the tooling." — Source: [PLDI 2024 Keynote]
  3. On AI-Assisted Compilation: "Machine learning is becoming a viable tool for navigating the massive search spaces of possible schedules." — Source: [MIT CSAIL Research Profile]
  4. On Heterogeneous Computing: "The future is not a faster CPU, but a chaotic assembly of CPUs, GPUs, NPUs, and custom ASICs working in fragile concert." — Source: [MIT EECS Faculty Page]
  5. On Breaking the Monolith: "Monolithic compilation is giving way to modular systems where experts can plug in domain-specific optimization rules." — Source: [Exo Project Introduction]
  6. On Cross-Disciplinary Teams: "High-performance systems now require deep collaboration between algorithm designers, compiler writers, and silicon architects." — Source: [MIT CSAIL Research Profile]
  7. On Language Evolution: "Languages that do not provide mechanisms to reason about memory layout will eventually be relegated to control-flow orchestration." — Source: [HYTRADBOI 2025 Presentation]
  8. On System Brittleness: "Hand-tuning C++ for a specific GPU creates brittle software that must be entirely rewritten when the next hardware generation arrives." — Source: [Halide Language Specification]
  9. On Democratic Performance: "Providing high-level scheduling languages democratizes performance, allowing generalists to achieve speeds previously reserved for assembly ninjas." — Source: [PLDI 2024 Keynote]
  10. On the Limits of Abstraction: "Abstractions must leak eventually; the goal of a good systems language is to control exactly how and where that leakage occurs." — Source: [MIT CSAIL Research Profile]