Lessons from Jonathan Ragan-Kelley

Jonathan Ragan-Kelley is an MIT computer scientist who builds compilers and domain-specific languages for high-performance systems. He co-created Halide, a programming language that advanced computational photography by decoupling algorithms from hardware schedules. The insights below outline his approach to squeezing maximum speed from modern processors as hardware scaling stalls.

Part 1: The Twilight of Moore's Law

On Hardware Divergence: "Modern computer hardware has increasingly diverged from the traditional programming models we rely on." — Source: [PLDI 2024 Keynote]
On the End of Free Speed: "As Moore’s Law slows down, performance gains can no longer be passively inherited from faster clocks." — Source: [MIT CSAIL Research Profile]
On Hardware Specialization: "The industry response to physical scaling limits is extreme specialization, requiring software to adapt to highly specific accelerator designs." — Source: [MIT News]
On Programmer Burden: "Without new abstractions, the burden of mapping complex algorithms onto fragmented hardware falls entirely on the application programmer." — Source: [Exo Language Manifesto]
On The Memory Wall: "Computation is often cheap; the true bottleneck in modern architectures is almost entirely dominated by data movement and memory bandwidth." — Source: [PLDI 2024 Keynote]
On Giving Hardware What It Wants: "Fast code is fundamentally about understanding the specific hunger of the target hardware and feeding it data in exactly the right shape." — Source: [PLDI 2024 Keynote]
On the Twilight Era: "We are operating in a transitional phase where the expectations set by decades of CPU scaling must be completely recalibrated for spatial and heterogeneous systems." — Source: [MIT EECS Faculty Page]
On Algorithmic Stagnation: "Focusing solely on algorithmic complexity ignores the reality that execution efficiency on physical silicon dictates practical application limits." — Source: [Halide ACM Paper]
On Tooling Deficits: "Our current compiler infrastructure was largely designed for an era of monolithic, predictable processors, leaving them ill-equipped for today's accelerators." — Source: [Exo Project Introduction]

Part 2: The Philosophy of Halide

On Separation of Concerns: "The core idea of Halide is explicitly separating the definition of the algorithm from its execution schedule." — Source: [Halide Language Specification]
On the Definition of a Schedule: "A schedule is the set of decisions about loop nesting, vectorization, parallelization, and memory allocation." — Source: [Halide Language Specification]
On Cross-Platform Portability: "By decoupling the algorithm, you can write the math once and then write entirely different schedules to target a mobile CPU versus a desktop GPU." — Source: [Halide Project Details]
On Avoiding Rewrites: "Before Halide, optimizing an image processing pipeline meant interleaving the math so deeply with hardware intrinsics that the original logic was unreadable and unportable." — Source: [SIGGRAPH Award Profile]
On Domain-Specific Languages: "General-purpose languages lack the domain knowledge required to make aggressive, safe optimizations for grid-based visual computing." — Source: [Halide ACM Paper]
On Pipeline Complexity: "Real-world photography pipelines contain dozens of stages; optimizing them by hand is a combinatorial nightmare without proper abstractions." — Source: [SIGGRAPH Award Profile]
On Search Spaces: "Exposing the schedule as a distinct construct turns optimization into a searchable space that autoschedulers can navigate algorithmically." — Source: [Halide Language Specification]
On Performance Scaling: "The order and granularity of execution and placement of data can easily change performance by an order of magnitude, even if the algorithm remains exactly the same." — Source: Systems Lunch
On Modularity: "Composability is lost when performance optimizations are baked directly into the functions; separating the schedule restores modularity." — Source: [Halide Project Details]
On Industry Adoption: "The success of Halide in systems like Android and Photoshop proves that developers want explicit control over hardware mapping if given the right tools." — Source: [Adobe Research Collaboration]

Part 3: Locality, Parallelism, and Redundancy

On the Fundamental Tension: "There is an inherent trade-off in visual computing between maximizing parallelism, preserving data locality, and minimizing redundant computation." — Source: Systems Lunch
On Recompute vs. Store: "Sometimes it is strictly faster to recalculate a value on the fly rather than fetching it from main memory, breaking traditional instruction-counting metrics." — Source: [Halide ACM Paper]
On Tile Sizing: "Choosing the right tile size is a balancing act between keeping data in the fastest cache layer and providing enough work to saturate the vector units." — Source: Systems Lunch
On Producer-Consumer Relationships: "Fusing loops can improve locality by keeping intermediate data in registers, but aggressive fusion often forces redundant work at the boundaries." — Source: [Halide Language Specification]
On Granularity: "The granularity of parallel execution dictates whether the overhead of dispatching work outweighs the benefits of concurrent processing." — Source: Systems Lunch
On Memory Hierarchy: "Data organization must account for the specific capacities of L1, L2, and L3 caches to prevent pipeline stalls." — Source: [MIT EECS Faculty Page]
On Synchronization Costs: "Barrier synchronization across large core counts can easily erase the latency benefits of parallelization if memory access patterns are jagged." — Source: [Halide ACM Paper]
On Data Layouts: "Whether data is stored as structure-of-arrays or array-of-structures dictates whether the hardware vector lanes will be fully utilized." — Source: [Halide Language Specification]
On Bottleneck Shifting: "Optimizing computation frequently just shifts the bottleneck to memory; optimizing memory shifts it to control flow. The schedule must balance all three." — Source: Systems Lunch

Part 4: Exo and User-Schedulable Languages

On Compiler Heuristics: "Relying on opaque compiler heuristics for high-performance kernels is a losing battle; the compiler simply cannot guess the optimal layout for every novel architecture." — Source: [Exo Project Introduction]
On User-Schedulable Languages: "A User-Schedulable Language gives programmers safe, high-level handles to explicitly control how the program is optimized." — Source: [Exo Project Introduction]
On Exocompilation: "Exocompilation externalizes target-specific code generation to user-level code rather than burying it inside the compiler." — Source: [Exo Language Manifesto]
On Safe Rewriting: "Exo allows engineers to apply scheduling operations with the guarantee that the semantics of the original program will not be altered." — Source: [Exo Project Introduction]
On Assembly Replacements: "Writing kernels in a USL aims to match or exceed the performance of hand-tuned assembly while retaining mathematical readability." — Source: [Exo Project Introduction]
On Hardware Accelerators: "Accelerators require highly specific instruction sequences; Exo allows users to define custom scheduling operations as external libraries to map to these instructions." — Source: [Exo Language Manifesto]
On Reusability: "By treating the schedule as a distinct, library-level construct, scheduling strategies can be shared and reused across different projects." — Source: [Exo Project Introduction]
On Performance Engineering: "The future of performance engineering requires giving the developer direct, ergonomic access to the underlying machine model." — Source: [HYTRADBOI 2025 Presentation]
On Reducing Boilerplate: "Exo 2 significantly reduces the amount of code required to achieve state-of-the-art performance by enabling customizable scheduling primitives." — Source: [Exo Project Introduction]

Part 5: Visual Computing and Graphics

On Computational Photography: "The smartphone camera is no longer a physical lens and sensor alone; it is an incredibly deep software pipeline." — Source: [SIGGRAPH Award Profile]
On Real-Time Constraints: "Visual computing imposes strict latency budgets; missing a frame deadline breaks the illusion of continuity." — Source: [MIT CSAIL Research Profile]
On Image Dimensions: "Scaling algorithms from 1080p to 4K or 8K requires entirely different memory management strategies, not just faster processors." — Source: [Halide ACM Paper]
On Stencil Computations: "Many image processing tasks are fundamentally stencil operations, where a pixel's value depends on a sliding window of its neighbors." — Source: [Halide Language Specification]
On Rendering Pipelines: "Modern rendering relies on massive, fine-grained data parallelism that traditional CPU scheduling cannot exploit efficiently." — Source: [SIGGRAPH Award Profile]
On Visual Quality: "Performance optimization in graphics is not just about speed; it directly enables higher visual fidelity and more complex light simulations within the same time budget." — Source: [SIGGRAPH Award Profile]
On Multi-Stage Processing: "A photo taken on a modern phone runs through noise reduction, tone mapping, and edge enhancement before the user ever sees it." — Source: [SIGGRAPH Award Profile]
On Edge Cases: "Handling boundary conditions in image tiles efficiently is a major source of complexity in hand-written C++ graphics code." — Source: [Halide Language Specification]
On Domain Expertise: "Graphics engineers know their algorithms best, but are often forced to become hardware experts; languages should let them focus on the pixels." — Source: [MIT CSAIL Research Profile]

Part 6: Verification and Tensor Programs

On Tensor Computation: "Tensors are the fundamental data structure of modern machine learning, and optimizing their flow is the primary challenge in AI systems." — Source: [MIT CSAIL Research Profile]
On Formal Verification: "When aggressive optimizations are applied to tensor pipelines, we need formal, mechanized proofs to ensure we haven't broken the math." — Source: [ATL Project Summary]
On Functional Languages: "A Tensor Language (ATL) utilizes functional paradigms to make equivalence checking of aggressive loop transformations mathematically tractable." — Source: [ATL Project Summary]
On Silent Failures: "A bug in a hardware scheduler might not crash the program; it might just subtly degrade the accuracy of a neural network." — Source: [ATL Project Summary]
On Automated Proofs: "By tying scheduling rewrites to formally verified rules, we can optimize deep learning kernels without risking silent data corruption." — Source: [ATL Project Summary]
On Program Synthesis: "We can increasingly use synthesis to auto-generate verified schedules that map high-level tensor math to specific accelerator backends." — Source: [MIT CSAIL Research Profile]
On Optimization Confidence: "Engineers hesitate to use complex loop transformations because they are hard to debug; guaranteed correctness removes that hesitation." — Source: [ATL Project Summary]
On Matrix Multiplication: "Even an operation as simple as matrix multiplication requires incredibly complex tiling and layout transformations to hit peak hardware utilization." — Source: [MIT CSAIL Research Profile]
On Scalable Verification: "The challenge of verification is scaling it up so that it can handle the massive dimensionality of modern transformer models." — Source: [ATL Project Summary]
On Trusting Compilers: "If we want developers to adopt highly experimental accelerator architectures, the compilation stack must guarantee absolute fidelity to the source algorithm." — Source: [MIT CSAIL Research Profile]

Part 7: Sparsity and Spatial Architectures

On Sparse Data: "Many physical simulations and machine learning models are fundamentally sparse; computing on empty space is a waste of energy." — Source: [Taichi Project Notes]
On Irregular Computation: "Traditional hardware loves dense grids, but reality often requires irregular data structures that break predictable memory access patterns." — Source: [Taichi Project Notes]
On Spatial Architectures: "Spatial architectures map logic directly to arrays of processing elements, requiring compilers to think about physical distance on the chip." — Source: [MIT CSAIL Research Profile]
On Locality in Sparsity: "Optimizing sparse computations is primarily an exercise in grouping non-zero elements to restore some degree of spatial locality." — Source: [Taichi Project Notes]
On the Taichi Project: "Systems like Taichi demonstrate that high-level abstractions can successfully orchestrate spatially sparse data structures on GPUs." — Source: [Taichi Project Notes]
On Mesh Computing: "Scientific computing on physical meshes demands automated differentiation systems that understand how to traverse graph-like structures efficiently." — Source: [MIT CSAIL Research Profile]
On Hardware Layouts: "In a spatial accelerator, moving data from one side of the chip to the other has a tangible latency cost that the compiler must model." — Source: [MIT CSAIL Research Profile]
On Sparse Tensors: "Developing mechanized algebras for sparse tensors allows the compiler to mathematically prove the correctness of complex sparsity-preserving transformations." — Source: [Taichi Project Notes]
On Compression Overheads: "The cost of decoding a sparse format must not exceed the compute savings gained by skipping the zeros." — Source: [Taichi Project Notes]

Part 8: The Future of High-Performance Systems

On Organization as a First-Class Concern: "The organization of computation—where data lives and when it is processed—must be elevated to a first-class programmatic concern." — Source: Systems Lunch
On the Role of the Programmer: "Programmers should focus on the 'what' of the algorithm and the high-level 'how' of the schedule, leaving the rote translation to the tooling." — Source: [PLDI 2024 Keynote]
On AI-Assisted Compilation: "Machine learning is becoming a viable tool for navigating the massive search spaces of possible schedules." — Source: [MIT CSAIL Research Profile]
On Heterogeneous Computing: "The future is not a faster CPU, but a chaotic assembly of CPUs, GPUs, NPUs, and custom ASICs working in fragile concert." — Source: [MIT EECS Faculty Page]
On Breaking the Monolith: "Monolithic compilation is giving way to modular systems where experts can plug in domain-specific optimization rules." — Source: [Exo Project Introduction]
On Cross-Disciplinary Teams: "High-performance systems now require deep collaboration between algorithm designers, compiler writers, and silicon architects." — Source: [MIT CSAIL Research Profile]
On Language Evolution: "Languages that do not provide mechanisms to reason about memory layout will eventually be relegated to control-flow orchestration." — Source: [HYTRADBOI 2025 Presentation]
On System Brittleness: "Hand-tuning C++ for a specific GPU creates brittle software that must be entirely rewritten when the next hardware generation arrives." — Source: [Halide Language Specification]
On Democratic Performance: "Providing high-level scheduling languages democratizes performance, allowing generalists to achieve speeds previously reserved for assembly ninjas." — Source: [PLDI 2024 Keynote]
On the Limits of Abstraction: "Abstractions must leak eventually; the goal of a good systems language is to control exactly how and where that leakage occurs." — Source: [MIT CSAIL Research Profile]

Lessons from Jonathan Ragan-Kelley

Lessons from Jonathan Ragan-Kelley

Part 1: The Twilight of Moore's Law

Part 2: The Philosophy of Halide

Part 3: Locality, Parallelism, and Redundancy

Part 4: Exo and User-Schedulable Languages

Part 5: Visual Computing and Graphics

Part 6: Verification and Tensor Programs

Part 7: Sparsity and Spatial Architectures

Part 8: The Future of High-Performance Systems

Explore the surrounding system

Get the next notes and essays.

More profiles

Lessons from Darren Farber

Lessons from Vlad Barbalat

Lessons from Kareem Amin