# Lessons from Jitendra Malik

Jitendra Malik is a computer vision researcher and UC Berkeley professor who developed early methods for image segmentation and object recognition. He argues that artificial intelligence will not advance until it masters physical, sensorimotor interaction. This profile collects his technical insights on why machines must move to see, and see to move.

Part 1: The Sensorimotor Road to AI

On the Goal of AI: "Intelligence emerges in the interaction of an agent with an environment and as the result of sensorimotor activity." — Source: [Foundation Models Review]
On Action and Perception: "We see in order to move and we move in order to see." — Source: [Martin Meyerson Faculty Research Lecture]
On the Limits of Text: "Language is a thin layer on top of a massive foundation of sensorimotor understanding that we share with animals." — Source: [The Robot Brains Podcast]
On Physical Grounding: "You cannot build a complete artificial intelligence system without grounding it in the physical realities of gravity, friction, and object interaction." — Source: [Lex Fridman Podcast #110]
On True Intelligence: "The real action in AI is at the sensorimotor level, rather than manipulating symbols in a vacuum." — Source: [Martin Meyerson Faculty Research Lecture]
On Biological Precedents: "Animal intelligence developed over millions of years through movement and physical survival, which provides the blueprint for machine intelligence." — Source: [UC Berkeley EECS Faculty Page]
On Beyond Language: "A system trained solely on text lacks the tactile and spatial intuition required to truly comprehend the physical world." — Source: [Foundation Models Review]
On Embodiment: "Starting as an agent grounded in a physical space is a prerequisite for developing flexible, human-like intelligence." — Source: [Foundation Models Review]
On Vision's Purpose: "The visual cortex did not evolve to classify static images, but to guide motor control and navigation." — Source: [The Robot Brains Podcast]
On Future Benchmarks: "The next frontier of AI benchmarks must measure physical competence and sensorimotor adaptability, rather than static dataset accuracy." — Source: [Lex Fridman Podcast #110]

Part 2: Lessons from Child Development

On Learning Efficiency: "By the age of two or three, kids have become visual learning machines. They can tell the difference between cats and dogs with at most hundreds of examples." — Source: [Towards Data Science Interview]
On One-Shot Learning: "You take a child to the zoo and say 'that is a zebra.' That's all it takes. We still need a few thousand examples for our models." — Source: [Towards Data Science Interview]
On Alan Turing's Vision: "Instead of simulating the adult mind, we should focus on simulating the child's mind and subjecting it to education." — Source: [Martin Meyerson Faculty Research Lecture]
On the Scientist in the Crib: "Babies learn by conducting physical experiments with their environment by pushing, dropping, and grasping objects to understand physics." — Source: [The Robot Brains Podcast]
On Supervision: "Children do not learn through massive sets of labeled images; they learn through continuous, unsupervised physical interaction and sparse feedback." — Source: [Lex Fridman Podcast #110]
On Bridging Disciplines: "The insights of developmental psychology are important for AI progress but remain underutilized by the machine learning community." — Source: [Humans of AI Podcast]
On Play: "Unstructured play is a highly efficient mechanism for a child to build a stable model of worldly physics." — Source: [Martin Meyerson Faculty Research Lecture]
On Motor Milestones: "A child's progression from crawling to walking illustrates how morphological capability drives cognitive expansion." — Source: [The Robot Brains Podcast]
On Multi-Modal Learning: "Infants integrate sight, sound, and touch simultaneously, which allows them to learn faster than single-modality neural networks." — Source: [Lex Fridman Podcast #110]
On Active Learning: "A child moves their head to generate optical flow, actively gathering data rather than passively consuming it." — Source: [UC Berkeley EECS Faculty Page]

Part 3: The Three R's of Computer Vision

On Core Tasks: "The fundamental challenges of computer vision can be divided into recognition, reconstruction, and reorganization." — Source: [Embedded Vision Alliance Interview]
On Recognition: "Identifying what an object is forms only one pillar of visual understanding; it is insufficient on its own." — Source: [UC Berkeley EECS Faculty Page]
On Reconstruction: "The visual system must build a 3D model of the scene from 2D projections to facilitate navigation and manipulation." — Source: [Embedded Vision Alliance Interview]
On Reorganization: "Bottom-up perceptual grouping and top-down cognitive processes must interact to segment a scene effectively." — Source: [ICCV 2019 Keynote]
On Unifying the 3Rs: "Deep neural networks have allowed us to tackle recognition, reconstruction, and reorganization within a single mathematical framework." — Source: [Towards Data Science Interview]
On Early Aspirations: "I started 35 years ago as a grad student and this was my dream. And now it is there." — Source: [Towards Data Science Interview]
On Segmentation: "Normalized Cuts demonstrated that image segmentation could be treated rigorously as a graph partitioning problem." — Source: [UC Berkeley EECS Faculty Page]
On Edges and Regions: "Finding boundaries is intrinsically linked to grouping pixels into regions; you cannot solve one without the other." — Source: [ICCV 2019 Keynote]
On Context: "Object recognition improves dramatically when the system reconstructs the surrounding 3D context rather than viewing crops in isolation." — Source: [Lex Fridman Podcast #110]
On Video vs. Stills: "The Three R's must be extended into the temporal domain, as static images discard the essential dimension of time." — Source: [SlowFast Networks Paper]
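
The graph-partitioning view of segmentation mentioned above can be made concrete with a small sketch. The code below is only an illustration, not the published Normalized Cuts algorithm. It scores a two-way split by the normalized cut criterion, cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), and finds the best split by exhaustive search. Exhaustive search is practical only for tiny graphs; as far as the method is usually described, it instead relaxes the problem to an eigenvector computation. The six-node graph and the helper names (`assoc`, `ncut`, `best_bipartition`, `toy_graph`) are assumptions made up for this example.

```python
from itertools import combinations


def assoc(weights, group, nodes):
    """Total edge weight from every node in `group` to every node in `nodes`."""
    return sum(weights[i][j] for i in group for j in nodes)


def ncut(weights, part_a):
    """Normalized cut value of the split (part_a, everything else).

    Assumes every node has at least one edge, so neither denominator is zero.
    """
    nodes = range(len(weights))
    a = set(part_a)
    b = [n for n in nodes if n not in a]
    cut = assoc(weights, a, b)
    return cut / assoc(weights, a, nodes) + cut / assoc(weights, b, nodes)


def best_bipartition(weights):
    """Exhaustively search all two-way splits; feasible only for tiny graphs."""
    n = len(weights)
    best = None
    # Node 0 always goes in part A so each split is visited exactly once.
    for size in range(0, n - 1):
        for rest in combinations(range(1, n), size):
            part_a = (0,) + rest
            score = ncut(weights, part_a)
            if best is None or score < best[0]:
                best = (score, part_a)
    return best


def toy_graph():
    """Hypothetical graph: two tight 3-node clusters joined by one weak edge."""
    n = 6
    w = [[0.0] * n for _ in range(n)]
    for cluster in ((0, 1, 2), (3, 4, 5)):
        for i, j in combinations(cluster, 2):
            w[i][j] = w[j][i] = 1.0
    w[2][3] = w[3][2] = 0.1  # weak link between the clusters
    return w


score, part_a = best_bipartition(toy_graph())
```

On this toy graph the lowest-scoring split separates the two clusters, while a split that isolates a single node scores far worse because the cut is normalized by each side's total connectivity.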

Part 4: Real-World Robotics and Adaptation

On Demos: "Don't get seduced by cherry-picked demos. You can't do robotics without doing robotics." — Source: [The Robot Brains Podcast]
On Environmental Change: "Our insight is that change is ubiquitous, so from day one, the RMA policy assumes that the environment will be new." — Source: [RMA Project Page]
On Forethought: "Adaptation is not an afterthought, but a forethought. That is the secret sauce for deploying robots outside the lab." — Source: [RMA Project Page]
On Simulation: "We can train control policies in simulation, but they must be designed to adapt to the unmodeled physics of the real world instantly." — Source: [Martin Meyerson Faculty Research Lecture]
On Hardware Constraints: "The limitations of physical actuators and sensors are as important to solve as the neural architecture itself." — Source: [The Robot Brains Podcast]
On Proprioception: "Knowing where your limbs are in space is a foundational sense that precedes complex visual planning." — Source: [Lex Fridman Podcast #110]
On Legged Locomotion: "Walking across uneven terrain requires constant micro-adjustments that cannot rely on slow visual processing alone." — Source: [RMA Project Page]
On Tactile Sensing: "The physics of grasping and manipulation require high-fidelity tactile feedback, an area where robotics still lags behind biology." — Source: [The Robot Brains Podcast]
On Closed-Loop Control: "Real robotics requires continuous, closed-loop action rather than open-loop plan generation." — Source: [Martin Meyerson Faculty Research Lecture]
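
The contrast between closed-loop action and open-loop plans can be shown with a toy simulation. This is a minimal sketch under invented assumptions: a one-dimensional position, a constant unmodeled drift, and a plain proportional controller. It is not RMA or any controller from Malik's group, and the names `simulate`, `open_loop`, and `closed_loop` are hypothetical.

```python
TARGET = 1.0   # desired final position
STEPS = 20     # number of control steps


def simulate(controller, steps=STEPS, drift=-0.05, start=0.0):
    """Roll out x <- x + command + drift; the controller never sees `drift`."""
    x = start
    for t in range(steps):
        x = x + controller(t, x) + drift
    return x


def open_loop(t, x):
    """Precomputed plan: equal steps toward the target, ignoring feedback."""
    return TARGET / STEPS


def closed_loop(t, x, gain=0.5):
    """Proportional feedback: re-measure the error at every step."""
    return gain * (TARGET - x)


open_error = abs(TARGET - simulate(open_loop))
closed_error = abs(TARGET - simulate(closed_loop))
```

In this setup the open-loop plan is cancelled entirely by the drift, while the feedback controller settles close to the target. A proportional controller still leaves a residual error of roughly drift divided by gain, which is one reason practical systems add adaptation on top of simple feedback.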

Part 5: Deep Learning's Impact and Limits

On the Pre-Training Pioneer: "Indeed the R-CNN work from my group was a pioneer in using ImageNet pre-training before finetuning for the task of object detection." — Source: [Foundation Models Review]
On the Deep Learning Shift: "The transition to deep learning solved feature engineering but replaced it with the challenge of architecture design and data curation." — Source: [Lex Fridman Podcast #110]
On Computation: "The bitter lesson of AI is that methods using massive computation eventually outperform hand-crafted rules." — Source: [UC Berkeley EECS Faculty Page]
On Biologically Plausible Models: "Not all successful engineering solutions in deep learning reflect how the human brain actually operates." — Source: [Lex Fridman Podcast #110]
On Scaling: "Scaling up parameters and data works predictably well, but it does not guarantee a leap to general intelligence." — Source: [The Robot Brains Podcast]
On Architecture: "Designing separate pathways for spatial detail and temporal motion, as seen in SlowFast networks, mirrors the primate visual system." — Source: [SlowFast Networks Paper]
On Interpretability: "As models grow larger, we trade clear geometric understanding for empirical performance." — Source: [Embedded Vision Alliance Interview]
On Local Minima: "We spent decades worrying about local minima in neural networks, only to realize high-dimensional spaces offer plenty of escape routes." — Source: [Lex Fridman Podcast #110]
On Representation: "The true power of a neural network lies in learning intermediate representations that generalize across tasks." — Source: [UC Berkeley EECS Faculty Page]

Part 6: Rethinking Data and "Labels"

On the Addiction to Labels: "Labels are the opium of the machine learning researcher." — Source: [Towards Data Science Interview]
On Self-Supervision: "The future of learning relies on finding training signals inherent in the data itself rather than relying on human annotators." — Source: [Lex Fridman Podcast #110]
On the Limits of ImageNet: "A static dataset of nouns is a poor proxy for the verb-heavy, dynamic reality of human visual experience." — Source: [The Robot Brains Podcast]
On Annotation Costs: "We cannot scale robotics if every new environment requires thousands of hours of human labeling." — Source: [Martin Meyerson Faculty Research Lecture]
On Video as Data: "Video provides a continuous stream of physics lessons; the temporal sequence itself is a supervisory signal." — Source: [SlowFast Networks Paper]
On Cross-Modal Verification: "If what you see matches what you touch, the system generates its own ground truth without a human labeler." — Source: [Lex Fridman Podcast #110]
On Crowdsourcing: "While Mechanical Turk accelerated early computer vision, it constrained the field to tasks that are easily explained in text prompts." — Source: [Embedded Vision Alliance Interview]
On Rare Events: "The real-world distribution has a long tail; supervised learning fails on edge cases because we cannot label everything." — Source: [The Robot Brains Podcast]
On Embodied Data: "The best dataset is the physical world, queried actively by an agent through movement and interaction." — Source: [Martin Meyerson Faculty Research Lecture]
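
The idea that a temporal sequence can act as its own supervisory signal can be sketched as a frame-ordering pretext task. The code below is a generic illustration of that family of self-supervised methods, not a method attributed to Malik in the sources above. The function name `make_order_examples` and the string frame identifiers are placeholders.

```python
def make_order_examples(frames, window=3):
    """Build (clip, label) pairs from an ordered frame sequence.

    Label 1 means the clip is in its true temporal order; label 0 means the
    last two frames were swapped. No human annotation is involved: the labels
    come from the video's own ordering. Assumes window >= 2 and that
    neighbouring frames are distinguishable.
    """
    examples = []
    for start in range(len(frames) - window + 1):
        clip = tuple(frames[start:start + window])
        examples.append((clip, 1))
        swapped = clip[:-2] + (clip[-1], clip[-2])
        examples.append((swapped, 0))
    return examples


# Hypothetical frame identifiers standing in for real image data.
frames = ["f0", "f1", "f2", "f3", "f4"]
examples = make_order_examples(frames)
```

A classifier trained to separate the two labels has to pick up on how scenes evolve over time, which is the sense in which the sequence supervises itself.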

Part 7: Historical Context and Paradigms

On the Pre-Deep Learning Era: "Classical computer vision forced us to define geometry and optics explicitly, which built a mathematical rigor we still need today." — Source: [Embedded Vision Alliance Interview]
On Perona-Malik: "Anisotropic diffusion demonstrated that we could smooth an image to reduce noise while mathematically preserving its sharp edges." — Source: [UC Berkeley EECS Faculty Page]
On Shape Contexts: "Before deep features, capturing the spatial arrangement of edge points was a reliable way to match shapes under deformation." — Source: [Lex Fridman Podcast #110]
On Academic Trends: "The field oscillates between engineering-driven hacks and biologically inspired theories; the best work lives at the intersection." — Source: [Humans of AI Podcast]
On the AI Winter: "Researchers who worked through periods of low funding developed an appreciation for fundamental theory over quick demos." — Source: [The Robot Brains Podcast]
On Gibsonian Psychology: "James J. Gibson's theory of ecological optics (that perception is for action) fundamentally shaped how we approach modern robot vision." — Source: [Martin Meyerson Faculty Research Lecture]
On David Marr: "Marr’s framework of computational, algorithmic, and implementational levels remains a useful way to organize vision research." — Source: [Lex Fridman Podcast #110]
On the ImageNet Revolution: "The 2012 AlexNet breakthrough was the result of hardware, data, and algorithms finally aligning, rather than a completely new mathematical idea." — Source: [Embedded Vision Alliance Interview]
On Scientific Progress: "True progress in AI requires us to occasionally abandon paradigms we are comfortable with when they hit empirical ceilings." — Source: [Humans of AI Podcast]
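
The edge-preserving smoothing described in the Perona-Malik quote above can be illustrated in one dimension. This is a minimal sketch, assuming the exponential conductance g(d) = exp(-(d/K)^2), an explicit update with a small step size, and zero-flux boundaries. The original work targets 2D images. The signal values and the name `perona_malik_1d` are chosen for this example.

```python
import math


def perona_malik_1d(signal, iterations=50, kappa=1.0, step=0.25):
    """Explicit 1D anisotropic diffusion with zero-flux boundaries.

    Each sample moves toward its neighbours, but the exchange is weighted by
    exp(-(diff / kappa) ** 2), so large jumps (edges) barely diffuse while
    small fluctuations (noise) are averaged out.
    """
    u = list(signal)
    for _ in range(iterations):
        nxt = u[:]
        for i in range(len(u)):
            flux = 0.0
            for j in (i - 1, i + 1):
                if 0 <= j < len(u):  # skip neighbours outside the signal
                    diff = u[j] - u[i]
                    flux += math.exp(-(diff / kappa) ** 2) * diff
            nxt[i] = u[i] + step * flux
        u = nxt
    return u


# Hypothetical noisy step: values near 0, then values near 10.
noisy_step = [0.0, 0.1, -0.1, 0.05, -0.05, 10.0, 10.1, 9.9, 10.05, 9.95]
smoothed = perona_malik_1d(noisy_step)

left_spread = max(smoothed[:5]) - min(smoothed[:5])
edge_jump = smoothed[5] - smoothed[4]
```

On this signal the fluctuations inside each flat region shrink, while the jump between the two regions stays close to its original height. The parameter `kappa` sets which gradient magnitudes count as an edge.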

Part 8: The Future of Intelligence

On Foundation Models: "Large language models are impressive text manipulators, but they lack the physical grounding necessary for common sense." — Source: [Foundation Models Review]
On AGI Timelines: "Predicting the arrival of Artificial General Intelligence is less useful than focusing on solving specific, grounded sensorimotor problems today." — Source: [Lex Fridman Podcast #110]
On Human Uniqueness: "What makes human intelligence unique is more than language; it relies on our ability to invent tools and manipulate our physical environment." — Source: [The Robot Brains Podcast]
On Evaluating Progress: "We should judge AI systems by their ability to handle novel, unscripted physical situations, not their performance on static test sets." — Source: [Martin Meyerson Faculty Research Lecture]
On Autonomous Driving: "Driving is fundamentally a sensorimotor task, and solving it requires models that deeply understand the physics and geometry of the road." — Source: [Lex Fridman Podcast #110]
On Moravec's Paradox: "High-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources." — Source: [The Robot Brains Podcast]
On the Role of Simulation: "Simulators will become the primary classrooms for AI, provided we can successfully bridge the gap to the physical world." — Source: [RMA Project Page]
On Generative AI: "Generating plausible images is a fun parlor trick, but the deeper goal is generating physical actions that succeed in the real world." — Source: [Lex Fridman Podcast #110]
On the Ultimate Test: "The final frontier for artificial intelligence is not passing a Turing test in a chat window, but building a robot that can clean a messy kitchen." — Source: [Martin Meyerson Faculty Research Lecture]
