Lessons from Jim Fan

Jim Fan leads NVIDIA's work on generally capable embodied agents: systems that learn across games, simulations, humanoid robots, dexterous hands, and world models. The work is instructive because it treats robotics as a data and systems problem rather than a demo problem. The through-line runs in three steps: use open-ended worlds to collect behavior, use language and code to bootstrap skills, then use simulation and human video to move those skills into the physical world.

Part 1: Start With Open-Ended Worlds

  1. On choosing Minecraft: MineDojo treated Minecraft as a serious research environment because it contains long-horizon tasks, diverse objects, open-ended goals, and a giant external knowledge base. — Source: MineDojo arXiv
  2. On internet-scale embodied data: MineDojo's core bet was that an agent could learn from Minecraft videos, wiki pages, and community discussions instead of relying only on handcrafted reward functions. — Source: NVIDIA MineDojo Blog
  3. On broad task coverage: The MineDojo benchmark was designed around thousands of open-ended tasks rather than a small set of narrow game objectives. — Source: MineDojo Project
  4. On research prizes: MineDojo won a NeurIPS 2022 Outstanding Datasets and Benchmarks Paper Award, which helped validate Minecraft as more than a toy domain. — Source: NVIDIA NeurIPS Awards
  5. On learning from culture: The important move was not only the simulator; it was connecting the simulator to the surrounding human culture of tutorials, walkthroughs, forum posts, and videos. — Source: MineDojo Paper
  6. On embodied AI before robots: Fan's path shows that virtual worlds can be the cheapest place to debug embodied intelligence before moving to expensive hardware. — Source: Jim Fan Website
  7. On open tools: MineDojo shipped a simulation suite, knowledge bases, algorithms, and pretrained models, reinforcing Fan's bias toward reusable research infrastructure. — Source: NVIDIA SRL MineDojo

Part 2: Turn Language Models Into Skill Builders

  1. On Voyager's claim: Voyager was framed as the first LLM-powered embodied lifelong learning agent in Minecraft, continuously exploring, acquiring skills, and making discoveries without human intervention. — Source: Voyager Project
  2. On code as action: Voyager's key move was using GPT-4 to write executable code, turning language reasoning into actions inside the game. — Source: Voyager arXiv
  3. On the skill library: Voyager stores successful programs in a persistent skill library, so progress becomes accumulated code rather than a temporary chat transcript. — Source: Voyager Project
  4. On self-debugging: The agent uses environment feedback and execution errors to revise its own code, making debugging part of the learning loop. — Source: Voyager arXiv
  5. On curriculum design: Voyager's automatic curriculum pushes it toward novelty, so it keeps searching for new items and capabilities instead of repeating solved tasks. — Source: Voyager Project
  6. On public impact: The Voyager work stood out because it gave a concrete example of an LLM agent that could accumulate capabilities over time, not merely answer prompts. — Source: WIRED
  7. On the practical lesson: Fan's agent work suggests that durable memory should often be a tool library, a codebase, or a world model, not a longer prompt. — Source: Voyager arXiv
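
The loop described in this part (write code, run it, feed errors back, store what works, then pick something new) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Voyager's implementation: `stub_code_model` is a hypothetical stand-in for the GPT-4 call, the "world" is a plain inventory dictionary, and `SkillLibrary`, `learn_skill`, and `next_task` are names invented here.

```python
from typing import Dict, List, Optional


class SkillLibrary:
    """Persistent store of programs that passed verification, keyed by task."""

    def __init__(self) -> None:
        self.skills: Dict[str, str] = {}

    def add(self, task: str, source: str) -> None:
        self.skills[task] = source

    def names(self) -> List[str]:
        return sorted(self.skills)


def stub_code_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a code-writing LLM.

    The first draft uses a wrong inventory key; once it receives error
    feedback it returns a corrected program.
    """
    if feedback is None:
        return "inventory['planks'] = inventory['log'] * 4"
    return "inventory['planks'] = inventory['logs'] * 4"


def run_program(source: str, inventory: Dict[str, int]) -> Optional[str]:
    """Run candidate code against the toy world; return an error string or None."""
    try:
        exec(source, {"inventory": inventory})
    except Exception as exc:  # execution errors become feedback for the next draft
        return f"{type(exc).__name__}: {exc}"
    return None


def learn_skill(task: str, inventory: Dict[str, int],
                library: SkillLibrary, max_attempts: int = 3) -> int:
    """Draft, execute, and revise until the task check passes; return attempts used."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = stub_code_model(task, feedback)
        feedback = run_program(source, inventory)
        if feedback is None and inventory.get("planks", 0) > 0:
            library.add(task, source)  # only verified code is stored
            return attempt
        if feedback is None:
            feedback = "program ran but the task check failed"
    raise RuntimeError(f"could not learn {task!r}: {feedback}")


def next_task(candidates: List[str], library: SkillLibrary) -> Optional[str]:
    """Toy curriculum: propose the first task that is not already in the library."""
    for task in candidates:
        if task not in library.skills:
            return task
    return None


library = SkillLibrary()
inventory = {"logs": 2}
curriculum = ["craft_planks", "craft_table"]

task = next_task(curriculum, library)
attempts = learn_skill(task, inventory, library)
print(task, attempts, inventory, library.names())
print("next:", next_task(curriculum, library))
```

In this sketch the first draft fails with a `KeyError`, the second succeeds, and the curriculum then moves on to the unsolved task. The point of the design is that what persists between tasks is verified source code, not conversation history.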

Part 3: Make Robotics Promptable

  1. On VIMA's premise: VIMA showed that robot manipulation tasks could be expressed through multimodal prompts that interleave text and visual tokens. — Source: VIMA arXiv
  2. On general robot manipulation: VIMA treated manipulation as a sequence-modeling problem, using a transformer agent that reads prompts and autoregressively outputs motor actions. — Source: VIMA Project
  3. On prompt interfaces: The lesson from VIMA is that robotics interfaces can look more like instructions and examples than fixed controllers. — Source: VIMA arXiv
  4. On model scaling: VIMA's result matters because it connected robot manipulation to the same prompt-based learning pattern that had already worked in language. — Source: VIMA Paper PDF
  5. On early foundation models: Fan describes VIMA as one of the early multimodal foundation models for robot manipulation, placing it in the lineage before today's humanoid models. — Source: Jim Fan Website
  6. On unifying perception and action: The key architectural lesson is to stop treating vision, language, and motor control as separate products and instead train a single policy surface. — Source: VIMA Project
  7. On task diversity: A robot foundation model needs diverse tasks during training because brittle single-task competence does not transfer well to real homes or factories. — Source: GEAR Lab
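
To make "multimodal prompt" and "autoregressive action output" concrete, here is a minimal sketch. It does not use VIMA's code or API: `TextToken`, `ImageToken`, `build_prompt`, and the rule-based `next_action` are all hypothetical, and the rule stands in for a trained transformer. It only shows the data shape (text and image tokens interleaved in one sequence) and the decoding pattern (each action is conditioned on the prompt plus the actions so far).

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class TextToken:
    word: str


@dataclass(frozen=True)
class ImageToken:
    object_id: str  # placeholder for a visual embedding of an object crop


PromptToken = Union[TextToken, ImageToken]


def build_prompt(template: str, images: Dict[str, str]) -> List[PromptToken]:
    """Interleave words and image placeholders in the order they appear."""
    tokens: List[PromptToken] = []
    for piece in template.split():
        if piece.startswith("{") and piece.endswith("}"):
            tokens.append(ImageToken(images[piece[1:-1]]))
        else:
            tokens.append(TextToken(piece))
    return tokens


def next_action(prompt: List[PromptToken], history: List[str]) -> str:
    """Toy stand-in for a learned policy: one action from prompt plus past actions."""
    objects = [t.object_id for t in prompt if isinstance(t, ImageToken)]
    if len(history) < len(objects):
        verb = "pick" if len(history) == 0 else "place_on"
        return f"{verb}:{objects[len(history)]}"
    return "stop"


def decode_actions(prompt: List[PromptToken], max_steps: int = 8) -> List[str]:
    """Autoregressive rollout: each emitted action joins the context for the next step."""
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(prompt, history)
        if action == "stop":
            break
        history.append(action)
    return history


prompt = build_prompt(
    "put {obj} into {container}",
    {"obj": "red_block", "container": "green_bowl"},
)
actions = decode_actions(prompt)
print(actions)
```

Changing the task here means changing the prompt, not the controller, which is the interface shift this part describes.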

Part 4: Automate the Bottlenecks

  1. On Eureka's target: Eureka attacked reward design, one of the most stubborn human bottlenecks in reinforcement learning. — Source: Eureka Project
  2. On coding rewards: Eureka used coding LLMs to generate and improve reward functions, turning reward design into an iterative program-search problem. — Source: Eureka arXiv
  3. On simulation loops: Eureka used GPU-accelerated simulation in Isaac Gym to evaluate batches of candidate reward programs quickly. — Source: Eureka Project
  4. On robot dexterity: Fan highlights Eureka because it taught a five-finger robot hand difficult skills like pen spinning, showing that language models can help with physical control indirectly. — Source: Jim Fan Website
  5. On the broader pattern: Eureka fits Fan's larger thesis: use foundation models to automate the parts of robotics research that used to require slow expert hand-design. — Source: GEAR Lab
  6. On reward functions as code: Once the reward is code, agents can mutate, test, and revise it, which makes the training loop more scalable than manual tuning. — Source: Eureka arXiv
  7. On infrastructure leverage: The important artifact is the loop: generate candidate code, run simulation, score behavior, and feed the result back into the model. — Source: Eureka Project
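
The generate, simulate, score, and feed-back loop can be sketched on a toy problem. Everything below is an assumption made for illustration: `propose_rewards` is a hypothetical stand-in for the coding LLM, the "simulator" is a point moving on a number line with a greedy policy, and the fitness function is a made-up task score. It is not Eureka's code and does not use Isaac Gym.

```python
from typing import Callable, List, Optional, Tuple

TARGET = 5   # goal position on the number line
STEPS = 10   # episode length


def propose_rewards(feedback: Optional[str]) -> List[str]:
    """Hypothetical stand-in for a coding LLM that returns reward-function source."""
    if feedback is None:
        return [
            "def reward(pos, target):\n    return pos",
            "def reward(pos, target):\n    return -pos",
        ]
    # After seeing feedback, the stub proposes target-aware rewards.
    return [
        "def reward(pos, target):\n    return -abs(pos - target)",
        "def reward(pos, target):\n    return -(pos - target) ** 2",
    ]


def compile_reward(source: str) -> Callable[[int, int], float]:
    """Turn reward source code into a callable."""
    scope: dict = {}
    exec(source, scope)
    return scope["reward"]


def simulate(reward: Callable[[int, int], float]) -> int:
    """Toy simulator: a greedy policy moves -1, 0, or +1 to maximize the reward."""
    pos = 0
    for _ in range(STEPS):
        pos = max((pos - 1, pos, pos + 1), key=lambda p: reward(p, TARGET))
    return pos


def fitness(final_pos: int) -> int:
    """Ground-truth task score, kept separate from the candidate rewards."""
    return -abs(final_pos - TARGET)


def reward_search(iterations: int = 2) -> Tuple[str, float]:
    """Generate candidates, evaluate each in simulation, and feed the best score back."""
    best_source, best_fitness = "", float("-inf")
    feedback: Optional[str] = None
    for _ in range(iterations):
        for source in propose_rewards(feedback):
            score = fitness(simulate(compile_reward(source)))
            if score > best_fitness:
                best_source, best_fitness = source, score
        feedback = f"best fitness so far: {best_fitness}"
    return best_source, best_fitness


best_source, best_fitness = reward_search()
print(best_fitness)
print(best_source)
```

The first batch of rewards sends the agent past or away from the target; the second batch, produced after feedback, reaches it. Keeping `fitness` separate from the candidate rewards is what lets the loop compare reward programs against each other.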

Part 5: Build Foundation Models for Humanoids

  1. On Project GR00T: NVIDIA announced Project GR00T as a general-purpose foundation model initiative for humanoid robots. — Source: NVIDIA Newsroom
  2. On humanoid learning: GR00T is meant to help robots understand natural language and learn coordination, dexterity, and other skills from human actions. — Source: NVIDIA Newsroom
  3. On platform thinking: The GR00T work sits inside a broader Isaac robotics stack that includes simulation, data generation, and workflow infrastructure. — Source: NVIDIA Isaac GR00T
  4. On GR00T N1: GR00T N1 was released as an open foundation model for generalist humanoid robots. — Source: GR00T N1 Research
  5. On dual systems: GR00T N1 uses a vision-language module for interpretation and a diffusion transformer module for generating fluid motor actions. — Source: GR00T N1 arXiv
  6. On data mixture: The GR00T N1 recipe uses real robot trajectories, human videos, and synthetic data rather than betting on one data source. — Source: GR00T N1 Research
  7. On deployment: The paper reports deployment on a Fourier GR-1 humanoid for language-conditioned bimanual manipulation. — Source: GR00T N1 arXiv
  8. On ecosystem strategy: Fan's robotics strategy is also a partner strategy: make a model and data platform that many humanoid companies can adapt. — Source: NVIDIA Isaac GR00T
  9. On synthetic data: GR00T-Gen and related workflows use simulation-ready environments and synthetic trajectories to expand robot training coverage. — Source: NVIDIA Technical Blog
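
One simple way to train on a mixture of real robot trajectories, human videos, and synthetic data is to sample each training example by source weight rather than by pool size. The sketch below assumes that approach; the pool contents, pool sizes, and `WEIGHTS` are made up for illustration and are not taken from the GR00T N1 paper.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical pools; each string stands in for one training example.
POOLS: Dict[str, List[str]] = {
    "real_robot": [f"teleop_{i}" for i in range(50)],
    "human_video": [f"video_{i}" for i in range(500)],
    "synthetic": [f"sim_{i}" for i in range(5000)],
}

# Illustrative sampling weights (assumed, not reported values).
WEIGHTS: Dict[str, float] = {
    "real_robot": 0.4,
    "human_video": 0.3,
    "synthetic": 0.3,
}


def sample_batch(batch_size: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Pick a source per example by weight, then an example uniformly within it."""
    sources = list(POOLS)
    picks = rng.choices(
        sources, weights=[WEIGHTS[s] for s in sources], k=batch_size
    )
    return [(source, rng.choice(POOLS[source])) for source in picks]


rng = random.Random(0)
batch = sample_batch(1000, rng)
counts = Counter(source for source, _ in batch)
print(counts)
```

Weighting by source means the small real-robot pool is upsampled instead of being drowned out by the much larger synthetic pool.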

Part 6: Treat Human Video as Robot Fuel

  1. On EgoScale: EgoScale reframed dexterous robot learning as a scaling problem over egocentric human video. — Source: EgoScale Project
  2. On data scale: EgoScale trained on more than 20,854 hours of action-labeled egocentric human video, which the project page describes as over 20 times larger than prior efforts. — Source: EgoScale Project
  3. On the scaling law: The EgoScale team reported a near-perfect log-linear relationship between human data scale and validation loss. — Source: EgoScale arXiv
  4. On robot transfer: EgoScale combined large-scale human pretraining with a small amount of aligned human-robot mid-training to improve downstream dexterous manipulation. — Source: EgoScale Project
  5. On high-DoF hands: The policy improved average success rate over a no-pretraining baseline using a 22-degree-of-freedom robotic hand. — Source: EgoScale arXiv
  6. On the data bottleneck: Fan's newer work makes the case that robotics progress depends less on isolated teleoperation datasets and more on reusable human-motion priors. — Source: EgoScale Project
  7. On embodiment transfer: EgoScale's transfer across higher- and lower-DoF hands supports the idea that human motion can become an embodiment-agnostic training signal. — Source: EgoScale arXiv
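
A log-linear relationship means validation loss falls roughly as a straight line in the logarithm of data hours, that is, loss is approximately a + b * ln(hours) with b negative. The sketch below fits that form by ordinary least squares. The data points are synthetic and invented for this example (generated from a = 2.0, b = -0.1 plus small perturbations); they are not EgoScale's measurements, and `fit_log_linear` is a name made up here.

```python
import math
from typing import List, Tuple


def fit_log_linear(hours: List[float], losses: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of loss = intercept + slope * ln(hours); also returns R^2."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Synthetic data for illustration only.
hours = [1000, 2000, 5000, 10000, 20000]
noise = [0.004, -0.003, 0.002, -0.004, 0.001]
losses = [2.0 - 0.1 * math.log(h) + e for h, e in zip(hours, noise)]

intercept, slope, r_squared = fit_log_linear(hours, losses)
print(f"loss = {intercept:.3f} + ({slope:.3f}) * ln(hours), R^2 = {r_squared:.4f}")
```

An R^2 close to 1 on a fit like this is what "near-perfect log-linear" would look like; under that form, each doubling of data lowers the loss by a constant amount, |slope| * ln(2).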

Part 7: Move From Simulation to World Models

  1. On DreamGen: DreamGen uses video world models to generate neural trajectories that help robot policies generalize across behaviors and environments. — Source: DreamGen Project
  2. On pseudo-actions: Because video models generate videos rather than robot commands, DreamGen recovers pseudo-action sequences through latent action or inverse-dynamics models. — Source: DreamGen arXiv
  3. On zero-shot generalization: DreamGen reports zero-shot behavior and environment generalization after using teleoperation data from only a single pick-and-place task in one environment. — Source: DreamGen Project
  4. On DreamDojo: DreamDojo is a generalist robot world model trained from large-scale human videos, then post-trained on small target-robot data. — Source: DreamDojo arXiv
  5. On open artifacts: NVIDIA released DreamDojo artifacts on Hugging Face, continuing Fan's preference for public research objects rather than closed demos alone. — Source: DreamDojo Hugging Face
  6. On DreamZero: DreamZero introduced a World Action Model built on a pretrained video diffusion backbone. — Source: DreamZero arXiv
  7. On replacing VLA limits: The World Action Model direction tries to make vision and action first-class rather than appending actions to a language-heavy model. — Source: DreamZero PDF
  8. On the systems arc: The progression from MineDojo to Voyager to GR00T to DreamZero is a progression from game agents to executable skills to robot policies to learned physical futures. — Source: GEAR Lab
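
The pseudo-action step can be illustrated with a deliberately tiny inverse-dynamics model. The assumptions are heavy and worth stating: each "frame" is reduced to a single number (an end-effector position), the inverse-dynamics model is a one-parameter linear fit, and all function names and data are invented here. DreamGen itself works on real video with learned models; this only shows the pattern of fitting on action-labeled data and then labeling action-free generated video.

```python
from typing import List


def fit_inverse_dynamics(frames: List[float], actions: List[float]) -> float:
    """Least-squares gain k such that action is about k * (next_frame - frame)."""
    deltas = [b - a for a, b in zip(frames, frames[1:])]
    return sum(d * u for d, u in zip(deltas, actions)) / sum(d * d for d in deltas)


def label_video(frames: List[float], gain: float) -> List[float]:
    """Recover one pseudo-action per consecutive frame pair."""
    return [gain * (b - a) for a, b in zip(frames, frames[1:])]


# Small action-labeled set (stand-in for teleoperation data).
teleop_frames = [0.0, 0.5, 1.5, 1.0]
teleop_actions = [1.0, 2.0, -1.0]
gain = fit_inverse_dynamics(teleop_frames, teleop_actions)

# Frames produced by a video model arrive without actions attached.
generated_frames = [0.0, 1.0, 3.0, 2.0]
pseudo_actions = label_video(generated_frames, gain)

print(gain)
print(pseudo_actions)
```

Once generated frames carry pseudo-actions, they can be treated as extra trajectories for policy training, which is how a video model becomes a source of robot data.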

Part 8: The Research Operating System

  1. On GEAR's mission: GEAR's stated mission is to build foundation models for embodied agents in both virtual and physical worlds. — Source: GEAR Lab
  2. On the research portfolio: GEAR organizes its work around multimodal foundation models, general-purpose robots, foundation agents, and simulation or synthetic data. — Source: GEAR Lab
  3. On career positioning: Fan's own biography connects Stanford vision research, early OpenAI experience, NVIDIA robotics, and a consistent interest in agents that act in worlds. — Source: Jim Fan Website
  4. On OpenAI roots: Fan notes that he was OpenAI's first intern in 2016 and worked on World of Bits, an early agent project for browser control from pixels. — Source: Jim Fan Website
  5. On public thought leadership: Fan's recent AI Ascent appearance placed his robotics argument in front of founders and researchers focused on the next phase of AI infrastructure. — Source: Sequoia AI Ascent 2026
  6. On the robotics end game: His 2026 argument is that robotics can follow a parallel path to language models: pretraining, action tuning, reinforcement learning, and eventually automated physical research. — Source: Zeus AI Summary
  7. On a useful prediction: The most actionable version of Fan's thesis is not a timeline; it is the stack: human video, world models, simulation, robot foundation models, and automated evaluation loops. — Source: GEAR Lab
  8. On the durable lesson: Fan's work argues that embodied AI advances when research teams build data engines and training loops that make every successful behavior reusable. — Source: Jim Fan Website