
How the same AI revolution that created chatbots and image generators is now teaching robots to see, reason and act in the physical world.
For decades, programming a robot to perform a new task meant starting almost from scratch. Every object, every environment, every sequence of movements had to be painstakingly coded or trained in isolation. A robot arm that could pick up bolts on a production line was useless at sorting parcels. A warehouse bot that navigated aisles flawlessly would be lost in a hospital corridor. Each new capability required its own bespoke software, its own dataset, its own engineering effort.
Foundation models are changing that equation — rapidly and fundamentally. The same class of large-scale, pre-trained AI models that gave us conversational chatbots, code assistants and photorealistic image generators is now being adapted for robotics. The goal is nothing less than a general-purpose "brain" for robots: a single model that can understand language instructions, perceive the physical world through cameras and sensors, reason about what needs to happen, and then output the precise motor commands to make it happen.
This is not a distant aspiration. As of early 2026, several foundation models purpose-built for robotics are already in active development and, in some cases, deployed on real hardware — including humanoid robots. The field is moving fast enough that researchers have begun comparing the current moment to the early days of GPT: crude by future standards, but unmistakably the start of something transformative.
What Is a Foundation Model?
A foundation model is a large neural network trained on vast, diverse datasets so that it develops broad, general capabilities that can then be adapted — or "fine-tuned" — for specific downstream tasks. The concept was formalised by Stanford researchers in 2021, but the principle had already been proven in practice by models like GPT-3 for language and CLIP for vision.
The defining characteristic of a foundation model is generality through scale. Rather than training a narrow model on a narrow dataset for a narrow task, you train an enormous model on an enormous dataset and trust that it will learn transferable representations — patterns, concepts and relationships — that remain useful across a wide range of applications. In natural language processing, this means a single pre-trained model can summarise documents, translate languages, write code and answer questions. In computer vision, a single model can recognise objects, describe scenes and detect anomalies.
The robotics community has watched these advances with a mixture of excitement and envy. The excitement is obvious: if a foundation model can learn to understand images and language from web-scale data, perhaps it can also learn to understand physical interactions — how objects behave when pushed, how fabric folds, how a door handle turns. The envy stems from a practical problem: while the internet provides billions of images and trillions of words for training language and vision models, there is no equivalent ocean of robotic interaction data. Robots are expensive, physical experiments are slow, and every laboratory's hardware is different.
Overcoming this data bottleneck is one of the central challenges driving the field — and one of the reasons the current generation of robotic foundation models has adopted some creative solutions.
From Language Models to Robot Control: The Evolution
The application of foundation models to robotics has evolved through several distinct phases, each building on the last.
Phase 1: Language Models as Planners
The earliest approaches used large language models (LLMs) purely as high-level planners. Google's SayCan project in 2022 was a landmark example: a language model would receive a natural language instruction — "I spilled my drink, can you help?" — and generate a step-by-step plan of abstract actions ("find sponge," "pick up sponge," "go to spill," "wipe spill"). Each abstract action was then grounded by checking whether the robot physically had the ability to perform it, using a learned "affordance" function. The actual motor control was handled by separate, task-specific policies.
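The core of SayCan's grounding step is a product of two scores: the language model's likelihood that a step is useful, and the affordance function's estimate that the robot can actually do it. The sketch below is a toy illustration of that idea only; `llm_logprob` stands in for the language model, `affordance` for the learned value function, and all numbers are invented for the example.

```python
import math

# Toy stand-in for the LLM's log-probability of each candidate next step
# given the instruction and plan so far (values are illustrative).
def llm_logprob(instruction: str, step: str) -> float:
    scores = {
        "find sponge": -0.5,
        "pick up sponge": -2.0,
        "go to spill": -3.0,
        "pick up drill": -6.0,
    }
    return scores.get(step, -10.0)

# Toy affordance function: probability the robot can execute the step
# from its current state (in SayCan this is a learned value function).
def affordance(step: str, state: dict) -> float:
    if step == "pick up sponge" and not state["sponge_visible"]:
        return 0.05  # can't grasp what it can't see
    return 0.9

def choose_next_step(instruction, candidates, state):
    # SayCan scores each candidate by LLM probability x affordance,
    # then picks the argmax.
    def score(step):
        return math.exp(llm_logprob(instruction, step)) * affordance(step, state)
    return max(candidates, key=score)

state = {"sponge_visible": False}
candidates = ["find sponge", "pick up sponge", "go to spill", "pick up drill"]
best = choose_next_step("I spilled my drink, can you help?", candidates, state)
print(best)  # "find sponge": linguistically useful AND feasible right now
```

The key property is that neither score alone suffices: "pick up sponge" is linguistically plausible, but the affordance term suppresses it until a sponge is actually in view.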
This was clever, but limited. The language model had no visual understanding of the scene. It could not see that the sponge was already in the robot's gripper, or that the spill was on an elevated surface. It reasoned entirely through text, and the gap between linguistic planning and physical execution was bridged by hand-crafted modules.
Phase 2: Multimodal Models — Seeing and Reasoning
The next step was to give the model eyes. Google's PaLM-E, published in early 2023, was one of the first large-scale "embodied" multimodal models — a system that combined a 540-billion-parameter language model with visual inputs from robot cameras. PaLM-E could look at a scene, understand a spoken or written instruction, and produce a plan grounded in what it actually saw. It demonstrated the ability to guide mobile robots through multi-step tasks based on verbal instructions while also excelling at general vision-and-language benchmarks — showing that a single model could serve as both a general AI assistant and a robotic reasoning engine.
However, PaLM-E and similar models still operated at the level of abstract plans. They would output high-level action descriptions — "pick up the red cup" — which then needed to be translated into actual joint angles and motor commands by lower-level controllers. The model could think, but it could not yet move.
Phase 3: Vision-Language-Action Models — The Full Stack
The real breakthrough came with vision-language-action (VLA) models: systems that take in camera images and language instructions and directly output low-level robot actions — the actual motor commands that move joints, close grippers and navigate through space. This collapses what had been a multi-stage pipeline (perceive → plan → execute) into a single end-to-end model.
Google DeepMind's Robotics Transformer 2 (RT-2), published in mid-2023, established this paradigm. RT-2 took a pre-trained vision-language model and fine-tuned it on robot demonstration data, representing motor commands as sequences of text tokens — in effect, teaching the language model a new "language" of physical actions. The result was a model that could interpret novel instructions, reason about objects it had never seen during robot training (using knowledge inherited from its web-scale pre-training), and output the motor commands to act on its reasoning.
The implications were profound. When asked to pick up an "improvised hammer," RT-2 selected a rock — a connection it could only make by combining web-derived knowledge (rocks are hard, hammers are hard) with physical understanding (how to grasp a rock-shaped object). This kind of semantic transfer — from internet knowledge to physical action — is the core promise of foundation models for robotics.
The VLA Revolution: Key Models and Players
Since RT-2 opened the door, the development of VLA models has accelerated dramatically. Several models and platforms now represent the state of the art, each taking a slightly different approach to the same fundamental challenge.
π0 — Physical Intelligence
Announced in late 2024 by the San Francisco-based startup Physical Intelligence, π0 (pronounced "pi-zero") is one of the most ambitious generalist robot foundation models built to date. Founded by a team including leading researchers from Google and Stanford, Physical Intelligence set out to build a single model capable of controlling any robot to perform any task.
π0 is built on top of Google's PaliGemma vision-language model and trained on a combination of internet-scale vision-language data, the Open X-Embodiment dataset (a large collaborative robotics dataset, discussed below), and Physical Intelligence's own proprietary data from eight different robot platforms performing 68 distinct tasks. What makes π0 architecturally distinctive is its use of flow matching rather than discrete token prediction for generating actions. This allows the model to produce smooth, continuous motor commands at up to 50 Hz — fast enough for the kind of fluid, dexterous manipulation that previous VLAs struggled with.
In demonstrations, π0 showed the ability to fold laundry from a hamper, bus tables, assemble cardboard boxes and sort items — tasks requiring a level of dexterity and adaptive behaviour that had not been achieved by prior generalist systems. The model's weights and code were open-sourced in early 2025 through the openpi repository, making it one of the most accessible robot foundation models for the research community. Physical Intelligence's CEO, Karol Hausman, compared π0's current state to GPT-1 — functional and promising, but representing only the very beginning of what the approach can deliver.
NVIDIA Isaac GR00T N1
Announced at NVIDIA's GTC conference in March 2025, GR00T N1 is the first open foundation model specifically designed for humanoid robots. It uses a dual-system architecture inspired by the psychological distinction between fast, intuitive thinking and slow, deliberate reasoning. "System 2" — a vision-language module based on NVIDIA's Eagle-2 VLM — interprets the environment through camera images and language instructions. "System 1" — a diffusion transformer — then translates that understanding into continuous motor actions in real time.
GR00T N1 was trained on a diverse mixture of real robot trajectories, human demonstration videos, and massive quantities of synthetic data generated using NVIDIA's Omniverse simulation platform. In one striking demonstration of the value of synthetic data, NVIDIA reported generating 780,000 synthetic manipulation trajectories — equivalent to roughly nine months of continuous human demonstration — in just 11 hours. Combining this synthetic data with real-world data improved the model's performance by 40% compared to real data alone.
Critically, GR00T N1 is cross-embodiment: it can be post-trained for different humanoid platforms. At GTC, NVIDIA demonstrated it running on 1X Technologies' NEO robot performing household tidying tasks, and the model has been made available to leading humanoid developers including Agility Robotics, Boston Dynamics, Mentee Robotics and NEURA Robotics. The initial 2-billion-parameter model was released as open source, with larger and more capable versions planned.
Figure AI Helix
Unveiled in February 2025, Helix is Figure AI's proprietary VLA model, purpose-built for its humanoid robots. Helix is notable for being the first VLA demonstrated to control the entire upper body of a humanoid — arms, hands, torso, head and individual fingers — rather than just a single robotic arm.
Like GR00T N1, Helix uses a dual-system architecture. Its "System 2" is a large vision-language model specialised in scene understanding and language comprehension, while "System 1" is a visuomotor policy that translates the higher-level representations into continuous robot actions. The two systems are trained to communicate end-to-end. Helix was trained on approximately 500 hours of robot teleoperation data paired with automatically generated text descriptions, and it has been demonstrated performing tasks like folding clothes and placing dishes in a dishwasher on Figure's humanoid platforms.
Google DeepMind Gemini Robotics
Introduced in 2025, Gemini Robotics extends Google DeepMind's flagship Gemini 2.0 model into the physical domain. Because Gemini is natively multimodal — processing text, images, video and audio — the robotic extension leverages this broad perceptual foundation to enable highly dexterous manipulation. Demonstrations have included remarkably fine motor tasks such as folding origami and manipulating playing cards, alongside strong generalisation to novel platforms and environments.
In June 2025, Google released Gemini Robotics On-Device, a lightweight version optimised to run locally on robot hardware with low latency, addressing one of the key practical challenges of deploying large foundation models on physical platforms.
OpenVLA and SmolVLA — The Open-Source Frontier
Not all progress is happening at the billion-dollar scale. OpenVLA, a 7-billion-parameter open-source VLA developed by Stanford-led researchers in mid-2024, was trained on the Open X-Embodiment dataset and demonstrated that the VLA paradigm could be made accessible to the broader research community. It fuses SigLIP and DINOv2 visual encoders with a Llama 2 language backbone, and outputs discrete action tokens — a simpler approach than flow matching but effective for a wide range of manipulation tasks.
Going further in the direction of accessibility, Hugging Face released SmolVLA in 2025 — a compact VLA with just 450 million parameters, trained entirely on the open-source LeRobot dataset. Despite being a fraction of the size of models like π0, SmolVLA achieved comparable performance on several benchmarks, demonstrating that effective VLA behaviour does not necessarily require enormous scale. SmolVLA uses flow matching for continuous control and asynchronous inference to decouple the VLM backbone from action execution, keeping latency manageable even on modest hardware.
How These Models Actually Work
While each model has its own architectural nuances, the general pattern of a modern robotic foundation model follows a common structure that is worth understanding.
Pre-training on Internet-Scale Data
Every current VLA starts with a pre-trained vision-language model as its backbone. This might be PaLM-E, PaliGemma, Eagle-2, Gemini or another large VLM. The key insight is that these models have already learned an enormous amount about the visual world and the structure of language from their web-scale training — knowledge that transfers remarkably well to robotic settings. A model that has seen billions of images of kitchens, tools, fabrics and household objects already has a rich understanding of what these things look like, how they relate to each other and what words describe them. This semantic understanding is the "free" foundation that makes robotic foundation models viable despite the relative scarcity of robot-specific training data.
Fine-tuning on Robot Data
The pre-trained VLM is then fine-tuned on datasets of robot demonstrations — typically video footage from the robot's cameras paired with the corresponding motor commands and language descriptions of the task being performed. This is where the model learns to connect its pre-existing visual and linguistic understanding to physical actions. The robot data may come from teleoperation (a human remotely controlling the robot while it records), autonomous data collection, simulation, or — increasingly — large-scale collaborative datasets like Open X-Embodiment.
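At its core, this fine-tuning step is imitation learning (behaviour cloning): minimise the difference between the model's predicted actions and the demonstrated ones. The sketch below shows only that training loop, with a toy linear "policy" and synthetic demonstrations standing in for the real VLA and robot data; nothing here reflects any particular model's implementation.

```python
import random

random.seed(0)

# Toy demonstration dataset: (observation features, demonstrated action).
# In a real VLA the observation is camera images plus tokenised language,
# and the action is a vector of motor commands.
def make_demo():
    obs = [random.uniform(-1, 1) for _ in range(4)]
    action = sum(obs) * 0.5          # hidden "expert" behaviour to imitate
    return obs, action

dataset = [make_demo() for _ in range(200)]

# Behaviour cloning: fit the policy to the demonstrated actions by
# minimising mean squared error with stochastic gradient descent.
weights = [0.0] * 4
lr = 0.1
for epoch in range(50):
    for obs, action in dataset:
        pred = sum(w * x for w, x in zip(weights, obs))
        err = pred - action
        weights = [w - lr * err * x for w, x in zip(weights, obs)]

# After training, each weight should approach the expert's 0.5.
print([round(w, 2) for w in weights])
```

Real systems replace the linear map with a billion-parameter network and the scalar action with a high-dimensional action chunk, but the objective, matching the demonstrator, is the same.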
Action Representation
A crucial design choice is how robot actions are encoded. Two dominant approaches have emerged. The first, pioneered by RT-2, represents actions as discrete tokens — essentially treating motor commands as another "language" that the model generates alongside text. Each action becomes a sequence of integer tokens encoding things like end-effector position, rotation and gripper state. This approach is elegant and integrates naturally with the language model's existing token-generation machinery, but it can sacrifice precision when fine motor control is required.
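The discretisation itself is straightforward. Below is a sketch of the token encode/decode round trip, assuming 256 uniform bins per action dimension and a normalised [-1, 1] action range; these are common illustrative choices, not any specific model's exact configuration.

```python
N_BINS = 256  # illustrative discretisation granularity

def encode_action(action, low=-1.0, high=1.0):
    # Map each continuous dimension (e.g. end-effector dx, dy, dz,
    # rotation, gripper) to an integer token in [0, N_BINS - 1].
    tokens = []
    for a in action:
        a = min(max(a, low), high)                       # clamp to range
        bin_idx = int((a - low) / (high - low) * (N_BINS - 1) + 0.5)
        tokens.append(bin_idx)
    return tokens

def decode_action(tokens, low=-1.0, high=1.0):
    # Invert the discretisation: token index back to a bin-centre value.
    return [low + t / (N_BINS - 1) * (high - low) for t in tokens]

action = [0.12, -0.53, 0.9, 0.0]       # hypothetical normalised action
tokens = encode_action(action)
recovered = decode_action(tokens)
# Round-trip error is bounded by half a bin width (~0.004 here) --
# this quantisation error is exactly the precision cost noted above.
print(tokens, [round(v, 3) for v in recovered])
```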
The second approach, championed by π0 and adopted by GR00T N1 and Helix, uses flow matching or diffusion models to generate continuous actions directly. Rather than discretising movements into tokens, the model produces smooth, high-frequency trajectories through a denoising process. This enables more precise and fluid control, particularly for dexterous manipulation tasks, and scales better to robots with many degrees of freedom. The trade-off is greater computational cost at inference time.
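The inference-time loop of flow matching can be sketched without a neural network. In the toy below, the learned velocity network is replaced by the analytic velocity field for a straight-line (rectified-flow) path to a single known target, so the example shows only the noise-to-action integration; a trained model would predict the velocity from images, language and robot state instead.

```python
import random

random.seed(1)

TARGET = [0.3, -0.7, 0.5]  # hypothetical action (stand-in for the network's prediction)

def velocity(x, t):
    # For a straight-line probability path toward a single target, the
    # ideal velocity field is (target - x) / (1 - t). A real model
    # approximates this with a neural network conditioned on observations.
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]

def sample_action(steps=10):
    # Start from Gaussian noise and Euler-integrate from t=0 toward t=1.
    x = [random.gauss(0, 1) for _ in TARGET]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x

action = sample_action()
print([round(a, 3) for a in action])  # converges to TARGET from any starting noise
```

The output is continuous, not a bin index, which is why this family of methods avoids the quantisation error inherent to token-based actions; the price is running several integration steps per action chunk.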
Dual-System Architectures
Several of the most recent models — GR00T N1, Helix and others — have adopted a dual-system design that separates high-level perception and reasoning (System 2) from low-level motor control (System 1). System 2, typically a large VLM, processes images and language instructions slowly and deliberately. System 1, typically a smaller and faster diffusion-based or flow-matching model, generates real-time motor actions based on the representations produced by System 2. This decoupling allows the system to combine broad generalisation with fast, precise physical control — a practical necessity for robots that need to react to a changing environment in real time.
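That division of labour can be sketched as a single control loop. The rates (2 Hz and 50 Hz) and the module payloads below are assumptions for illustration; real systems run the two modules on separate compute and often asynchronously, but the core pattern of a slow latent refresh feeding a fast action loop is the same.

```python
# Toy dual-system control loop (rates are illustrative assumptions):
# System 2 (large VLM) refreshes a scene/task latent at ~2 Hz, while
# System 1 (small action policy) emits motor commands at ~50 Hz using
# whatever latent is most recent.

SYSTEM1_HZ = 50
SYSTEM2_HZ = 2
TICKS_PER_LATENT = SYSTEM1_HZ // SYSTEM2_HZ  # System 1 steps per System 2 update

def system2(observation, instruction):
    # Stand-in for the slow VLM: produce a latent "plan" representation.
    return {"goal": instruction, "obs_snapshot": observation}

def system1(latent, proprioception):
    # Stand-in for the fast policy: map latent + joint state to an action.
    return {"joint_deltas": [0.01] * 7, "gripper": latent["goal"] == "close"}

def control_loop(instruction, n_ticks=100):
    latent = None
    actions = []
    for tick in range(n_ticks):
        if tick % TICKS_PER_LATENT == 0:
            # Slow path: re-run perception and reasoning.
            latent = system2(observation=f"frame_{tick}", instruction=instruction)
        # Fast path: always runs, using the latest available latent.
        actions.append(system1(latent, proprioception=[0.0] * 7))
    return actions

acts = control_loop("close", n_ticks=100)
print(len(acts))  # 100 System 1 actions; System 2 ran only 4 times
```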
The Data Challenge
If architecture is the engine of robotic foundation models, data is the fuel — and it remains the field's most significant bottleneck.
Large language models train on trillions of tokens of text scraped from the internet. Vision models train on billions of image-text pairs. Robotic foundation models, by contrast, must work with dramatically less data. Physical experiments are slow, hardware is expensive, every laboratory has different robots, and the diversity of real-world environments is practically infinite. No single organisation can collect enough data on its own.
Open X-Embodiment
The most significant collaborative effort to address this challenge is the Open X-Embodiment (OXE) dataset, an initiative led by Google DeepMind that pooled robotic demonstration data from 21 research institutions worldwide. The resulting dataset contains over one million real robot trajectories spanning 22 different robot embodiments — from single-arm manipulators and bimanual systems to mobile platforms and quadrupeds — performing more than 500 distinct manipulation skills.
OXE was a landmark for the field, not just for its scale but for its proof of concept: models trained on this diverse, cross-embodiment data showed clear positive transfer. Google's RT-2-X, trained on the OXE dataset, achieved roughly three times the generalisation performance of models trained only on data from the evaluation robot. Policies trained on many different robots became better at controlling each individual robot than policies trained on that robot's data alone. This echoed the pattern seen in NLP and computer vision, where diverse pre-training data consistently outperforms narrow, task-specific datasets.
OXE is now widely used as a pre-training foundation by models including π0, OpenVLA and others, and it continues to grow as more institutions contribute their data.
Synthetic Data and Simulation
An increasingly important complement to real-world data is synthetic data generated in physics simulators. NVIDIA's approach with GR00T N1 is illustrative: using the Omniverse platform and Cosmos world foundation models, NVIDIA generated hundreds of thousands of synthetic manipulation trajectories from a small number of real human demonstrations — creating months' worth of equivalent training data in hours.
The appeal is obvious: synthetic data is cheap, infinitely scalable, fully annotated and can cover scenarios that would be dangerous or impractical to stage in reality. The challenge is the sim-to-real gap — the inevitable differences between simulated physics and real-world physics that can cause a policy trained in simulation to fail when deployed on actual hardware. Bridging this gap through domain randomisation, physics engine improvements and careful transfer learning is an active area of research. The development of Newton, a new open-source physics engine being built collaboratively by NVIDIA, Google DeepMind and Disney Research, specifically targets this problem by providing more accurate and computationally efficient simulation of the physical interactions that robots encounter.
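Domain randomisation itself is conceptually simple: sample the simulator's physics parameters anew each episode, so the policy cannot overfit to any one (inevitably wrong) setting. The sketch below uses invented parameter ranges and a deliberately brittle stand-in policy to show the kind of weakness that a fixed simulator would hide.

```python
import random

random.seed(42)

# Toy domain randomisation: each simulated episode samples physics
# parameters from broad ranges. Ranges here are illustrative, not
# tuned to any particular simulator.
def randomise_physics():
    return {
        "friction":      random.uniform(0.5, 1.5),
        "object_mass":   random.uniform(0.05, 2.0),   # kg
        "motor_latency": random.uniform(0.0, 0.04),   # seconds
        "camera_noise":  random.uniform(0.0, 0.02),
    }

def run_episode(policy, params):
    # Stand-in for a full simulator rollout under the sampled parameters.
    return {"params": params, "success": policy(params)}

def evaluate(policy, n_episodes=1000):
    episodes = [run_episode(policy, randomise_physics()) for _ in range(n_episodes)]
    return sum(e["success"] for e in episodes) / n_episodes

# A hypothetical policy that only copes with low-friction settings:
# randomisation exposes the failure that a single fixed setting would mask.
brittle = lambda p: p["friction"] < 1.0
print(evaluate(brittle))  # roughly 0.5: succeeds in about half the sampled worlds
```

Training (rather than just evaluating) across such sampled worlds pushes the policy toward behaviours that work under the whole distribution, which is what makes transfer to the one real world more likely.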
For a more detailed look at training robots in simulation, see our article on Sim-to-Real Transfer: Training Robots in Virtual Worlds.
Human Video Data
A third data source gaining importance is ordinary video of humans performing tasks — cooking, cleaning, assembling, repairing. These videos are abundant online and capture an enormous diversity of physical interactions, objects and environments. While they do not include robot actions, they provide rich demonstrations of how tasks are structured, how objects behave, and what successful outcomes look like. Models like GR00T N1 incorporate human video data as part of their training pyramid, using it to build a broad foundation of visual and task understanding that is then refined with robot-specific data.
What Foundation Models Mean for Humanoid Robots
Foundation models are particularly significant for humanoid robotics — and not by accident. The humanoid form factor, with its many degrees of freedom, multiple limbs and need to operate in unstructured human environments, has always posed the hardest control challenges in robotics. A humanoid robot performing household tasks might need to coordinate two arms, ten or more fingers, a mobile base and head-mounted cameras simultaneously, while adapting to infinite variations in object placement, lighting, surface textures and task requirements.
Traditional approaches — programming each behaviour individually or training narrow task-specific models — simply cannot scale to this level of complexity. Foundation models offer a different path: train a single, general model on diverse enough data that it develops broad physical competence, then fine-tune it for specific platforms and tasks.
This is exactly the approach being taken by the leading humanoid developers. NVIDIA's GR00T N1 has been adopted by companies across the humanoid ecosystem. Figure AI's Helix drives its humanoid robots' upper-body manipulation. 1X Technologies demonstrated its NEO humanoid performing domestic tidying tasks powered by a GR00T-based policy at GTC 2025. The emerging pattern is clear: the humanoid hardware platforms are converging on foundation models as their control layer, in the same way that smartphones converged on general-purpose operating systems rather than bespoke firmware for each device.
For more on how AI systems control humanoid robots, see our deep dive on AI & The Robot Brain. For an overview of current humanoid platforms, see our Comparison Guide: Current Humanoid Robots.
Limitations and Open Challenges
For all their promise, robotic foundation models in early 2026 remain far from mature. Several fundamental challenges persist.
Physical Precision
Current VLAs are strongest at semantic understanding (knowing what to do) and weakest at fine physical execution (doing it with precision). Picking up an object from a table works reliably; threading a needle does not. The gap between linguistic reasoning and sub-millimetre motor accuracy remains significant, and tasks that require sustained force control, compliant manipulation or dynamic physical interaction (catching a ball, wiping a surface with consistent pressure) are still largely unsolved by foundation model approaches.
Long-Horizon Reasoning
Most current models perform best on tasks that can be completed in under a minute. Longer, multi-stage tasks — cooking a meal, tidying an entire room, assembling a piece of furniture — require sustained reasoning, memory of what has been done, and the ability to recover from failures and unexpected situations. This is an active frontier, with approaches like Physical Intelligence's Multi-Scale Embodied Memory (MEM) beginning to give models both long-term and short-term recall for tasks lasting more than ten minutes.
Safety and Robustness
A robot operating in a home or workplace cannot afford the failure modes that are acceptable in a language model. A chatbot that occasionally produces a wrong answer is a nuisance; a humanoid robot that occasionally misidentifies a fragile object or misjudges force could cause injury or damage. Ensuring that foundation model-controlled robots are safe, predictable and transparent in their decision-making is a challenge that the field has only begun to address — and one that regulators are watching closely.
For more on the regulatory landscape, see our article on Safety Standards & Regulation for Humanoid Robots.
Computational Demands
Large foundation models require significant computing resources, both for training and for inference. Training GR00T N1 consumed approximately 50,000 GPU hours on NVIDIA H100 hardware. At inference time, the model needs to run fast enough to generate actions in real time — a challenge when the robot's onboard compute is limited compared to a data centre. The development of lighter models like Gemini Robotics On-Device and SmolVLA reflects the practical imperative to reduce these requirements, and model compression, quantisation and efficient inference strategies are critical enabling technologies.
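As a concrete illustration of one such technique, here is a toy per-tensor int8 post-training quantisation. The weights are invented and no real model or library is involved; the point is only the mechanism: store each weight as an 8-bit integer plus one shared scale, cutting memory roughly 4x versus float32 at a small, bounded accuracy cost.

```python
# Toy per-tensor int8 post-training quantisation.

def quantise_int8(weights):
    # One scale for the whole tensor, chosen so the largest weight
    # maps to the edge of the int8 range.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    # Recover approximate float weights for use at inference time.
    return [qi * scale for qi in q]

weights = [0.42, -1.3, 0.007, 0.88, -0.55]   # illustrative float32 weights
q, scale = quantise_int8(weights)
recovered = dequantise(q, scale)

# Error per weight is bounded by half a quantisation step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(q, round(max_err, 4))
```

Production systems layer further tricks on top (per-channel scales, quantisation-aware training, 4-bit formats), but the storage-versus-precision trade-off shown here is the common core.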
Data Diversity and Bias
Foundation models learn from their training data, and the available robotic datasets, while growing, still under-represent many real-world environments, cultural contexts and use cases. A model trained predominantly on data from American and European research labs may not generalise well to homes and workplaces in other parts of the world. As these models move toward commercial deployment, ensuring diversity and representativeness in training data will become increasingly important.
Where This Is Heading
The trajectory of foundation models for robotics closely mirrors the trajectory of foundation models for language, with a lag of roughly two to three years. The field has passed through its "GPT-1 moment" (models that work but are limited), is entering its "GPT-2/3 moment" (rapidly scaling models with increasingly impressive generalisation), and is heading toward its "GPT-4 moment" (models capable enough for real-world commercial deployment at scale).
Several trends are likely to define the next phase:
- Scaling laws for robotics: Researchers are actively working to establish whether the scaling laws that govern language models — where performance improves predictably with more data and compute — hold in the robotic domain. Early results from companies like Generalist AI, which is training models on over 270,000 hours of real-world manipulation data, suggest that robotic model performance does scale with data — but that the diversity of environments and objects may matter more than sheer volume of demonstrations.
- Reinforcement learning from real-world experience: Current models are primarily trained via imitation learning — watching demonstrations and learning to replicate them. The next generation will increasingly learn from their own experience through reinforcement learning, allowing robots to improve autonomously through trial and error after deployment. Physical Intelligence has already published work on training its generalist policies with RL to improve real-world success rates.
- Open-source acceleration: The open-sourcing of models like π0, GR00T N1, OpenVLA and SmolVLA, alongside open datasets like OXE and open tools like Hugging Face's LeRobot, is creating a flywheel effect. More accessible models attract more researchers, who generate more data and innovations, which improve the models further. This mirrors the open-source dynamics that accelerated progress in large language models.
- Specialisation within generality: While the long-term vision is a single general-purpose model, the practical near-term pattern is likely to be general pre-training followed by domain-specific post-training — a general-purpose robotics foundation model fine-tuned for warehouse logistics, or surgical assistance, or domestic service. This mirrors the pre-train-then-specialise pattern that has proven effective in NLP.
Why This Matters
Foundation models for robotics represent something more than an incremental improvement in robot software. They represent a fundamental change in how robots acquire capabilities. Instead of being programmed task by task, robots are learning to understand the world — and their place in it — from the same vast reservoirs of human knowledge that power modern AI.
For the humanoid robotics industry specifically, this is arguably the missing piece. The hardware has been advancing steadily — bipedal locomotion, dexterous hands, lightweight actuators, improved batteries. But without a control system capable of general intelligence — one that can handle the open-ended, unpredictable demands of human environments — even the most impressive hardware is limited to scripted demonstrations and controlled settings.
Foundation models are the bridge from demonstration to deployment. The bridge is still being built, and there is a great deal of engineering, research and real-world validation between here and the finish line. But the direction is unmistakable. The robot brain is being rewritten — and the implications, for industry, for labour, for daily life, are as significant as they are uncertain.
For a broader perspective on the AI systems that power humanoid robots, explore our full AI–Robotics Intersection section.