
Understanding the software, models, and learning systems that give humanoid robots the ability to perceive, reason, and act in the real world.
A humanoid robot's body — its motors, sensors, and mechanical joints — is only half the story. Without an intelligent software system to process information and make decisions, even the most beautifully engineered robot is just an expensive mannequin. The real breakthrough driving the current generation of humanoid robots isn't better hardware. It's artificial intelligence.
Over the past few years, rapid advances in machine learning, computer vision, and large-scale AI models have fundamentally changed what's possible. Robots that once followed rigid, pre-programmed scripts can now interpret spoken instructions, recognise objects they've never seen before, and adapt their behaviour to new situations — at least some of the time, and within limits. This article explains how.
The Software Stack: How a Robot's Brain Is Organised
A humanoid robot's intelligence isn't a single program. It's a layered system, often compared to the human nervous system. Engineers typically describe it as having a "big brain" for high-level thinking and a "little brain" for low-level motor control.
At the top sits the cognitive layer — the AI system responsible for perception, understanding, and decision-making. This is where the robot processes camera images, interprets language commands, plans tasks, and decides what to do next. It runs on the robot's main processors or dedicated AI chips.
Below that is the motion control layer — a faster, more reactive system that translates high-level decisions into precise joint movements. Low-level microcontrollers near each joint handle rapid motor control loops, managing things like torque, position, and speed many times per second. This layer keeps the robot balanced, moves its limbs smoothly, and responds to physical disturbances faster than the cognitive layer could.
The two layers must work together seamlessly. The cognitive layer might decide "pick up that cup," but it's the motion control layer that figures out exactly how to coordinate dozens of joints to reach out, grasp the handle, and lift without spilling.
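The split can be made concrete with a toy two-rate loop. Everything here is illustrative: the function names, rates, gains, and the three-joint "robot" are invented for the sketch, not taken from any real robot's stack.

```python
def cognitive_step(observation):
    """Slow loop (roughly 1-10 Hz): perception, planning, goal selection."""
    # Hypothetical output: a target joint configuration chosen from the observation.
    return {"target_positions": [0.0, 0.5, -0.3]}

def motion_control_step(goal, joint_state, kp=20.0):
    """Fast loop (hundreds of Hz to kHz): proportional control toward the goal."""
    return [kp * (t - q) for t, q in zip(goal["target_positions"], joint_state)]

# One slow cognitive decision is reused across many fast control ticks.
goal = cognitive_step(observation=None)
joint_state = [0.0, 0.0, 0.0]
for _ in range(100):  # many fast ticks per cognitive decision
    torques = motion_control_step(goal, joint_state)
    # Toy plant model: each joint drifts a little in the commanded direction.
    joint_state = [q + 0.001 * tau for q, tau in zip(joint_state, torques)]
```

The point of the structure is that the fast loop keeps running and stabilising even while the slow loop is still thinking about its next decision.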
Computer Vision: How Robots See
For a humanoid robot, vision is the primary sense. Modern robots use a combination of cameras, depth sensors, and sometimes LiDAR to build a rich understanding of their surroundings.
Raw camera images are processed by computer vision algorithms — increasingly based on deep neural networks — that can identify objects, estimate their position and orientation, recognise human faces and gestures, and track movement in real time. Depth sensors add a three-dimensional understanding of the scene, letting the robot know not just what something is, but where it is in space.
The challenge is doing all of this quickly and reliably. A robot reaching for a glass on a cluttered kitchen counter needs to identify the glass, estimate its exact position, plan a collision-free path for its arm, and adjust on the fly if something moves — all within fractions of a second. Current vision systems still struggle in some conditions that humans find trivial, such as low light, reflective surfaces, and transparent objects like clear plastic or glass.
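One concrete piece of this pipeline is back-projection: once a detector has located an object at a pixel and a depth sensor has reported its distance, the standard pinhole camera model converts that pixel into a 3D position. A minimal sketch, with invented calibration values (real intrinsics come from camera calibration):

```python
def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth reading into camera-frame
    coordinates using the pinhole model: x = (u - cx) * z / fx, etc."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Illustrative intrinsics for a 640x480 camera.
point = pixel_to_3d(u=400, v=240, depth_m=0.8,
                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
# The glass is about 0.11 m right of the optical axis, 0.8 m away.
```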
Foundation Models: The AI Revolution Hitting Robotics
The single biggest shift in robot AI over the past two years has been the arrival of foundation models — large neural networks pre-trained on vast datasets that can be adapted to many different tasks.
You're already familiar with this idea if you've used a large language model (LLM) like ChatGPT or Claude. These models learn general knowledge about language from billions of text examples, then apply that knowledge to answer questions, write code, or hold conversations. The same principle is now being applied to robotics, with models that understand not just language, but also images, video, and physical actions.
Vision-Language-Action Models (VLAs)
The most significant new category of AI model in robotics is the vision-language-action model, or VLA. A VLA takes three types of input — a camera image of the robot's environment, a natural language instruction (like "put the red cup in the sink"), and the robot's current physical state — and directly outputs low-level motor commands that the robot can execute.
This is a fundamental departure from older approaches, where perception, planning, and control were handled by entirely separate systems that had to be painstakingly integrated. A VLA attempts to do it all in one unified model.
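The input/output contract of a VLA can be written down in a few lines. The model itself is a large neural network; the trivial stand-in function below exists only to make the three inputs and the motor-command output concrete (every name here is invented):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLAInput:
    image: List[List[float]]      # camera frame (toy placeholder for pixels)
    instruction: str              # e.g. "put the red cup in the sink"
    joint_positions: List[float]  # the robot's current physical state

def vla_policy(obs: VLAInput) -> List[float]:
    """Stand-in for a trained model: one observation in, one vector of
    joint targets out. A real VLA is a neural network, not a rule."""
    # Hypothetical behaviour: nudge every joint toward zero.
    return [0.9 * q for q in obs.joint_positions]

obs = VLAInput(image=[[0.0]], instruction="put the red cup in the sink",
               joint_positions=[0.2, -0.1, 0.4])
command = vla_policy(obs)
```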
Key VLA models that have emerged include:
- NVIDIA GR00T N1 — Released in March 2025 as the first open foundation model for humanoid robots, GR00T N1 uses a dual-system architecture: a vision-language model for reasoning and planning, paired with a diffusion transformer for generating smooth, continuous robot movements. NVIDIA updated it to version N1.6 in early 2026 with improved perception and a larger action model. It has been adopted by companies including Agility Robotics, Boston Dynamics, and 1X Technologies.
- Figure AI's Helix — A VLA designed specifically for whole-body humanoid control, Helix was the first model capable of controlling an entire humanoid upper body — arms, wrists, torso, head, and individual fingers — from a single neural network. It was trained on roughly 500 hours of teleoperation data. The updated Helix 02 can perform long-horizon tasks like autonomously unloading and reloading a dishwasher across multiple rooms, integrating walking, manipulation, and balance.
- Physical Intelligence's π0 — A generalist VLA trained on trajectories from eight different robot types, π0 introduced flow-matching techniques for generating high-frequency continuous actions at up to 50 Hz. Its successor, π0.6, learns from real-world experience to improve over time.
- Google DeepMind's Gemini Robotics — Built on the Gemini 2.0 multimodal model, Gemini Robotics extends the model's reasoning capabilities into physical action, enabling highly dexterous tasks such as folding origami and manipulating playing cards. A lightweight on-device version was released in mid-2025 for low-latency robot control.
- Hugging Face's SmolVLA — An open-source compact model with just 450 million parameters, SmolVLA demonstrates that effective robot control doesn't necessarily require massive models, achieving comparable performance to much larger systems.
As of mid-2025, most VLAs ranged from 500 million to 7 billion parameters — large by robotics standards, but small compared to the biggest language models.
How VLAs Actually Work
At a simplified level, a VLA has two main components. A vision-language encoder (typically a vision transformer) processes the camera image and the text instruction together, converting them into a rich internal representation — a kind of compressed understanding of "what's happening and what needs to be done." An action decoder then transforms that representation into a sequence of motor commands the robot can execute.
The action decoder is where much of the current innovation is happening. Early approaches simply predicted a single next action, but modern VLAs use techniques borrowed from image generation — particularly diffusion models and flow matching — to generate entire "chunks" of smooth, temporally coherent movements. This produces more natural, fluid motion rather than jerky, frame-by-frame actions.
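A toy stand-in for chunked decoding shows the idea of emitting a short, smooth sequence per model query. Plain linear interpolation substitutes here for a learned diffusion or flow-matching decoder; only the chunking structure is the point:

```python
def decode_action_chunk(current, target, chunk_len=8):
    """Emit a chunk of temporally smooth actions interpolating toward a
    predicted target. A real decoder generates the chunk with a learned
    generative model rather than interpolation."""
    chunk = []
    for k in range(1, chunk_len + 1):
        alpha = k / chunk_len
        chunk.append([(1 - alpha) * c + alpha * t
                      for c, t in zip(current, target)])
    return chunk

chunk = decode_action_chunk(current=[0.0, 0.0], target=[0.4, -0.2], chunk_len=4)
# The robot executes the whole chunk, then queries the model again.
```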
Reinforcement Learning: How Robots Learn Through Trial and Error
Both before the VLA era and very much alongside it, reinforcement learning (RL) has been the dominant approach for teaching robots physical skills — especially locomotion.
The core idea is simple: the robot tries something, gets a reward signal indicating how well it did, and gradually adjusts its behaviour to maximise that reward. In practice, this means running millions of simulated trials where a virtual robot attempts to walk, recover from pushes, climb stairs, or manipulate objects. Over time, the AI discovers effective strategies that no human engineer explicitly programmed.
RL has been particularly transformative for bipedal walking. Traditional approaches relied on carefully hand-crafted physics models and control equations — methods that worked in controlled lab settings but struggled to adapt to the unpredictable real world. Learning-based approaches, by contrast, can develop robust walking policies that handle uneven terrain, external disturbances, and novel situations.
A landmark demonstration came when researchers trained a causal transformer model using large-scale RL in simulation, then deployed it on Agility Robotics' Digit humanoid with zero real-world training. The robot could walk over various outdoor terrains, recover from pushes, and even exhibited emergent behaviours like natural arm swinging that nobody explicitly programmed.
More recently, RL-trained humanoids have demonstrated increasingly athletic capabilities — climbing boxes, performing parkour manoeuvres, executing martial arts movements, and even recovering from sitting positions on uneven ground. Unitree's humanoid robots performing synchronised kung fu at China's 2026 Spring Festival Gala showcased how far these techniques have progressed.
The Challenge: Reward Design and Safety
RL sounds straightforward in principle, but in practice, designing good reward functions is notoriously difficult. A poorly designed reward can lead to unexpected and sometimes dangerous behaviours — the robot might find clever shortcuts that technically maximise the reward but look nothing like the intended behaviour. Researchers are exploring techniques like inverse reinforcement learning, where the AI infers what the reward should be by watching human demonstrations, and reward learning, where another AI model learns to evaluate the robot's performance.
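A tiny example of the problem: a walking reward that counts only forward velocity is happily maximised by a policy that dives forward and crashes. Shaping terms, with hand-tuned and entirely illustrative weights, restore the intended ordering; picking those weights is exactly the difficulty.

```python
def naive_reward(forward_velocity, torso_height, energy):
    # Rewards speed only: a policy that lunges forward and falls scores well.
    return forward_velocity

def shaped_reward(forward_velocity, torso_height, energy,
                  target_height=0.9, w_height=2.0, w_energy=0.01):
    """Illustrative shaping: reward speed, penalise falling and wasted energy."""
    height_penalty = w_height * abs(torso_height - target_height)
    return forward_velocity - height_penalty - w_energy * energy

# A "diving" policy: briefly fast, but the torso ends up near the floor.
dive = dict(forward_velocity=2.0, torso_height=0.1, energy=50.0)
walk = dict(forward_velocity=1.0, torso_height=0.9, energy=20.0)

assert naive_reward(**dive) > naive_reward(**walk)    # naive reward prefers the crash
assert shaped_reward(**dive) < shaped_reward(**walk)  # shaping flips the ordering
```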
Safety is a fundamental concern. A robot exploring through trial and error in the real world could damage itself, its surroundings, or people nearby. This is why almost all RL training happens in simulation first.
Sim-to-Real: Training in Virtual Worlds
One of the most important concepts in modern robot AI is sim-to-real transfer — training a robot's AI in a simulated environment and then deploying the learned behaviour on a physical robot.
Physics simulation engines like NVIDIA Isaac Sim, MuJoCo (acquired and now maintained by Google DeepMind), and the new open-source Newton engine (a collaboration between NVIDIA, Google DeepMind, and Disney Research) create virtual environments where robots can train millions of times faster than in the real world, without any risk of physical damage.
The approach is powerful because simulation can generate enormous amounts of training data. NVIDIA demonstrated this vividly: using their Isaac GR00T Blueprint, they generated 780,000 synthetic motion trajectories — equivalent to roughly nine continuous months of human demonstration data — in just 11 hours. Combining this synthetic data with real-world data improved their GR00T N1 model's performance by 40%.
But there's a catch. Simulations are never perfectly accurate. The gap between virtual physics and real-world physics — known as the reality gap — means that a policy that works flawlessly in simulation might fail on a real robot. Surfaces have different friction than expected. Joints have slightly different properties. Lighting changes. Objects behave unexpectedly.
Researchers address this through a technique called domain randomisation: during training, the simulation randomly varies things like surface friction, object mass, lighting conditions, sensor noise, and motor response. By training across thousands of randomised environments, the AI learns policies robust enough to handle the messiness of the real world. This is how zero-shot sim-to-real transfer — deploying a policy on a real robot with no additional real-world training — has become increasingly achievable.
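In code, domain randomisation is little more than sampling a fresh set of physics and sensing parameters for each training episode. The parameter names and ranges below are invented for illustration, not taken from any published training recipe:

```python
import random

def randomise_domain(rng):
    """Sample one randomised training environment."""
    return {
        "friction": rng.uniform(0.4, 1.2),            # surface friction coefficient
        "object_mass_kg": rng.uniform(0.1, 2.0),
        "motor_strength_scale": rng.uniform(0.8, 1.2),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
        "light_intensity": rng.uniform(0.3, 1.5),
    }

rng = random.Random(42)
envs = [randomise_domain(rng) for _ in range(1000)]
# Training one policy across all of these forces it to cope with variation
# it will inevitably meet on the other side of the reality gap.
```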
Imitation Learning: Teaching by Demonstration
Not everything a robot needs to learn lends itself to trial-and-error. For many manipulation tasks — folding clothes, loading a dishwasher, handing someone an object — it's more practical to show the robot what to do.
Imitation learning (sometimes called learning from demonstration) involves a human operator controlling the robot through a task, typically via teleoperation, and then training the AI to replicate that behaviour. The robot learns a policy that maps from what it sees and feels to the actions it should take.
Teleoperation methods are becoming increasingly sophisticated. Some companies use VR headsets and hand controllers, allowing operators to intuitively guide a robot through complex bimanual tasks. NVIDIA's pipeline even supports the Apple Vision Pro as a spatial computing interface for recording robot demonstrations in simulation.
The latest VLA models combine imitation learning with their broader training. A relatively small number of human demonstrations can be used to "post-train" or fine-tune a foundation model for a specific task or robot, while the foundation model provides general capabilities learned from much larger datasets.
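At its core, imitation learning is supervised learning on (observation, action) pairs recorded during teleoperation. A deliberately tiny version, fitting a one-parameter linear policy by least squares to made-up demonstrations:

```python
def fit_linear_policy(demos):
    """Behaviour cloning in miniature: fit action = w * observation by
    least squares over teleoperated (observation, action) pairs."""
    num = sum(obs * act for obs, act in demos)
    den = sum(obs * obs for obs, act in demos)
    return num / den

# Toy demonstrations of a 'reach' skill: the operator moved the joint
# in proportion to the observed distance to the target.
demos = [(0.2, 0.1), (0.4, 0.2), (0.8, 0.4)]
w = fit_linear_policy(demos)
action = w * 0.6  # the cloned policy generalises to an unseen observation
```

Real systems replace the linear fit with a deep network and the scalar observation with images and joint states, but the training signal is the same: match what the demonstrator did.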
Large Language Models: The Reasoning Layer
Large language models have found a somewhat unexpected role in robotics — not as motor controllers, but as high-level reasoning and planning engines.
When a humanoid robot receives a complex instruction like "tidy up the kitchen," it needs to break that down into a sequence of concrete actions: identify which items are out of place, determine where each belongs, plan the order of operations, and handle exceptions. LLMs are well suited to this kind of semantic reasoning and task decomposition.
In practice, this often works as a hierarchical system. The LLM handles high-level planning and natural language understanding, while a separate control policy — often a VLA or RL-trained model — handles the physical execution. The LLM might decide "first, pick up the mug from the table and place it in the dishwasher," and the lower-level controller figures out the motor commands to actually do it.
This integration of language models into robotic systems is part of a broader trend called embodied AI — the idea that meaningful intelligence requires not just processing information, but physically interacting with the world. Proponents argue that connecting AI to a physical body grounds its understanding in reality in ways that purely digital AI cannot achieve.
The Data Challenge
AI models are only as good as the data they're trained on, and robotics faces a particularly acute data problem. While language models can train on trillions of words of text scraped from the internet, there is no equivalent ocean of robot manipulation data.
Real-world robot data is expensive and slow to collect — it requires physical robots, human operators, and careful setup. The Open X-Embodiment dataset, one of the largest collaborative efforts, unifies data from 22 different robot types covering more than 500 skills, but even this is modest compared to datasets in other AI domains.
This is why synthetic data generation has become so important. The combination of simulation, domain randomisation, and increasingly realistic rendering (using tools like NVIDIA's Cosmos world foundation models to add photorealism to simulated scenes) is helping to close the data gap. Internet-scale video of humans performing tasks is another valuable data source — robots can learn about the physics of manipulation by watching how humans interact with objects, even though the robot's body is different.
Scaling laws observed in language models — where performance improves predictably as you increase data and model size — appear to hold in robotics too, though recent research suggests that in robotics, diversity of environments and objects matters more than raw demonstration count.
On-Board Computing: The Hardware Behind the Software
All of this AI needs somewhere to run. Modern humanoid robots carry powerful on-board computers, typically combining CPUs for general processing with GPUs or dedicated AI accelerators for running neural networks.
The computing architecture is usually distributed. High-level AI — the VLA model, language processing, vision — runs on the main compute module. Low-level motor control runs on dedicated microcontrollers positioned close to each joint, handling rapid feedback loops that need to execute thousands of times per second.
There's an inherent tension between model capability and on-board compute constraints. The most capable AI models are large and computationally expensive, but a humanoid robot has limited power, cooling, and space for processors. This is why some approaches use cloud computing for heavy AI processing, while others focus on developing smaller, more efficient models that can run entirely on-board. Google DeepMind's Gemini Robotics On-Device and Hugging Face's SmolVLA represent this push toward efficient, on-device AI.
Where Things Actually Stand
It's important to be honest about the current state of the art. The AI powering humanoid robots has advanced remarkably, but significant limitations remain.
Most deployed humanoid robots in early 2026 are performing narrow, specific tasks in controlled environments — moving totes in warehouses, loading parts on assembly lines, executing pre-defined logistics workflows. True general-purpose autonomy, where a robot can handle arbitrary tasks in unstructured environments, remains a research goal rather than a commercial reality.
Key limitations include:
- Reliability — Current AI systems work well most of the time, but "most of the time" isn't good enough when a robot is working alongside people. Edge cases, unexpected situations, and novel objects still frequently cause failures.
- Battery life — Most humanoid robots operate for only about two hours on a charge, which limits the complexity and duration of tasks they can perform.
- Dexterity — Fine motor control and tactile sensitivity still lag far behind human capabilities. Robots can pick up a box, but delicately threading a needle or handling a raw egg remains extremely challenging.
- Latency — Running large AI models in real time on embedded hardware is difficult. There's always a trade-off between model sophistication and response speed.
- Safety certification — AI systems that learn from data are inherently harder to certify as safe compared to traditional engineered systems with predictable behaviour. As of early 2026, no humanoid robot has achieved full functional safety certification for working cooperatively alongside people without physical barriers, though Agility Robotics has targeted this milestone for later in the year.
What to Watch Next
The AI side of humanoid robotics is moving fast. Several trends will shape the next phase:
Larger and more capable VLAs will continue to improve, likely following scaling patterns similar to those seen in language models. Expect models that can handle longer task horizons, more complex reasoning, and more diverse environments.
World models — AI systems that build internal simulations of how the physical world works — are an emerging area. NVIDIA's Cosmos platform and similar efforts aim to give robots a predictive understanding of physics, so they can anticipate what will happen before they act.
Multi-robot coordination, where fleets of humanoid robots share learned skills and coordinate tasks in real time, is moving from research into early deployment. Boston Dynamics and others are committing fleets to production environments where robots work together.
On-device efficiency will become increasingly important as robots move from lab settings to real-world deployment, where cloud connectivity can't always be guaranteed and latency matters.
The gap between impressive demos and reliable, everyday deployment remains the central challenge. But the trajectory is clear: AI is what's transforming humanoid robots from impressive engineering projects into potentially useful machines. The hardware was always waiting for the brain to catch up. That's now happening — faster than most people expected.
This article is part of Droid Brief's Resources section, a comprehensive reference library covering humanoid robotics from fundamentals to frontier research.