
Every child learns to walk by falling down. They wobble, topple, adjust, and try again — thousands of times — until the complex interplay of balance, momentum, and coordination becomes second nature. Reinforcement learning, or RL, applies a strikingly similar principle to humanoid robots: let them fail, reward what works, penalise what doesn't, and allow a capable enough algorithm to discover solutions that no human engineer could hand-design.
It is arguably the single most important AI technique driving the current generation of humanoid robots. And the results — machines that can walk over hiking trails, play soccer, recover from being shoved, and manipulate objects with dexterous hands — are accelerating faster than almost anyone predicted.
What Is Reinforcement Learning?
At its core, reinforcement learning is a branch of machine learning in which an agent (the robot, or more precisely its control policy) learns to make decisions by interacting with an environment. The agent takes an action, observes the result, receives a numerical reward or penalty, and adjusts its behaviour to maximise cumulative reward over time.
The framework is formally known as a Markov Decision Process (MDP). At each time step, the agent observes the current state of the world, selects an action from its available repertoire, transitions to a new state, and receives a reward signal. Over thousands — or more typically, billions — of iterations, the agent learns a policy: a mapping from observed states to optimal actions.
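The observe → act → reward → update loop can be made concrete with tabular Q-learning on a toy MDP. The three-state chain below is invented purely for illustration; real humanoid policies are neural networks over continuous states, but the learning mechanics are the same.

```python
import random

# Toy MDP (hypothetical): states 0-2 on a chain, actions "left"/"right".
# Reaching state 2 ends the episode with reward 1; every other step pays 0.
def step(state, action):
    next_state = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward, next_state == 2

actions = ["left", "right"]
q = {(s, a): 0.0 for s in range(3) for a in actions}  # learned value table
alpha, gamma, epsilon = 0.5, 0.9, 0.1

random.seed(0)
for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        target = reward + gamma * max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state

# The policy is the greedy mapping from states to actions.
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in range(3)}
print(policy)  # after training, states 0 and 1 should both map to "right"
```

The agent is never told that moving right is correct; the mapping emerges from the reward signal alone.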
What makes RL particularly well-suited to robotics is that it does not require explicit programming of every possible scenario. Instead of a human engineer writing rules for how a robot should respond to being pushed on a slope while carrying a load, the RL agent discovers its own solutions through experience. Many of these solutions are more robust and more efficient than anything a human designer would think to code.
Why Classical Control Isn't Enough
For decades, humanoid robots have been controlled using model-based methods — approaches that rely on precise mathematical models of the robot's dynamics. Techniques such as zero moment point (ZMP) control and model predictive control (MPC) have produced impressive demonstrations, most famously in Boston Dynamics' Atlas robot performing parkour routines.
These classical approaches work well in controlled environments, but they have fundamental limitations. They require accurate models of the robot and its surroundings, which are difficult to build and impossible to generalise across every situation a robot might encounter. They demand significant human expertise to design and tune. And they tend to be brittle: a controller painstakingly optimised for flat concrete may fail on gravel, grass, or a wet floor.
Reinforcement learning offers a different philosophy. Rather than encoding explicit knowledge about physics and dynamics, the system discovers effective control strategies from raw experience. As research has increasingly demonstrated, RL-trained controllers can match or surpass hand-engineered systems — and they generalise far better to environments they've never seen before.
From Simulation to Reality: The Sim-to-Real Pipeline
Training a humanoid robot through trial and error in the physical world is, for now, largely impractical. A full-size humanoid falling over repeatedly would damage itself, its surroundings, and anyone nearby. Training cycles that require billions of attempts would take years on a physical platform. The solution is to train in simulation first and then transfer the learned behaviour to a real robot — a process known as sim-to-real transfer.
The role of physics simulators
Modern physics simulators such as NVIDIA Isaac Sim, MuJoCo, and PyBullet can model a robot's body, its joint dynamics, the forces of gravity and friction, and its interactions with the environment with high fidelity. Crucially, these simulators can run thousands of parallel instances simultaneously, allowing an RL agent to accumulate the equivalent of years of physical experience in a matter of hours.
The scale of this parallelisation is staggering. A typical RL training run for humanoid locomotion might involve tens of thousands of simulated robots learning simultaneously across a cluster of GPUs, collectively taking hundreds of millions of steps before a viable policy emerges.
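The arithmetic behind this parallelism can be sketched with a batched, vectorised environment in NumPy. The dynamics and reward below are stand-ins, not a real physics engine; the point is that one array operation advances every simulated robot at once.

```python
import numpy as np

num_envs = 4096  # thousands of simulated robots stepped in lockstep

rng = np.random.default_rng(0)
# Each environment's state is a small vector (stand-in for joint positions etc.).
states = np.zeros((num_envs, 8))

def step_batch(states, actions):
    """Advance all environments one timestep in a single vectorised operation."""
    next_states = states + 0.01 * actions            # placeholder dynamics
    rewards = -np.linalg.norm(next_states, axis=1)   # placeholder reward
    return next_states, rewards

total_steps = 0
for _ in range(100):                      # 100 sequential iterations...
    actions = rng.normal(size=(num_envs, 8))
    states, rewards = step_batch(states, actions)
    total_steps += num_envs               # ...but num_envs experience steps each

print(total_steps)  # 409600 steps of experience from only 100 iterations
```

On GPU-backed simulators the same pattern runs tens of thousands of instances, which is how "years of experience in hours" becomes possible.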
The reality gap
The catch is that no simulator perfectly replicates the real world. Friction behaves differently on real surfaces than in simplified physics models. Actuators have latencies and imprecisions that are hard to model exactly. Soft contacts, deformable surfaces, and aerodynamic effects are approximated at best. This discrepancy between simulated and real-world physics is known as the reality gap, and it is one of the central challenges of the entire field.
A policy that performs flawlessly in simulation may stumble, fall, or behave erratically on a physical robot. Closing this gap — or making policies robust enough to tolerate it — is where much of the current research effort is concentrated.
Domain randomisation
One of the most effective techniques for bridging the reality gap is domain randomisation. Rather than trying to make the simulator perfectly match reality, engineers deliberately vary the simulation parameters across a wide range during training. Friction coefficients, masses, motor strengths, joint stiffness, sensor noise levels, and even the physical dimensions of the robot itself are randomised from one training episode to the next.
The logic is elegant: if the RL agent can learn to perform well across a huge variety of simulated conditions, the real world becomes just one more variation in the distribution. The agent never sees the exact real-world parameters during training, but it has learned to be robust to uncertainty. When transferred to a physical robot, the policy adapts on the fly to whatever conditions it encounters.
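A minimal sketch of per-episode randomisation looks like the following; the parameter names and ranges are invented for illustration, and a real pipeline would feed them into the simulator's reset.

```python
import random

def randomise_physics(rng):
    """Sample a fresh set of simulator parameters for each training episode."""
    return {
        "friction":      rng.uniform(0.4, 1.2),   # ground friction coefficient
        "torso_mass_kg": rng.uniform(8.0, 12.0),  # vary mass around a nominal value
        "motor_gain":    rng.uniform(0.8, 1.2),   # actuator strength multiplier
        "sensor_noise":  rng.uniform(0.0, 0.05),  # std-dev of observation noise
    }

rng = random.Random(42)
for episode in range(3):
    params = randomise_physics(rng)
    # env.reset(**params)  # a real pipeline would rebuild the simulation here
    print(episode, {k: round(v, 3) for k, v in params.items()})
```

Because the policy must succeed under every draw from these ranges, it cannot overfit to any single set of physics parameters, including the real world's.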
OpenAI's landmark work on dexterous manipulation — training a robotic hand to solve a Rubik's Cube — was a pivotal early demonstration of domain randomisation at scale. Since then, the technique has become a standard ingredient in almost every successful sim-to-real transfer pipeline.
Zero-shot transfer
The holy grail of sim-to-real is zero-shot transfer: deploying a simulation-trained policy onto a real robot with no additional fine-tuning. Remarkably, several recent projects have achieved this. In 2024, a UC Berkeley team led by Ilija Radosavovic demonstrated a fully learning-based locomotion controller for Agility Robotics' Digit humanoid that transferred to the real world zero-shot. The robot walked indoors and outdoors across varied terrain, adapted to external perturbations, and even exhibited an emergent arm-swinging gait — a behaviour the researchers never explicitly programmed, which the system discovered on its own because it improved stability.
In follow-up work later that year, the same team's controller — trained using a combination of sequence modelling and RL fine-tuning — enabled a Digit robot to complete over four miles of real hiking trails in the Berkeley hills and climb some of San Francisco's steepest streets, all using a single neural network with no terrain-specific adjustments.
Key Algorithms and Approaches
Proximal Policy Optimisation (PPO)
PPO, developed by OpenAI, is one of the most widely used RL algorithms in robotics. It strikes a practical balance between training stability and sample efficiency by limiting how much the policy can change in a single update step. Its relative simplicity and reliability have made it a default choice for many locomotion and manipulation tasks.
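The clipping at the heart of PPO can be shown directly. The surrogate objective compares the new policy's probability for an action against the old policy's, and caps how much that ratio can contribute to the update:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(ratio * advantage, clipped * advantage)

# Probability ratios new_policy / old_policy for a batch of actions.
ratios = np.array([0.5, 1.0, 1.5, 3.0])
advantages = np.ones(4)

print(ppo_clip_objective(ratios, advantages))
# With positive advantage, ratios beyond 1 + epsilon earn no extra objective:
# [0.5  1.   1.2  1.2]
```

Because gradients vanish outside the clip range, no single update can drag the policy far from its predecessor, which is the source of PPO's training stability.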
Soft Actor-Critic (SAC)
SAC adds an entropy bonus to the standard RL objective, encouraging the agent to explore diverse strategies rather than collapsing onto a single solution too early. This makes it particularly effective for tasks requiring nuanced, smooth control — qualities that matter enormously when a robot needs to handle objects gently or navigate uneven ground without jarring movements.
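The entropy bonus can be sketched as an extra term in the objective: the agent is paid the task reward plus a temperature-weighted entropy of its action distribution. The Gaussian-policy assumption and the temperature value below are illustrative.

```python
import numpy as np

def gaussian_entropy(std):
    """Differential entropy of a diagonal Gaussian policy, summed over action dims."""
    return np.sum(0.5 * np.log(2 * np.pi * np.e * std ** 2))

def soft_objective(reward, action_std, alpha=0.2):
    """SAC-style objective: task reward plus temperature-weighted policy entropy."""
    return reward + alpha * gaussian_entropy(action_std)

# A nearly deterministic policy earns a smaller entropy bonus than an exploratory one.
narrow = soft_objective(reward=1.0, action_std=np.full(12, 0.05))
wide   = soft_objective(reward=1.0, action_std=np.full(12, 0.5))
print(narrow < wide)  # True: exploration is explicitly rewarded
```

The temperature alpha trades off task performance against exploration; SAC implementations typically tune it automatically during training.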
Curriculum learning
Rather than throwing the full complexity of a task at an RL agent from the start, curriculum learning gradually increases difficulty as the agent improves. A locomotion agent might first learn to stand, then walk on flat ground, then handle gentle slopes, and finally tackle rough terrain with obstacles. This staged approach dramatically improves training efficiency and helps avoid the agent getting stuck in poor local solutions early on.
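A curriculum scheduler for the staged progression just described can be as simple as a success-rate gate; the stage names and 80% threshold below are illustrative.

```python
# A minimal curriculum scheduler (stages and threshold are illustrative).
STAGES = ["stand", "walk_flat", "gentle_slopes", "rough_terrain"]

def next_stage(stage_idx, recent_success_rate, threshold=0.8):
    """Advance to the next stage once the agent masters the current one."""
    if recent_success_rate >= threshold and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx

stage = 0
for success_rate in [0.3, 0.6, 0.85, 0.9, 0.82]:  # simulated evaluation results
    stage = next_stage(stage, success_rate)
    print(STAGES[stage])
```

More sophisticated variants adapt terrain parameters continuously, but the principle is the same: difficulty tracks competence.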
Teacher-student training
A common pattern in sim-to-real RL involves a two-phase process. In the first phase, a "teacher" policy is trained in simulation with access to privileged information — perfect knowledge of terrain height, contact forces, and other data that a real robot's sensors cannot directly observe. In the second phase, a "student" policy is trained to replicate the teacher's behaviour using only the limited, noisy sensor data that would be available on the physical platform. This distillation process yields policies that perform nearly as well as the privileged teacher while relying only on realistic inputs.
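The distillation phase can be sketched as supervised regression: the student sees only a partial, noisy version of what the teacher sees, and is trained to reproduce the teacher's actions. Everything below (linear policies, dimensions, noise scale) is a deliberately simplified stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
priv_dim, obs_dim, act_dim = 24, 16, 12  # privileged state is richer than sensors

# Stand-in teacher: a fixed linear policy over privileged observations
# (true terrain heights, contact forces, ...). Purely illustrative.
W_teacher = rng.normal(size=(priv_dim, act_dim)) * 0.1
W_student = np.zeros((obs_dim, act_dim))

lr, losses = 0.05, []
for step in range(500):
    priv = rng.normal(size=priv_dim)                                  # full privileged state
    sensors = priv[:obs_dim] + rng.normal(scale=0.05, size=obs_dim)   # partial, noisy view
    target = priv @ W_teacher                                         # teacher's action
    pred = sensors @ W_student                                        # student's imitation
    losses.append(np.mean((pred - target) ** 2))
    # One SGD step on the squared imitation loss.
    W_student -= lr * np.outer(sensors, pred - target)

# Imitation loss falls as the student learns to mimic the teacher from sensors alone.
print(round(np.mean(losses[:50]), 4), round(np.mean(losses[-50:]), 4))
```

The residual loss that remains reflects information the sensors simply cannot recover, which is why the student typically performs slightly below the privileged teacher.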
Real-World Breakthroughs
Soccer-playing humanoids
Google DeepMind's soccer project is one of the most vivid demonstrations of what RL can achieve in the physical world. The team trained small humanoid robots to play one-on-one soccer using deep RL, with training conducted entirely in MuJoCo simulation before zero-shot transfer to real hardware. The resulting agents didn't just walk and kick — they learned to recover from falls, anticipate the ball's trajectory, defend their goal using their bodies as shields, and adapt their footwork to the game situation. In testing, the RL-trained robots walked 181% faster, turned 302% faster, and kicked 34% faster than robots running conventional scripted controllers.
What made this work particularly instructive was the unexpected behaviours the agents discovered. Without being told to, they learned context-dependent tactics: taking shorter steps when approaching an attacker with the ball, using wider stances for stability during defensive positioning, and developing a distinctive spinning turn that proved highly efficient for rapid direction changes. These emergent behaviours illustrate a key strength of RL — it can discover solutions that human designers wouldn't think to engineer.
Locomotion across challenging terrain
The UC Berkeley humanoid locomotion project, mentioned above, pushed the boundaries of what a learning-based controller can handle in real outdoor environments. Using a causal transformer architecture — effectively treating robot control as a sequence prediction problem, similar in spirit to large language models — the team's controller demonstrated robust walking across surfaces including loose gravel, wet mud, grass, sand, wood chips, and steep paved roads with gradients exceeding 31%. Many of these surfaces were never encountered during training.
Dexterous manipulation
Walking is only half the challenge. For humanoid robots to be genuinely useful, they need capable hands. RL is increasingly being applied to dexterous manipulation — the ability to grasp, rotate, and precisely handle objects using multi-fingered robotic hands. NVIDIA's AutoMate framework, for example, used a combination of RL and imitation learning to train robots to assemble geometrically diverse parts, achieving zero-shot sim-to-real transfer with an 84.5% success rate across 20 different assembly tasks.
Sit-to-stand and everyday movements
Beyond locomotion and manipulation, researchers are using RL to tackle the full spectrum of movements a humanoid robot needs in daily life. Recent work has applied two-stage RL frameworks to teach adult-scale humanoid robots to sit down onto chairs and stand back up — a seemingly simple action that involves complex dynamics, shifting weight distribution, and the need to maintain balance throughout the transition. These capabilities are essential for robots operating in human environments, where they need to transition between different postures and modes of interaction.
The Reward Design Challenge
If RL agents learn by maximising reward, then everything hinges on designing the right reward function. This is harder than it sounds, and poor reward design is one of the most common failure modes in robot RL.
Sparse vs. dense rewards
A sparse reward — for example, giving a score only when the robot successfully reaches a destination — provides very little learning signal. The agent must essentially stumble upon success by chance before it can begin optimising. A dense reward, which provides continuous feedback based on progress toward the goal, speed, energy efficiency, and stability, gives the agent much more information to learn from. Most practical robotics RL systems use carefully designed dense reward functions that combine multiple objectives.
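The contrast is easy to see in code. The goal position, coefficients, and state variables below are invented for illustration:

```python
import numpy as np

GOAL = np.array([5.0, 0.0])  # hypothetical target position

def sparse_reward(position):
    """Signal only at success: almost no gradient for learning to follow."""
    return 1.0 if np.linalg.norm(position - GOAL) < 0.1 else 0.0

def dense_reward(position, velocity, joint_torques):
    """Continuous feedback: progress, speed, and effort all shape the signal."""
    progress = -np.linalg.norm(position - GOAL)       # closer is better
    speed_bonus = 0.1 * velocity[0]                   # reward forward velocity
    effort_cost = 0.001 * np.sum(joint_torques ** 2)  # penalise wasted energy
    return progress + speed_bonus - effort_cost

pos, vel, torques = np.array([2.0, 0.0]), np.array([0.5, 0.0]), np.zeros(12)
print(sparse_reward(pos), round(dense_reward(pos, vel, torques), 3))
# The sparse signal is 0 everywhere short of the goal; the dense one always
# tells the agent whether it is doing better or worse.
```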
Reward shaping
Reward shaping adds supplementary reward signals to guide learning without changing the optimal policy. For a walking robot, shaped rewards might include bonuses for maintaining an upright posture, penalties for excessive joint torque (which strains actuators), and incentives for smooth, energy-efficient movements. Getting this balance right is a genuine art — an improperly shaped reward can produce unexpected and undesirable behaviours.
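A shaped reward for walking might combine exactly those terms as a weighted sum. The weights below are invented; finding values that balance the objectives is the tuning art the text describes.

```python
import numpy as np

# Illustrative weights; getting these right is the hard part in practice.
WEIGHTS = {"velocity": 1.0, "upright": 0.5, "torque": -0.0005, "smooth": -0.01}

def shaped_reward(forward_vel, torso_pitch, torques, action, prev_action):
    """Weighted sum of task progress, posture, actuator strain, and smoothness."""
    terms = {
        "velocity": forward_vel,                          # main task objective
        "upright":  np.cos(torso_pitch),                  # 1.0 when perfectly upright
        "torque":   np.sum(torques ** 2),                 # strain on the actuators
        "smooth":   np.sum((action - prev_action) ** 2),  # discourage jerky commands
    }
    return sum(WEIGHTS[k] * v for k, v in terms.items())

r = shaped_reward(
    forward_vel=0.8, torso_pitch=0.1,
    torques=np.full(12, 5.0), action=np.full(12, 0.2), prev_action=np.full(12, 0.1),
)
print(round(r, 4))
```

Each penalty term steers the optimiser away from a specific failure mode (falling over, burning out motors, chattering commands) without changing what counts as success.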
Reward hacking
RL agents are ruthless optimisers. If there is a loophole in the reward function, they will find it. A robot rewarded for forward velocity might learn to fling itself forward in an uncontrolled lunge. An agent rewarded for being near a ball might learn to vibrate in place rather than actually playing the game. In Google DeepMind's soccer project, the researchers found that without careful termination conditions, agents learned to roll along the ground toward the ball rather than walking — technically an effective strategy for scoring, but clearly not the intended behaviour.
Designing reward functions that produce desired behaviour without exploitable loopholes remains one of the field's enduring challenges. Emerging approaches include learning reward functions from human demonstrations, using large language models to help specify rewards in natural language, and hierarchical reward structures that automatically balance competing objectives.
Safety: The Non-Negotiable Constraint
When an RL agent controls a physical robot in a real environment — potentially alongside human workers — safety is not optional. A robot that explores freely in the way RL requires might try actions that damage itself, break objects, or harm people.
Safe exploration
Safe reinforcement learning is an active area of research that constrains the agent's exploration to avoid dangerous states. Approaches include constrained optimisation methods (where safety constraints are enforced alongside reward maximisation), control barrier functions that mathematically guarantee the robot will not enter unsafe regions, and Lyapunov-based methods that ensure the system remains stable throughout training.
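The constrained-optimisation idea can be sketched with a Lagrangian-style penalty: each step incurs a safety "cost" alongside its reward, and a multiplier rises while average cost exceeds a budget, making unsafe behaviour progressively more expensive. The cost curve and constants below are stand-ins.

```python
import numpy as np

COST_BUDGET = 0.05   # tolerated average safety cost per step (illustrative)
lam, lam_lr = 0.0, 0.1

def penalised_reward(reward, cost, lam):
    """Objective the agent actually optimises: reward minus scaled safety cost."""
    return reward - lam * cost

for epoch in range(50):
    avg_cost = 0.3 * np.exp(-0.1 * epoch)  # stand-in: the policy grows safer over time
    # Dual ascent on the multiplier: tighten while the constraint is violated,
    # relax (but never below zero) once the policy is within budget.
    lam = max(0.0, lam + lam_lr * (avg_cost - COST_BUDGET))

print(round(lam, 3))  # multiplier settles once average cost falls under budget
```

Control barrier functions and Lyapunov methods give harder guarantees than this soft penalty, but the same trade is being made: reward maximisation subject to staying out of unsafe states.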
Hardware protection
In practice, safety in physical robot RL is achieved through multiple layers of protection. Joint position and velocity limits prevent movements that could damage actuators. Force and torque limits cap the forces the robot can apply. Automated reset mechanisms detect falls and safely return the robot to a starting position. And human supervisors maintain the ability to instantly override the control policy via emergency stop systems.
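The first two layers amount to a safety filter between the policy and the motors, which might look like the sketch below. The limit values are invented; real ones come from the actuator datasheet.

```python
import numpy as np

# Illustrative limits; real values come from the hardware specification.
JOINT_POS_LIMIT = np.deg2rad(120.0)  # max joint angle, radians
JOINT_VEL_LIMIT = 8.0                # max joint velocity, rad/s
TORQUE_LIMIT = 40.0                  # max commanded torque, N·m

def safety_filter(target_pos, current_pos, dt, torque_cmd):
    """Clamp the policy's raw commands before they ever reach the motors."""
    safe_pos = np.clip(target_pos, -JOINT_POS_LIMIT, JOINT_POS_LIMIT)
    # Limit how far the target may move per control step (velocity limit).
    max_delta = JOINT_VEL_LIMIT * dt
    safe_pos = np.clip(safe_pos, current_pos - max_delta, current_pos + max_delta)
    safe_torque = np.clip(torque_cmd, -TORQUE_LIMIT, TORQUE_LIMIT)
    return safe_pos, safe_torque

# A wild policy output is reduced to a safe command.
pos, tq = safety_filter(
    target_pos=np.array([3.0]), current_pos=np.array([0.0]), dt=0.01,
    torque_cmd=np.array([250.0]),
)
print(pos, tq)  # position step capped at 0.08 rad, torque capped at 40 N·m
```

Because the filter sits outside the learned policy, it holds even when the network misbehaves, which is exactly the property a safety layer needs.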
The case for simulation-first training
The overwhelming emphasis on sim-to-real transfer in current research is itself partly a safety strategy. By conducting the vast majority of training in simulation, researchers avoid exposing physical robots and their surroundings to the chaotic early phases of RL training, where the agent's behaviour is essentially random. The real robot only encounters policies that have already been extensively validated in simulation — a much safer proposition than learning from scratch on hardware.
Current Limitations and Open Challenges
For all its promise, reinforcement learning for physical robots still faces substantial hurdles.
Sample efficiency. RL algorithms remain data-hungry. Training a locomotion policy can require billions of simulated time steps — feasible with modern GPU clusters, but still a significant computational expense. Improving sample efficiency would make RL practical for a wider range of tasks and organisations.
The sim-to-real gap persists. Domain randomisation and other transfer techniques have made enormous progress, but the gap between simulation and reality has not been eliminated. Tasks involving complex contact dynamics — such as manipulating soft, deformable objects or working with fluids — remain particularly challenging to simulate accurately.
Generalisation across tasks. Most RL policies today are trained for specific tasks: walking, grasping a particular object, playing soccer. A truly general-purpose humanoid robot would need to perform thousands of different tasks, and current RL approaches typically require separate training for each one. Foundation models for robotics — large, pre-trained models that can be fine-tuned for specific tasks — are a promising direction, but this remains an open research frontier.
Long-horizon reasoning. RL excels at learning reactive behaviours that operate on short time scales — adjusting balance in response to a push, for example. It is less well-suited to tasks that require planning over extended time horizons, such as navigating a cluttered room or executing a multi-step assembly process. Combining RL with higher-level planning and reasoning systems is an area of active investigation.
Interpretability. Neural network policies trained by RL are essentially black boxes. Understanding why a robot chose a particular action — critical for debugging, safety certification, and building trust with human colleagues — remains difficult. The lack of interpretability is a significant barrier to regulatory approval and deployment in safety-critical environments.
Where the Field Is Heading
The pace of progress in RL for humanoid robots is accelerating. Several trends are shaping the near-term trajectory of the field.
RL combined with imitation learning. Rather than learning entirely from scratch, robots increasingly learn a baseline behaviour from human demonstrations and then refine it using RL. This hybrid approach combines the efficiency of learning from examples with RL's ability to optimise beyond what any human demonstrator can do.
Large-scale pre-training. Inspired by the success of foundation models in language and vision, researchers are exploring whether similar approaches can work for robot control. Pre-training a large model on diverse robot experience data and then fine-tuning for specific tasks could dramatically reduce the training required for new capabilities.
Multi-task and transfer learning. Work is underway on RL systems that can learn multiple skills simultaneously and transfer knowledge between related tasks. A robot that has learned to walk could leverage that experience to learn to run, climb stairs, or carry objects far more efficiently than starting from scratch each time.
Real-world fine-tuning. Emerging frameworks such as Simulation-Guided Fine-Tuning (SGFT) use value functions trained in simulation to guide efficient exploration during real-world adaptation, requiring far fewer physical interactions than traditional approaches. This could enable robots to adapt quickly to specific deployment environments without full retraining.
Scaling up. With Goldman Sachs projecting global humanoid robot shipments to reach 51,000 units in 2026, demand for scalable, reliable RL training pipelines is growing rapidly. The field is moving from research demonstrations to industrial deployment — a transition that will demand higher standards of robustness, reproducibility, and safety.
Why This Matters
Reinforcement learning is not just one tool among many for controlling humanoid robots — it is increasingly the foundational approach. The ability of RL agents to discover novel solutions, generalise across environments, and handle the chaotic unpredictability of the physical world gives them a decisive advantage over hand-engineered controllers for real-world deployment.
The trajectory is clear: from small humanoid robots playing toy soccer in a lab, to full-size machines hiking trails and working in factories. The gap between what RL-trained robots can do in simulation and what they can do in the real world is closing rapidly. And as that gap narrows, the practical applications — in manufacturing, logistics, healthcare, domestic assistance, and beyond — become not just plausible, but inevitable.
The robots are learning. And they're learning fast.