
The argument that true intelligence may require physical interaction with the world — and what that means for humanoid robotics.
For decades, the dominant paradigm in artificial intelligence treated intelligence as a purely computational affair. The brain was software; the body was incidental hardware. Feed an algorithm enough data, give it enough processing power, and intelligence would emerge — no limbs, no skin, no contact with the physical world required. Large language models, chess engines, and image classifiers seemed to confirm this view: disembodied systems achieving superhuman performance in one narrow domain after another.
But a growing body of evidence from robotics, cognitive science, neuroscience, and philosophy is challenging that assumption. Embodied AI — the idea that intelligence is shaped by, and may ultimately require, physical interaction with an environment — has moved from a fringe philosophical position to one of the most active frontiers in robotics research. And it sits at the very heart of why companies are investing billions in building humanoid robots rather than simply making smarter chatbots.
The Embodiment Hypothesis: Where the Idea Began
The intellectual roots of embodied AI stretch back further than most people realise. In the mid-twentieth century, the French philosopher Maurice Merleau-Ponty argued that consciousness and cognition are inseparable from the body. In his landmark 1945 work Phenomenology of Perception, he made the case that our understanding of the world is not constructed from abstract representations but from the lived experience of having a body that moves through, touches, and is touched by its environment. Perception, in his view, is not a passive process of data collection — it is an active, embodied dialogue with the world.
These ideas remained largely confined to philosophy departments until the late twentieth century, when roboticist Rodney Brooks at MIT began putting them into practice. In the late 1980s, Brooks rejected the then-dominant approach to AI — the "sense-model-plan-act" paradigm, in which a robot builds an internal symbolic model of the world and reasons over it before acting. Instead, Brooks built robots with what he called a subsumption architecture: simple sensor-to-behaviour connections that allowed machines to react directly to their environment without constructing elaborate internal maps. His robots were not brilliant thinkers, but they could navigate cluttered rooms, avoid obstacles, and adapt to changing surroundings in ways that Shakey, a ponderous early AI robot from the 1960s, never could.
Brooks's work demonstrated a principle that philosophers had been arguing for decades: that intelligence does not have to live exclusively inside the head. It can be distributed across brain, body, and environment — emerging from the interaction between all three.
The formal "embodiment hypothesis" was articulated in 2005 by developmental psychologist Linda Smith, who proposed that thinking and learning are fundamentally shaped by the ongoing interactions between a body and its surroundings. Her work built on that of Francisco Varela, Evan Thompson, and Eleanor Rosch, whose 1991 book The Embodied Mind helped bridge the gap between phenomenology and cognitive science by arguing that organisms actively "enact" their worlds through bodily engagement rather than passively representing a pre-existing reality.
Moravec's Paradox: Why Easy Things Are Hard
Perhaps the most intuitive illustration of why embodiment matters comes from Moravec's paradox — an observation made independently in the 1980s by roboticist Hans Moravec, Rodney Brooks, and AI pioneer Marvin Minsky. The paradox, as Moravec stated it, is that the tasks humans consider intellectually demanding — playing chess, solving algebraic equations, scoring well on IQ tests — are comparatively easy to implement in machines, while the tasks we consider trivially easy — walking across a room, recognising a face, catching a ball — are extraordinarily difficult.
The explanation, Moravec argued, lies in evolutionary history. Sensorimotor abilities such as perception, balance, and locomotion have been refined by hundreds of millions of years of natural selection. They are implemented by the brain's largest and most computationally dense neural structures. Abstract reasoning, by contrast, is an evolutionary novelty — perhaps only a hundred thousand years old — and its neural machinery is correspondingly rudimentary. We find abstract reasoning hard precisely because it is poorly optimised by evolution. Machines find sensorimotor tasks hard because those tasks encode a billion years of accumulated biological intelligence that we have barely begun to understand, let alone replicate.
This paradox is not merely a curiosity. It explains why industrial robots conquered factories decades ago — performing repetitive, precisely defined motions in tightly controlled environments — but still cannot fold a towel, load a dishwasher, or navigate a cluttered living room. It explains why a chatbot can pass a law exam but cannot pour a glass of water. And it explains why the current wave of embodied AI research feels so significant: for the first time, advances in hardware, simulation, and machine learning are making it possible to tackle the sensorimotor side of intelligence at scale.
What Disembodied AI Cannot Do
Consider what happens when you pick up a mug of coffee. You don't consciously calculate its weight, estimate the friction coefficient of the handle, or model the fluid dynamics of the liquid inside. Yet your hand adjusts its grip force dynamically as you lift, compensates for the shifting weight of the liquid as you move, and modulates pressure with exquisite precision to avoid both dropping the mug and crushing it. You integrate visual information, tactile feedback, proprioceptive awareness of your hand's position in space, and predictions about the mug's behaviour — all in real time, all without conscious thought.
A disembodied AI — no matter how sophisticated — cannot learn this skill from text or images alone. It can describe how to hold a mug. It can identify a mug in a photograph. But without a body that interacts with physical objects, it has no way to learn the feel of grip force, the sensation of a shifting centre of mass, or the fine-grained sensorimotor feedback loop that makes the action possible. The knowledge is, in a fundamental sense, bodily knowledge.
This is not a trivial limitation. The physical world is governed by a kind of complexity that resists purely computational approaches. Objects have friction, mass, flexibility, and texture. Surfaces are uneven. Lighting changes. Other people and animals move unpredictably. An AI system that has only ever processed data about the world — rather than acting within it — lacks the grounded understanding needed to operate reliably in unstructured environments.
As robotics researcher Sami Haddadin of the Technical University of Munich has noted, the physical world is fundamentally unpredictable, and an AI that exists within it must account for a constant state of change. There is no fixed solution to embodied AI, he argues, because the environment, like the learning process itself, is never fully explored and constantly evolving.
The Learning Advantage of Having a Body
One of the most compelling arguments for embodied AI is that bodies are not just tools for executing commands — they are instruments for learning. A body provides something that no dataset can fully replicate: real-time, closed-loop sensory feedback from interactions with a physical environment.
When a robot attempts to walk and stumbles, that failure generates a rich stream of data — joint torques, accelerometer readings, contact forces, visual displacement — that informs the next attempt. Over thousands of iterations, the robot develops a walking policy not from abstract principles of bipedal mechanics but from direct experience of what works and what doesn't in a specific body interacting with a specific surface. This is fundamentally different from learning to walk in the abstract.
The same principle applies to manipulation. A robot learning to grasp objects discovers through physical trial and error that a soft tomato requires a different grip strategy than a rigid metal bolt. It learns that transparent objects confuse vision systems and must be located partly through touch. It discovers that stacking objects demands not just positional accuracy but an understanding of weight distribution and friction — knowledge that is difficult to encode in advance but emerges naturally from embodied practice.
This learning-through-doing has a name in cognitive science: sensorimotor coupling. Perception and action are not sequential steps in a pipeline but tightly intertwined processes that inform each other continuously. The act of reaching for an object is also an act of perceiving it — the motor command shapes the sensory experience, and the sensory experience shapes the next motor command. This coupling is one of the key reasons that embodied systems can develop a kind of adaptive fluency that pure software systems struggle to achieve.
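To make the coupling concrete, here is a minimal sketch in Python: a simulated gripper tightens whenever its (fake) tactile sensor reports slip and relaxes slightly when the object is stable, so each action changes what is sensed next. The slip model, gains, and force values are illustrative assumptions rather than parameters from any real gripper.

```python
import random

def read_slip(grip_force: float, required_force: float) -> float:
    """Fake tactile sensor: reports slip when grip is below what the
    object needs, plus a little measurement noise."""
    slip = max(0.0, required_force - grip_force)
    return slip + random.gauss(0.0, 0.05)

def hold_object(required_force: float, cycles: int = 200) -> float:
    """Closed perception-action loop: sense slip, adjust grip, repeat."""
    grip = 1.0                      # start with a light, cautious grip
    gain = 0.5                      # how aggressively to react to slip
    for _ in range(cycles):
        slip = read_slip(grip, required_force)   # perceive
        grip += gain * max(slip, 0.0)            # act: tighten if slipping
        grip = max(grip - 0.01, 0.0)             # relax slightly when stable
    return grip

# The same loop settles on very different grips for a soft object and a
# heavy one, without either strategy being programmed in explicitly.
print(hold_object(required_force=2.0))   # converges near a light grasp
print(hold_object(required_force=8.0))   # converges near a firm grasp
```

The point of the sketch is structural rather than numerical: perception and action alternate so quickly that neither can be understood in isolation, which is exactly what sensorimotor coupling describes.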
Multimodal Learning and the Richness of Physical Experience
Embodied AI also benefits from what researchers call multimodal learning — the integration of information from multiple sensory channels simultaneously. A human handling an unfamiliar object combines vision (its shape, colour, and texture), touch (its surface roughness, temperature, and weight), proprioception (the position and effort of the fingers), and sometimes even sound (does it rattle? Is it hollow?) into a unified percept. Each modality provides information the others cannot, and the combination is far richer than any single channel alone.
This is precisely what embodied AI systems are beginning to replicate. Modern humanoid robots are equipped with cameras, force/torque sensors, tactile arrays, inertial measurement units, and joint encoders — all feeding data simultaneously into neural networks that learn to fuse these streams into coherent representations of the world. A robotic hand learning to assemble components does not rely solely on camera images; it also senses the resistance and weight of parts as it works, detects misalignments through force feedback, and adjusts its grip based on tactile contact patterns.
This multimodal richness is one of the strongest arguments for embodiment. A system that can see, touch, and move through the world has access to a fundamentally different — and arguably deeper — kind of understanding than one that can only read about it or look at pictures.
Foundation Models Meet Physical Bodies
The current excitement around embodied AI is driven largely by a convergence: the rapid maturation of foundation models (large language models, vision-language models, and their multimodal successors) combined with increasingly capable robot hardware and high-fidelity simulation environments.
For years, foundation models in robotics relied primarily on vision-language pre-training — transferring the semantic understanding of large multimodal models into robot control systems. This allowed robots to understand natural-language instructions and recognise objects, but it did not solve the deeper problem of physical interaction. A robot could understand the command "pick up the red mug" but still struggled with the mechanics of actually doing so.
A new generation of models is attempting to close this gap. Vision-language-action (VLA) models are trained not just on text and images but on physical interaction data — recordings of robots grasping, lifting, pouring, assembling, and navigating. These models aim to learn the kind of sensorimotor knowledge that has traditionally been the province of handcrafted controllers and narrow reinforcement learning policies.
Companies such as Physical Intelligence, Google DeepMind, and Generalist AI are building what they call embodied foundation models — large-scale systems trained on diverse real-world manipulation data spanning homes, warehouses, factories, and other environments. Generalist AI's GEN-0, announced in late 2025, is trained on what the company describes as orders of magnitude more real-world manipulation data than existing public robotics datasets, covering tasks from peeling potatoes to threading bolts. The architecture is designed to capture not just visual understanding but physical common sense — the intuitive grasp of how objects behave when pushed, stacked, poured, or dropped.
Meanwhile, robotics-specific multimodal models like RoboBrain and Gemini Robotics are being designed to tightly couple perception, affordance understanding, and long-horizon reasoning — the ability to plan sequences of physical actions that unfold over minutes rather than milliseconds. This is a critical capability for humanoid robots, which must not only execute individual manipulation tasks but chain them together into coherent, goal-directed behaviour in complex environments.
Simulation: Building Bodies in Virtual Worlds
One of the practical breakthroughs enabling embodied AI is the development of high-fidelity physics simulation platforms that allow robots to train in virtual environments before being deployed in the real world. This approach — often called sim-to-real transfer — addresses one of the fundamental bottlenecks in robot learning: the physical world is slow, expensive, and unforgiving. A robot that learns by trial and error in the real world risks damaging itself, its surroundings, or the people nearby.
Simulation sidesteps these constraints. Platforms like NVIDIA's Isaac Sim and Isaac Lab, the open-source MuJoCo physics engine, and PyBullet allow researchers to run thousands of robot instances in parallel, each exploring different strategies for walking, grasping, or navigating — all at speeds far exceeding real time. NVIDIA's Isaac Lab, for example, combines GPU-accelerated physics, photorealistic rendering, and support for multiple robot types — including humanoids, manipulators, and mobile robots — within a single extensible framework.
The "sim-first" approach has become something close to standard practice in humanoid robotics. Tesla's Optimus, NVIDIA's GR00T humanoid platform, and numerous startups train their robots extensively in simulation before transferring learned behaviours to physical hardware. The challenge — known as the sim-to-real gap — is ensuring that policies learned in the clean, deterministic world of simulation transfer reliably to the noisy, unpredictable physical world. Techniques such as domain randomisation (deliberately varying simulation parameters to expose the robot to a wide range of conditions) and mixed sim-real training pipelines (combining synthetic and real-world data) are helping to narrow this gap.
In a striking example of how far these methods have come, a recent healthcare robotics project using NVIDIA Isaac for Healthcare demonstrated a full pipeline from simulation to deployment on real surgical-assistant hardware, with over 93 per cent of the training data generated synthetically in simulation. This kind of data efficiency — achieving real-world competence primarily from virtual experience — is a major step towards scalable embodied intelligence.
Why the Humanoid Form?
If embodiment is the key, why specifically a human-shaped body? The argument is both practical and philosophical.
The practical case is straightforward: the human world is built for human bodies. Door handles are at human hand height. Stairs are sized for human legs. Tools are designed for human grips. Workstations, vehicles, kitchens, and hospitals all assume a roughly human-shaped operator. A humanoid robot can, in principle, operate in any environment designed for people without requiring expensive modifications to the infrastructure. A wheeled robot cannot climb stairs. A quadruped without an arm cannot turn a door handle. A fixed robotic arm cannot walk between rooms. The humanoid form factor is, in essence, the most general-purpose body plan for operating in human environments.
The philosophical case goes deeper. If intelligence is shaped by embodiment — if the kind of body you have influences the kind of intelligence you develop — then a human-like body may be the most direct path to human-compatible intelligence. A robot that interacts with the world through hands, arms, and bipedal locomotion may develop representations and strategies that are more naturally aligned with human cognition than those of a radically different body plan. This is not to say that non-humanoid robots are unintelligent, but that the intelligence they develop will be fundamentally shaped by their form — just as the intelligence of a bird is shaped by flight and the intelligence of an octopus is shaped by its flexible, distributed body.
This idea has deep roots. Hubert Dreyfus, the philosopher most associated with critiques of classical AI, argued throughout his career that human-level intelligence emerges from human-level embodiment — that neural networks trained on sensory-motor data could not develop human-like understanding until they were housed in bodies with a structure like ours. The form of the body, in this view, is not incidental to intelligence. It is constitutive of it.
The Challenges Ahead
For all its promise, embodied AI remains extraordinarily difficult. Several fundamental challenges persist:
- The sim-to-real gap — Policies trained in simulation still frequently fail when confronted with the messiness of the real world: unexpected textures, lighting conditions, object properties, and human behaviour. Closing this gap completely remains an open research problem.
- Long-horizon reasoning — While current systems can perform individual manipulation tasks with increasing reliability, chaining these into extended, multi-step behaviours — making a meal, tidying a room, assisting with a medical procedure — requires a level of planning and adaptation that remains largely unsolved in unstructured settings.
- Sample efficiency — Despite advances in simulation, training embodied systems remains extraordinarily data-hungry. Biological systems learn from remarkably few examples (a child learns to catch a ball in a handful of attempts), while robots typically require thousands or millions of trials. Narrowing this gap is a major focus of current research.
- Safety and robustness — A robot that operates in close proximity to humans must be not just capable but predictably safe. The unpredictability of learned policies — particularly those trained through reinforcement learning — creates challenges for certification and deployment in sensitive environments like healthcare, elder care, and domestic settings.
- Energy and hardware constraints — Running large foundation models on board a mobile robot, in real time, with limited battery power, remains a significant engineering challenge. Edge computing, model compression, and specialised AI accelerator chips are all active areas of development; a brief sketch of model compression follows this list.
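To give a flavour of what model compression involves in practice, the sketch below applies post-training dynamic quantisation to a toy policy network using PyTorch's built-in utility, converting its linear-layer weights to 8-bit integers. The toy network and its sizes are illustrative assumptions; real deployments typically combine quantisation with pruning, distillation, and hardware-specific compilation.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(            # stand-in for a much larger control network
    nn.Linear(256, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 7),
)

# Convert the Linear layers' weights to 8-bit integers; activations are
# quantised dynamically at inference time.
quantised = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

obs = torch.randn(1, 256)
print(quantised(obs).shape)        # same interface, smaller memory footprint
```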
What Embodied AI Means for the Future of Humanoid Robotics
The embodied AI thesis has profound implications for the trajectory of humanoid robotics. If it is correct — if truly general, adaptable intelligence requires physical engagement with the world — then building better chatbots and better robots are not separate enterprises. They are two faces of the same problem.
The current generation of humanoid robots — from Boston Dynamics' Atlas to Tesla's Optimus, from Figure's Figure 02 to Agility Robotics' Digit — are, in this framing, not just engineering projects. They are experiments in embodied intelligence: platforms for testing whether physical interaction can produce the kind of robust, adaptable, general-purpose intelligence that disembodied AI has so far failed to deliver.
The stakes are significant. If embodied AI delivers on its promise, the implications extend far beyond robotics. It would suggest that the path to more capable AI runs not through ever-larger language models alone but through the integration of those models with physical bodies — machines that can learn by doing, adapt through experience, and develop an intuitive understanding of the physical world that no amount of text training can provide.
Whether embodied AI proves to be the key that unlocks general-purpose robotics — or merely one important piece of a much larger puzzle — it has already reshaped how researchers and engineers think about intelligence itself. The body, it turns out, is not just a vehicle for the mind. It may be a necessary condition for it.