THE EMBODIMENT GAP
Why $2.9 Trillion in Artificial Intelligence Infrastructure Cannot Build the Physical Intelligence Economy, and Who Captures the $15 Trillion That Can
By Shanaka Anslem Perera
December 29, 2025
I. THE CONFESSION BURIED IN PLAIN SIGHT
On December 18, 2025, Sam Altman sat for an interview with Big Technology and made an admission that should have stopped every capital allocator on Earth mid-sentence. Asked about OpenAI’s progress toward artificial general intelligence, the man who commands a $300 billion valuation and $1.4 trillion in compute commitments across eight years offered this assessment of his company’s capabilities: “Memory is still very crude, very early. We’re in the GPT-2 era of memory.”
The GPT-2 era. That would be 2019. Six years ago. Before the scaling laws that supposedly bent the arc of technological history. Before the $44 billion in cumulative losses. Before the largest private funding round in corporate history. Before the quarter-trillion-dollar compute commitment binding OpenAI and Microsoft in a partnership predicated on the assumption that language models were the path to machine superintelligence.
In a single sentence, Altman acknowledged what the world’s most sophisticated observers have been whispering in private channels for eighteen months: Large Language Models cannot remember. They cannot maintain persistent state. They cannot track objects through time or simulate how physical systems evolve or predict what happens when you push something off a table. They are, in the precise and devastating formulation of Stanford’s Fei-Fei Li, “wordsmiths in the dark, eloquent but inexperienced, knowledgeable but ungrounded.”
This is not a critique of ChatGPT’s chatbot capabilities. Within their domain, LLMs have achieved genuine marvel status. They write poetry that brings readers to tears. They debug code that would take human engineers hours. They synthesize information across disciplines with a fluency that makes research assistants obsolete. For digital tasks, they have earned every superlative.
But the capital markets have not valued OpenAI at $300 billion for chatbot excellence. They have valued it on the premise that language models are the substrate upon which artificial general intelligence will be built, that scaling these systems with sufficient compute and data will yield machines capable of operating in the physical world, that the path from GPT-4 to robots and autonomous vehicles and scientific discovery runs through ever-larger transformer architectures trained on ever-larger text corpora.
That premise is false. The evidence is now overwhelming. And the implications for the largest infrastructure buildout in industrial history are only beginning to crystallize in the minds of those who allocate capital at civilizational scale.
II. THE ARCHITECTURE OF INCOMPATIBILITY
To understand why trillions of dollars face reallocation, one must first understand what Large Language Models actually do at a fundamental level, stripped of the marketing language that has obscured their essential nature from public comprehension.
An LLM is an autoregressive sequence predictor. Given a string of tokens, it computes a probability distribution over what the next token should be. That is all it does. Every capability that has dazzled the world, from creative writing to mathematical reasoning to code generation, emerges from this single operation: predicting the statistically most likely next word given everything that came before.
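To make the mechanism concrete, the sketch below implements autoregressive next-token prediction in miniature, using pure Python and a hand-built bigram table rather than a trained transformer; the toy corpus and every probability it produces are illustrative only.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for "everything that came before."
corpus = "the cup is on the table . the cup falls off the table .".split()

# Count bigram transitions: how often each token follows each context token.
# A real LLM conditions on thousands of tokens through a transformer, but the
# output is the same kind of object: a distribution over the next token.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def next_token_distribution(context_token):
    """Return P(next token | context) as a dictionary of probabilities."""
    counts = transitions[context_token]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(token, steps=6):
    """Autoregressive generation: repeatedly append the most probable next token."""
    out = [token]
    for _ in range(steps):
        dist = next_token_distribution(out[-1])
        if not dist:
            break
        out.append(max(dist, key=dist.get))
    return " ".join(out)

print(next_token_distribution("the"))  # {'cup': 0.5, 'table': 0.5}
print(generate("the"))                 # "the cup is on the cup is"
```

Everything the model “knows” lives in those transition statistics; there is no persistent state beyond the text already emitted.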
This architecture has proven extraordinarily powerful for tasks that can be reduced to sequence completion. If you have seen enough chess games represented in text, you can predict the next move. If you have seen enough Python code, you can predict the next line. If you have seen enough human conversations, you can predict responses that feel natural and helpful. The scaling laws demonstrated that performance on these tasks improves predictably with model size and training data, and for a brief window, it appeared that all of intelligence might reduce to sequence prediction at sufficient scale.
But physical intelligence does not reduce to sequence prediction. When a robot needs to pick up a cup, it does not need to predict what word comes next. It needs to maintain a persistent internal model of where the cup is located, how its own gripper is positioned in three-dimensional space, what forces will be required to grasp without crushing, how the cup’s center of mass will shift as liquid sloshes, and what trajectory will move the cup from table to mouth without collision or spillage. This requires an entirely different cognitive architecture.
World models, in the technical sense pioneered by David Ha and Jürgen Schmidhuber in 2018 and elaborated by Yann LeCun in his 2022 framework, learn internal representations of spatial and temporal dynamics. They do not predict tokens. They predict states. They encode observations into a latent representation of the world, simulate how that representation evolves through time in response to actions, and decode predictions about future observations. They maintain the persistent memory that LLMs fundamentally cannot, track objects through occlusions, simulate physics, and support the counterfactual reasoning required for planning.
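That encode, simulate, decode loop can be written down schematically. The Python sketch below is not the architecture of Genie, Cosmos, or V-JEPA; the encoder, dynamics, and decoder are hand-written placeholders for what those systems learn from data, and the state variables are invented for illustration.

```python
# Schematic world-model cycle: encode an observation into a latent state,
# roll that state forward under candidate actions, decode predicted observations.
# In a real system, each of these three functions is a learned neural network.

def encode(observation):
    """Compress a raw observation into a compact internal state.
    Here the 'observation' is simply (position, velocity) of one object."""
    position, velocity = observation
    return {"pos": position, "vel": velocity}

def predict(state, action, dt=0.1):
    """Latent dynamics: advance the state one time step under an action (a push)."""
    vel = state["vel"] + action * dt
    pos = state["pos"] + vel * dt
    return {"pos": pos, "vel": vel}

def decode(state):
    """Map the internal state back to a predicted observation."""
    return (round(state["pos"], 3), round(state["vel"], 3))

def rollout(observation, actions):
    """Counterfactual planning: simulate an entire action sequence before acting."""
    state = encode(observation)
    trajectory = []
    for action in actions:
        state = predict(state, action)   # state persists and evolves across steps
        trajectory.append(decode(state))
    return trajectory

# Compare two candidate plans without touching the real world.
print(rollout((0.0, 0.0), actions=[1.0] * 5))                   # gentle, sustained push
print(rollout((0.0, 0.0), actions=[5.0, 0.0, 0.0, 0.0, 0.0]))   # one hard shove
```

The content of these functions is trivial; the shape of the loop is the point. State persists between steps, actions change it, and candidate plans can be scored by simulation before anything is executed, none of which has an analogue in next-token prediction.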
The distinction is not subtle. It is architectural. An LLM trained on all the text ever written about gravity will know that objects fall when dropped. A world model trained on video of falling objects will simulate the trajectory, calculate the impact point, and predict the bounce pattern. The LLM knows facts about physics. The world model simulates physics. For a robot catching a ball or a vehicle navigating traffic or a manufacturing system coordinating precision assembly, the difference is everything.
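The falling-object contrast can be made literal in a dozen lines. The sketch below hand-codes the physics that a learned world model must approximate from video; the table height and launch speed are invented for illustration.

```python
# Simulating a trajectory rather than reciting facts about gravity:
# a cup nudged off a 0.8 m table at 1.2 m/s, integrated in small time steps.

G = 9.81            # gravitational acceleration, m/s^2
DT = 0.001          # integration step, seconds

x, y = 0.0, 0.8     # start at the table edge, 0.8 m above the floor
vx, vy = 1.2, 0.0   # horizontal speed from the nudge, no initial vertical speed

t = 0.0
while y > 0.0:      # step the state forward until the cup reaches the floor
    vy -= G * DT
    x += vx * DT
    y += vy * DT
    t += DT

print(f"impact after {t:.2f} s, {x:.2f} m from the table edge")
# Closed-form check: t = sqrt(2 * 0.8 / 9.81) ≈ 0.40 s, so x = 1.2 * t ≈ 0.48 m
```

A language model can assert that the cup will fall; a rollout of this kind is what it takes to say where and when.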
III. THE PROOF IS SHIPPING
For years, the world model thesis remained theoretical, a research direction pursued by those who believed LLM scaling would plateau before achieving general intelligence. That era ended in 2025. The proof is no longer confined to academic papers. It is shipping in production systems from the world’s most capable technology organizations.
Google DeepMind’s Genie 3, announced in August 2025, generates interactive three-dimensional environments at 720p resolution and 24 frames per second. Users can navigate these generated worlds in real time, and the system maintains consistency across sessions lasting several minutes, tracking approximately 200 to 300 distinct world modifications without degradation. Demis Hassabis, DeepMind’s CEO, stated unequivocally that “building world models has always been the plan for DeepMind to get to AGI.” This is not a research preview. This is a stake in the ground about the architectural path to machine superintelligence.
NVIDIA’s Cosmos platform, launched at CES in January 2025, represents the first attempt to build world models at infrastructure scale. Trained on 9,000 trillion tokens derived from 20 million hours of video and sensor data, Cosmos provides physics-aware simulation for robotics and autonomous systems. The adopter list reads like a directory of companies betting their futures on physical AI: Figure AI, 1X Technologies, Agility Robotics, Toyota, XPENG, Uber, Wayve, Microsoft, and Siemens. Jensen Huang, who has never been accused of underselling NVIDIA’s position, called this “the ChatGPT moment for robotics.” Given that NVIDIA’s data center revenue reached $115.2 billion in fiscal 2025, up 142 percent year over year, Huang’s assessment of where value creation is headed carries material weight.
Meta’s V-JEPA 2, released in June 2025, achieves 65 to 80 percent zero-shot success on robotic manipulation tasks using only 62 hours of robot-specific training data. The architecture predicts in embedding space rather than pixel space, enabling planning that runs 30 times faster than Cosmos for comparable tasks. Yann LeCun, who left Meta in December to found AMI Labs at a reported €3 billion valuation, designed V-JEPA specifically because he believes autoregressive language models are fundamentally incapable of achieving physical intelligence. His departure, together with the reported €500 million fundraise to pursue world models independently, represents perhaps the most significant intellectual defection in the brief history of artificial intelligence.
World Labs, founded by Fei-Fei Li and valued at $1 billion within four months of its September 2024 launch, shipped Marble in November 2025. The system generates persistent, explorable three-dimensional worlds from text and images, maintaining spatial consistency across unlimited session duration. Li’s assessment of the competitive landscape could not be more direct: “AGI will not be complete without spatial intelligence.”
Tesla’s Full Self-Driving version 14, released in October 2025, incorporates what the company calls a “neural world simulator.” According to Ashok Elluswamy, Tesla’s Vice President of AI Software, the system synthesizes high-fidelity video of the world in response to AI actions, enabling closed-loop simulation, adversarial scenario testing, and large-scale reinforcement learning. The same architecture powers both FSD and Optimus, Tesla’s humanoid robot. With 5 million vehicles collecting real-world driving data and 7 billion accumulated FSD miles, Tesla possesses approximately 1,000 times more embodied training data than any competitor.
These are not research demonstrations. They are production systems from organizations that collectively command trillions of dollars in market capitalization, built by teams that include multiple Turing Award laureates, shipping to customers who are betting their companies on physical intelligence. The debate about whether world models are necessary for embodied AI is over. The only remaining questions are how fast the transition happens and who captures the value.
IV. THE ARITHMETIC OF MISALLOCATION
Morgan Stanley’s research division projects that cumulative global investment in artificial intelligence infrastructure will reach $2.9 trillion by 2028. This figure encompasses $1.6 trillion in hardware, primarily GPUs and specialized accelerators, and $1.3 trillion in physical infrastructure including data centers, power generation, and cooling systems. The buildout is proceeding at a pace that dwarfs previous technology cycles, with hyperscaler capital expenditure expected to reach $4 trillion by 2030.


