Why LeCun is Betting on World Models (and Why Builders Should Pay Attention)
At Nvidia’s GTC 2025, Yann LeCun said something that cut through the noise:
“I’m not so interested in LLMs anymore.”
TheAIGRID has a strong breakdown of his remarks. The takeaway is clear: LLMs have reached the point of diminishing returns. Scaling them further is useful, but it will not unlock the next stage of intelligence.
The Four Frontiers
LeCun outlined four capabilities that define the real challenge ahead:
Grounded understanding of the physical world
Persistent memory
Reasoning
Planning
These are the exact gaps builders encounter when trying to turn LLM demos into production-grade systems.
Where Tokens Break Down
Tokens work well for language, but they do not map neatly to the continuous and unpredictable nature of the real world.
Discretization hides the richness of physical dynamics
Breaking the world into fixed tokens oversimplifies reality. Physical processes are continuous and fluid, not neatly bucketed.
Example: A robot arm pouring liquid cannot be reduced to a finite set of “token” movements. The flow and slosh of water are continuous, and discretization misses those dynamics.
Predicting pixels wastes resources on details that are unknowable
Models spend capacity trying to guess noise, like the exact shape of someone’s face in the next video frame. The important structure gets lost in the effort.
Example: If a model tries to predict every pixel of a crowded street scene, it wastes energy guessing who will turn their head left or right, instead of focusing on the physics of cars moving through traffic.
Next-token prediction does not translate into reliable next-action prediction
Language fluency does not guarantee grounded decisions. Predicting the “right word” is not the same as predicting the “right move” in a real environment.
Example: An LLM can generate a perfect recipe description but cannot guarantee the cake will rise. The transition from text to action requires grounding in physical cause and effect.
LeCun’s answer is Joint Embedding Predictive Architectures (JEPA / V-JEPA). These architectures predict in latent space: abstract representations that capture what matters, ignore what does not, and reveal when the world violates learned patterns.
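To make that concrete, here is a minimal sketch of the idea in PyTorch. It is not the V-JEPA implementation; the Encoder, Predictor, and dimensions are illustrative. The key point is that the loss lives in embedding space, not pixel space.

```python
# Minimal JEPA-style sketch: predict the *embedding* of the next observation,
# not its pixels. Illustrative only; not the V-JEPA implementation.
import torch
import torch.nn as nn

LATENT_DIM = 128  # illustrative size

class Encoder(nn.Module):
    """Maps a raw observation (here: a flattened frame) to a latent vector."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM)
        )

    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the next latent from the current latent plus an action."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + action_dim, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM)
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

def jepa_loss(encoder, target_encoder, predictor, obs, action, next_obs):
    """The loss lives in latent space: predicted embedding vs. target embedding."""
    z = encoder(obs)
    with torch.no_grad():  # targets come from a frozen/EMA copy of the encoder
        z_next_target = target_encoder(next_obs)
    z_next_pred = predictor(z, action)
    return nn.functional.mse_loss(z_next_pred, z_next_target)
```

In practice JEPA-style training also needs a mechanism to prevent representation collapse (for example, an EMA target encoder or variance regularization), which this sketch omits.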
System 1 and System 2
Psychologists describe two modes of thought:
System 1: fast, reactive, policy-driven
System 2: slower, deliberate, planning-oriented
LLMs approximate System 1. System 2 requires models that can manipulate abstract representations of the world and plan within them. That is the gap LeCun is targeting.
The Data Imbalance
Consider the scale of human sensory input versus text:
LLMs train on ~30 trillion tokens (~10^14 bytes)
A child takes in ~10^14 bytes through vision alone by the age of four
The conclusion is direct: text-only training cannot match the richness of multimodal world experience.
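The arithmetic behind that comparison is rough but easy to check. The figures below (bytes per token, waking hours, optic-nerve bandwidth) are the approximations LeCun typically cites, treated here as assumptions rather than measurements.

```python
# Back-of-the-envelope only; every figure here is a rough assumption.
llm_tokens = 30e12                    # ~30 trillion training tokens
bytes_per_token = 3                   # rough average
llm_bytes = llm_tokens * bytes_per_token                 # ~0.9e14 bytes

waking_hours_by_age_four = 16_000     # ~16,000 hours awake in four years
optic_nerve_bytes_per_sec = 2e6       # ~2 MB/s, LeCun's usual estimate
child_bytes = waking_hours_by_age_four * 3600 * optic_nerve_bytes_per_sec  # ~1.2e14 bytes

print(f"LLM training text:    {llm_bytes:.1e} bytes")
print(f"Child's visual input: {child_bytes:.1e} bytes")
```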
Implications for Builders
This research agenda has immediate consequences for how we design products today:
🧠 Memory as a product surface
Create explicit, structured memory for entities, states, and goals (a sketch follows this list).
⚙️ Two operating loops
Planning in latent space, then distilled policies for fast execution.
📊 Rethink evaluation
Track latent prediction accuracy, surprise recovery, and policy efficiency instead of surface-level fluency.
💬 Tactical use of LLMs
Use them where they excel: language, orchestration, and code generation. Connect them to models that represent real-world state and dynamics.
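As a sketch of the 🧠 memory point above: the difference between "memory as a product surface" and "memory as a longer prompt" is an explicit store the system reads and writes. The schema below is hypothetical; the shape of the idea is what matters.

```python
# Hypothetical sketch of explicit, structured memory for entities, states, and goals.
# The point is that state lives in a queryable store, not buried in a prompt transcript.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class EntityMemory:
    entity_id: str
    state: dict[str, Any] = field(default_factory=dict)   # current known facts, e.g. {"status": "shipped"}
    goals: list[str] = field(default_factory=list)         # open objectives tied to this entity
    last_updated: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class MemoryStore:
    """Explicit memory surface: the LLM reads from and writes to this, instead of re-deriving state."""
    def __init__(self):
        self._entities: dict[str, EntityMemory] = {}

    def update(self, entity_id: str, **state_changes: Any) -> None:
        mem = self._entities.setdefault(entity_id, EntityMemory(entity_id))
        mem.state.update(state_changes)
        mem.last_updated = datetime.now(timezone.utc)

    def recall(self, entity_id: str) -> EntityMemory | None:
        return self._entities.get(entity_id)

# Usage: persist what matters across sessions instead of re-prompting it.
store = MemoryStore()
store.update("order_1042", status="delayed", carrier="DHL")
print(store.recall("order_1042"))
```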
A 90-Day Blueprint
You do not need a research lab to explore this direction.
Weeks 1–2
Select a task with real dynamics. Define a compact state representation.
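"Compact state representation" means a handful of fields that actually drive the dynamics, not the raw observation or chat history. A hypothetical example for an order-fulfilment task (field names are illustrative):

```python
# Hypothetical compact state for a task with real dynamics (here: order fulfilment).
# A few fields that capture what actually changes, instead of the raw log or chat stream.
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderState:
    stage: str                # "placed" -> "picked" -> "shipped" -> "delivered"
    days_in_stage: int        # how long we have been in the current stage
    carrier_delay: bool       # known disruption flag
    customer_contacted: bool  # whether we have already reached out

initial = OrderState(stage="placed", days_in_stage=0,
                     carrier_delay=False, customer_contacted=False)
```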
Weeks 3–6
Train a self-supervised encoder and predictor to model transitions. Measure plausibility rather than pixel fidelity.
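"Plausibility rather than pixel fidelity" can be scored directly in latent space: how close does the predicted next embedding land to the embedding of what actually happened? A small sketch, assuming an encoder and predictor like the ones in the earlier JEPA sketch:

```python
# Score a predicted transition by how close it lands to the real next state in latent space.
# Assumes an encoder/predictor like the earlier sketch; a score near 1.0 means "plausible
# transition", regardless of how different the raw pixels are.
import torch
import torch.nn.functional as F

@torch.no_grad()
def transition_plausibility(encoder, predictor, obs, action, next_obs) -> float:
    z_pred = predictor(encoder(obs), action)
    z_true = encoder(next_obs)
    return F.cosine_similarity(z_pred, z_true, dim=-1).mean().item()
```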
Weeks 7–10
Layer in a planner to act over the latent space. Track when and how the system recovers from surprises.
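The planner can start embarrassingly simple: random shooting over the latent space, i.e. sample candidate action sequences, roll them forward with the learned predictor, and keep the one that ends closest to a goal embedding. A sketch under those assumptions (CEM or gradient-based planning are natural upgrades):

```python
# Random-shooting planner over the learned latent space (illustrative sketch).
import torch

@torch.no_grad()
def plan(predictor, z_start, z_goal, action_dim, horizon=5, n_candidates=256):
    # Sample candidate action sequences: (n_candidates, horizon, action_dim).
    actions = torch.randn(n_candidates, horizon, action_dim)
    z = z_start.expand(n_candidates, -1)          # start every rollout from the current latent
    for t in range(horizon):
        z = predictor(z, actions[:, t])           # imagine one step forward in latent space
    dist = torch.norm(z - z_goal, dim=-1)         # how close each imagined rollout ends to the goal
    best = int(torch.argmin(dist))
    return actions[best, 0]                       # execute only the first action, then re-plan
```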
Weeks 11–12
Distill the planner into a lightweight policy. Add a “surprise guard” that triggers re-planning when prediction errors spike.
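The surprise guard itself can be a running check on latent prediction error: stay on the cheap distilled policy while the world matches expectations, and hand control back to the planner when the error spikes. A hypothetical sketch, with the threshold left as a tuning knob:

```python
# Hypothetical surprise guard: run the fast distilled policy while latent prediction error
# stays low; hand control back to the planner when the world stops matching expectations.
import torch

class SurpriseGuard:
    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold   # tuning knob; calibrate on held-out transitions

    @torch.no_grad()
    def is_surprised(self, encoder, predictor, prev_obs, prev_action, obs) -> bool:
        z_expected = predictor(encoder(prev_obs), prev_action)
        z_actual = encoder(obs)
        error = torch.nn.functional.mse_loss(z_expected, z_actual).item()
        return error > self.threshold

def act(policy, replan, guard, encoder, predictor, prev_obs, prev_action, obs):
    """System 1 by default, System 2 only when the prediction error spikes."""
    if guard.is_surprised(encoder, predictor, prev_obs, prev_action, obs):
        return replan(obs)      # slow path: e.g. the latent-space planner above
    return policy(obs)          # fast path: cheap distilled policy
```

Logging how often is_surprised fires, and how quickly the system recovers afterwards, also gives you the surprise-recovery metric mentioned in the evaluation point above.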
The Bigger Picture
LeCun’s shift away from LLMs signals where the field is heading. Language interfaces will remain essential, but progress now depends on systems that can model the world, remember, reason, and plan.
The open challenge for builders:
👉 Where in your product could latent-space planning outperform a larger prompt?
How to Spot If Your Product Is Hitting LLM Limits
If you are building with LLMs today, here are clear signs you are running into structural limits rather than prompt tuning problems:
Forgetting across sessions
Users expect the system to remember entities, states, and goals. Your product relies on hacks like re-prompting or long context windows.
Slow or expensive reasoning
Tasks require multi-step reasoning chains, but the model burns cost and latency on re-generating context every time.
Unreliable action prediction
The model produces fluent text but fails to reliably trigger the correct action in dynamic or stateful environments.
Evaluation mismatch
Your metrics focus on fluency or token accuracy, but users care about planning quality, adaptation speed, and success rates.
Scaling without leverage
Bigger prompts and bigger models give marginal gains. What you need is structured state, not more tokens.
If you see these patterns, it is time to rethink your architecture. Keep the LLM as the language interface, but pair it with a world model that can represent state, reason in latent space, and plan actions. That is the step change LeCun is pointing towards.
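Concretely, that pairing can be wired so the LLM translates intent into a structured goal, the world model plans in latent space, and the LLM narrates the outcome. Every name in the sketch below is a placeholder for your own components, not a real API:

```python
# Hypothetical wiring: LLM as the language layer, world model as the state/planning layer.
# Every name here is a placeholder for your own components, not a real API.

def handle_request(user_message, llm, store, encoder, planner, execute):
    # 1. LLM turns free text into a structured goal over known entities.
    goal = llm.extract_goal(user_message, schema=store.schema())

    # 2. World model plans in latent space from the currently stored state.
    z_now = encoder(store.current_state(goal.entity_id))
    z_goal = encoder(goal.target_state())
    next_action = planner(z_now, z_goal)

    # 3. Execute, write the outcome back to explicit memory, let the LLM narrate it.
    result = execute(next_action)
    store.update(goal.entity_id, **result.state_changes)
    return llm.narrate(result)
```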
💡 In my next post I will go deeper into practical patterns: concrete examples of how to approximate world models today using four simple building blocks (entity and state memory, transition functions, LLM-assisted reasoning, and simulation by replay), each with real use cases you can start testing immediately.