A Break-Down of Research in Artificial Intelligence

Watch and Learn? This AI Actually Does

V-JEPA 2 uses predictive self-supervised learning to teach AI systems how to understand and act in physical environments.

What if an AI could learn the physical rules of the world just by watching YouTube?

That’s the provocative question behind V-JEPA 2, a new research breakthrough from Meta. The project tackles a deep and long-standing weakness in artificial intelligence: most AI systems don’t actually understand the world—they memorize patterns. That’s why even the most advanced models often crumble when asked to do something unfamiliar in the real world, like moving a robot arm to grab an object they’ve never seen before, or predicting what will happen if someone knocks over a glass on a table.

The core problem here is generalization. While humans can learn a huge amount just by watching others—whether it’s a baby observing how gravity works or an engineer shadowing a factory floor—AI systems typically need extensive labeled data and detailed instructions to perform well. This makes them expensive to train, narrow in scope, and hard to scale. Want to teach a robot how to fold laundry? You’ll need to show it thousands of hand-annotated examples. Want it to adapt to a new brand of detergent bottle? Start from scratch.

V-JEPA 2 tries to flip that model on its head. Instead of teaching an AI by feeding it labeled examples of every task, the researchers ask: What if we could build an AI that first learns a general-purpose understanding of the physical world—just by watching—and then quickly adapts that knowledge to act in new situations with minimal extra data?

That’s the central idea: learn first by observing everything, then act smartly with almost nothing.

To do this, V-JEPA 2 introduces a two-part system:

First, it trains a “world model” by watching over a million hours of unlabeled video, from internet clips to surveillance-style footage. Importantly, it doesn’t just try to recognize objects or actions, as previous video models did. Instead, it’s optimized to predict what happens next in a scene. For example, if a ball rolls off a table, the model learns to anticipate that it will fall to the floor, even though it has never been explicitly told what gravity is. This is called self-supervised predictive learning: the model teaches itself by forecasting what comes next in a video from what it has already seen, working in its own learned representation of the scene rather than pixel by pixel.
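
To make this concrete, here is a minimal sketch of a predictive objective in this spirit, written in PyTorch with tiny stand-in networks and random data. The real system is a large video transformer trained at enormous scale; every module name, dimension, and hyperparameter below is an illustrative assumption, not Meta’s actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM, LATENT_DIM = 768, 256   # hypothetical sizes for the sketch

# Tiny stand-ins for the video encoder and the predictor.
encoder = nn.Sequential(nn.Linear(FEATURE_DIM, LATENT_DIM), nn.GELU(),
                        nn.Linear(LATENT_DIM, LATENT_DIM))
predictor = nn.Sequential(nn.Linear(LATENT_DIM, LATENT_DIM), nn.GELU(),
                          nn.Linear(LATENT_DIM, LATENT_DIM))

# Target encoder: a frozen copy of the encoder (in practice updated slowly,
# e.g. by an exponential moving average, which is omitted here).
target_encoder = nn.Sequential(nn.Linear(FEATURE_DIM, LATENT_DIM), nn.GELU(),
                               nn.Linear(LATENT_DIM, LATENT_DIM))
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)

def predictive_loss(past_clip, future_clip):
    """No labels anywhere: the video itself provides the supervision."""
    context = encoder(past_clip)               # what the model has seen so far
    with torch.no_grad():
        target = target_encoder(future_clip)   # what actually happens next
    prediction = predictor(context)            # the model's guess about it
    return F.mse_loss(prediction, target)

# One illustrative training step on random stand-in "clip features".
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)
past, future = torch.randn(8, FEATURE_DIM), torch.randn(8, FEATURE_DIM)
optimizer.zero_grad()
loss = predictive_loss(past, future)
loss.backward()
optimizer.step()
```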

This predictive training forces the AI to internalize basic dynamics of motion, cause and effect, and object interactions—not just appearances.

Second, after learning these general visual dynamics, the model is connected to other systems for more specific tasks. For instance, it’s linked to a large language model so it can answer natural-language questions about what’s happening in a video. It’s also extended with a lightweight control module that takes in robot actions and predicts their impact on the visual world. In other words, it can now imagine what will happen if a robot moves left or picks something up, even if it hasn’t been trained on that specific motion before.
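
A rough sketch of what such a lightweight, action-conditioned module could look like follows, again with stand-in sizes: it takes the latent encoding of the current camera view plus a candidate robot action and guesses the next latent state. The class name, layer widths, and 7-dimensional action are assumptions for illustration, not Meta’s actual design.

```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 256, 7   # e.g. a 7-dimensional arm command (an assumption)

class ActionConditionedPredictor(nn.Module):
    """Given 'what the scene looks like now' and 'what the robot does next',
    predict 'what the scene will look like afterwards', all in latent space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 512), nn.GELU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, state_latent, action):
        return self.net(torch.cat([state_latent, action], dim=-1))

# The pretrained world model stays frozen; only this lightweight head is
# trained, on a comparatively small amount of robot interaction data.
head = ActionConditionedPredictor()
state = torch.randn(1, LATENT_DIM)    # encoding of the current camera view
action = torch.randn(1, ACTION_DIM)   # a candidate robot command
imagined_next_state = head(state, action)
```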

This two-part method—general world learning first, task adaptation later—is both novel and surprisingly efficient. By separating learning from doing, it opens the door to AI that can handle entirely new challenges with very little task-specific tuning. It’s a major departure from the “one-task, one-dataset” approach that has dominated AI development for the past decade.

And the bet V-JEPA 2 makes is a bold one: if AI can watch long enough, it might not just learn to see—it might learn to understand.

To find out whether this idea actually works in practice, the researchers behind V-JEPA 2 put their system through a rigorous series of tests—spanning everything from watching videos of people cooking to actually piloting real-world robots. Their goal wasn’t just to check whether the AI could recognize what it was seeing, but to evaluate whether it had developed a genuine understanding of how things in the world unfold and how actions influence outcomes.

To start, they tested the model on its ability to make sense of motion. One benchmark involved short video clips where people perform common physical actions—like pouring water, opening drawers, or flipping switches. The model wasn’t given labels for these actions. Instead, it had to fill in the blanks: what would happen next? Would the drawer be open or closed? Would the water spill? Success here meant correctly predicting the near-future sequence of frames, not simply identifying objects or gestures.

In another set of experiments, the model had to anticipate human behavior in longer, more complex videos. For example, in footage from a kitchen setting, it was asked to guess what action might come next based on what had just happened. Would the person chop an onion, reach for a pan, or turn on the stove? This kind of task tests the model’s ability to capture not just raw movement, but the intent and flow of human activities.
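
One common way to score this kind of anticipation, sketched below with hypothetical stand-ins, is to freeze the pretrained backbone and train only a small probe that maps its features to the next-action label; the paper’s exact protocol, benchmark, and label set are not reproduced here.

```python
import torch
import torch.nn as nn

NUM_ACTIONS, LATENT_DIM = 10, 256   # e.g. "chop onion", "reach for pan", ...

frozen_encoder = nn.Linear(768, LATENT_DIM)   # stand-in for the pretrained backbone
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

probe = nn.Linear(LATENT_DIM, NUM_ACTIONS)    # the only part that would be trained

def anticipation_accuracy(clip_features, next_action_labels):
    """clip_features: (batch, 768) features of video *before* the action happens;
    next_action_labels: (batch,) index of the action that actually followed."""
    with torch.no_grad():
        latents = frozen_encoder(clip_features)
    predicted = probe(latents).argmax(dim=-1)
    return (predicted == next_action_labels).float().mean().item()

# Illustrative call on random stand-in data.
clips = torch.randn(16, 768)
labels = torch.randint(0, NUM_ACTIONS, (16,))
print(f"top-1 anticipation accuracy: {anticipation_accuracy(clips, labels):.2f}")
```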

To see how broadly the model’s understanding could be applied, the researchers connected V-JEPA 2 to a language interface and asked it to answer questions about video clips—things like “Why did the cup fall?” or “What happened just before the lights turned off?” Unlike traditional video question-answering models, which are trained on curated examples and specific prompts, V-JEPA 2 was working off its general world knowledge. The evaluation here was based on how well its responses aligned with human judgment.

But perhaps the most striking demonstration of V-JEPA 2’s capabilities came from its performance in robotics. In this final set of experiments, the researchers gave the model a new kind of challenge: take what you’ve learned from internet videos and apply it to controlling a robotic arm—without being trained on that task. The setup involved a robot tasked with picking up and moving objects based solely on visual goals (like an image of the target arrangement). V-JEPA 2 had never seen these scenes or objects before. It had to plan what the robot should do based entirely on its prior understanding of how actions affect the world visually.
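
The sketch below shows one simple way such goal-image planning can work, assuming an encoder and an action-conditioned world model like the hypothetical ones above: sample candidate action sequences, imagine each rollout in latent space, and keep the one that ends closest to the goal embedding. Real systems typically use a smarter sampler (for example, the cross-entropy method); random shooting just keeps the idea visible.

```python
import torch

LATENT_DIM, ACTION_DIM = 256, 7
HORIZON, NUM_CANDIDATES = 5, 256

def plan(current_latent, goal_latent, world_model):
    """world_model(state, action) -> predicted next latent state.
    Choose the action sequence whose imagined final state lands closest
    to the embedding of the goal image."""
    best_first_action, best_distance = None, float("inf")
    for _ in range(NUM_CANDIDATES):
        actions = torch.randn(HORIZON, 1, ACTION_DIM)   # one random candidate plan
        state = current_latent
        for a in actions:                               # imagine the rollout
            state = world_model(state, a)
        distance = torch.norm(state - goal_latent)      # how far from the goal?
        if distance < best_distance:
            best_first_action, best_distance = actions[0], distance
    return best_first_action   # execute it, observe the result, then replan

# Illustrative use with a stand-in world model and random embeddings.
toy_model = lambda state, action: state + 0.1 * torch.randn_like(state)
current = torch.randn(1, LATENT_DIM)   # encoding of the current camera image
goal = torch.randn(1, LATENT_DIM)      # encoding of the goal image
first_action = plan(current, goal, toy_model)
```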

This ability to act—without direct instruction or hand-crafted reward functions—is a major leap. To evaluate whether it worked, researchers observed whether the robot could complete its tasks correctly (e.g., pick up the right item, place it in the intended location), and how reliably it could do so across different environments and objects. The key success metric wasn’t perfection, but generalization: could the system make reasonable, effective decisions in new settings, even with minimal or no extra training?
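
To make that metric concrete, here is a purely illustrative bit of bookkeeping: success rates reported per environment rather than as a single aggregate number. The trial records are invented for the example.

```python
from collections import defaultdict

# Invented trial records, purely for illustration.
trials = [
    {"environment": "lab_a", "object": "cup",    "success": True},
    {"environment": "lab_a", "object": "box",    "success": False},
    {"environment": "lab_b", "object": "bottle", "success": True},
    {"environment": "lab_b", "object": "cup",    "success": True},
]

per_env = defaultdict(lambda: [0, 0])   # environment -> [successes, attempts]
for trial in trials:
    per_env[trial["environment"]][0] += int(trial["success"])
    per_env[trial["environment"]][1] += 1

for env, (successes, attempts) in sorted(per_env.items()):
    print(f"{env}: {successes}/{attempts} = {successes / attempts:.0%}")
```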

By comparing its performance to specialized systems that were trained explicitly for each task, the researchers were able to assess whether V-JEPA 2 was truly learning something more fundamental—and transferable. And across multiple domains, it consistently demonstrated that it could take its predictive, observational training and turn it into useful, grounded actions.

In short, V-JEPA 2 wasn’t just tested on its ability to see what’s happening—it was evaluated on whether it could think ahead, adapt, and take action. That’s a far tougher—and far more interesting—standard for success.

While the experimental results clearly showed that V-JEPA 2 could learn from raw video and apply that understanding in new ways, the researchers knew that a few successful demos weren’t enough. Real evaluation meant looking past impressive one-off results and asking harder questions: When does the model fail? Why? And how often?

That’s where the team dug deeper. They looked not only at whether V-JEPA 2 could complete a task, but at how reliably it did so—under different conditions, with new types of inputs, and with unseen variations. Success wasn’t defined by getting it right once, but by showing consistency across edge cases: videos with unusual lighting, strange object shapes, or unexpected human behavior. They also looked at whether the model’s internal representations—the “mental map” it builds of the world—actually improved decision-making, or just mimicked surface-level patterns.

Another form of evaluation involved comparing V-JEPA 2 to narrowly trained systems. In many AI workflows today, companies invest heavily in specialized models tailored to individual use cases: one model for detecting objects in a warehouse, another for guiding a robot arm, another for answering video questions. These systems tend to work well in the scenarios they were built for—but poorly anywhere else. So the researchers tested whether V-JEPA 2 could match or exceed these task-specific models without needing to be retrained for each domain. In many cases, it did. That was a strong signal that something deeper—something more general—was being learned.

Of course, no system is without its limits. V-JEPA 2 still inherits many of the biases found in internet video. Most of what it learns comes from common perspectives: human eye-level views, predictable lighting, familiar objects. It doesn’t yet understand rare events, unfamiliar environments, or non-human viewpoints very well. That limits its performance in fields like agriculture, underwater robotics, or drone operations in extreme terrain—contexts where the training data doesn’t look much like the real-world inputs.

Another limitation is scale. Training on a million hours of video—and running the resulting model—requires substantial computing power, which remains out of reach for many smaller labs or startups. That raises important questions about access and equity: who gets to build world models, and who gets locked out?

Still, the potential upside is massive. If this kind of generalist AI can be made reliable and affordable, it could dramatically reduce the need for labor-intensive data labeling and narrow-purpose engineering. For robotics, this means systems that can adapt on the fly—whether it’s in a warehouse, a hospital, or a disaster zone. For digital applications, it means AI that can understand and predict human behavior more naturally, across video, games, or augmented reality.

Looking ahead, future work on V-JEPA-style systems is likely to focus on closing the gap between observation and causality. The current model is excellent at forecasting what’s likely to happen next—but still relatively weak at inferring why it happened. That kind of deeper causal reasoning will be essential if we want these systems to operate safely in the open world, reason through alternatives, and make decisions under uncertainty.

Still, even at this stage, the implications are profound. V-JEPA 2 offers a glimpse of a new paradigm: one where AI doesn’t just memorize, doesn’t just mimic, but learns to model the world in ways that are adaptable, scalable, and increasingly human-like. It won’t replace task-specific engineering tomorrow—but it may reshape how we think about AI development in the years to come.

