A Breakdown of Research in Computer Vision & Pattern Recognition

Freeze Frame, Future Game

A new framework shows how frozen video models can be repurposed to predict motion, depth, and more—unlocking real-time foresight across multiple tasks.

Imagine watching a short video clip of a crowded street, and trying to answer this question: what will happen next? Will the cyclist swerve into traffic? Will the pedestrian step off the curb? Will the dog dart toward the crosswalk? For humans, forecasting what might unfold in the next second or two is an instinctual part of how we navigate the world. For AI systems (especially those designed to perceive video data), this kind of prediction remains a major unsolved problem.

That’s the gap the research paper “Generalist Forecasting with Frozen Video Models via Latent Diffusion” (from Google) aims to address. While recent advances in AI have produced video models that are excellent at interpreting what’s happening right now in a video feed (like identifying objects, estimating depth, or tracking motion), most of them are not equipped to predict what will happen next. This is a fundamental limitation for any system that needs to plan ahead (think of self-driving cars, warehouse robots, or drones). These systems don’t just need to “see” the current moment; they need to anticipate and react to what’s likely to happen in the immediate future.

At the heart of the issue is this: most video models have been trained to analyze still frames or short clips, but not to forecast. They are optimized for recognition, not prediction. So while a model might be able to tell you “there’s a person walking” or “a car is moving left,” it often has no built-in way to say “that person will likely cross the street” or “the car will slow down.”

That lack of foresight is a bottleneck for real-world applications. Forecasting the near future (just the next half-second or so) can make or break systems that depend on dynamic, real-time decisions. But training a new model from scratch to do this across a wide range of video tasks is expensive and complex. The research team took a different approach: can we take existing, high-performing video models and adapt them for forecasting (without retraining them entirely)?

The solution introduced in the paper is both elegant and efficient. Rather than redesigning everything from the ground up, the researchers start with a strong foundation: frozen video models. These are models that have already been trained on large datasets to understand visual scenes, and they remain “frozen” in the sense that their parameters aren’t updated. In business terms, think of it like plugging into a high-end analytics engine that’s already built, rather than starting from scratch.

To make these frozen models capable of forecasting, the researchers build a two-part system around them:

  1. Perception readouts: They bolt on small, trainable modules (called readouts) that extract task-specific outputs from the frozen model’s internal features. For example, a readout might convert those features into an estimated depth map or an object track.
  2. Latent diffusion forecaster: They introduce a special kind of generative AI model called a latent diffusion model. Instead of generating video frames directly, this model learns to predict the internal representations (the hidden features) of the frozen video model at future time steps. It’s like forecasting what the model will “see” next, then interpreting that prediction using the readouts.

The magic here is that this approach doesn’t require retraining the video model, collecting new forecasting labels, or designing a new architecture for each task. It’s a general, modular system that works across multiple tasks (whether that’s predicting future frames at the pixel level or forecasting where a tracked object will move).
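To make the modular design concrete, here is a minimal sketch in PyTorch of how the pieces could fit together. All class names (FrozenBackbone, DepthReadout, LatentForecaster) and layer choices are illustrative assumptions, not the paper’s actual implementation; in particular, the forecaster below is only a stand-in for the latent diffusion model, which in reality generates future latents through iterative denoising.

```python
# Minimal sketch of the modular setup described above (names are illustrative,
# not from the paper): a frozen backbone produces latent features, a small
# trainable readout decodes them into a task output (here, depth), and a
# forecaster predicts the backbone's *future* latents.

import torch
import torch.nn as nn


class FrozenBackbone(nn.Module):
    """Stand-in for a pretrained video model; its weights are never updated."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        for p in self.parameters():
            p.requires_grad = False  # "frozen": parameters are not trained

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, time, height, width) -> latents: (batch, feat_dim, time, H, W)
        return self.encoder(video)


class DepthReadout(nn.Module):
    """Small trainable head that maps frozen latents to a per-frame depth map."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Conv3d(feat_dim, 1, kernel_size=1)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.head(latents)


class LatentForecaster(nn.Module):
    """Placeholder for the latent diffusion forecaster: given latents for
    observed frames, predict latents for future frames. A real implementation
    would run an iterative denoising process; a single conv stands in here."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Conv3d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, past_latents: torch.Tensor) -> torch.Tensor:
        return self.net(past_latents)


if __name__ == "__main__":
    backbone, readout, forecaster = FrozenBackbone(), DepthReadout(), LatentForecaster()
    clip = torch.randn(1, 3, 8, 64, 64)          # 8 observed frames
    past_latents = backbone(clip)                # frozen perception features
    future_latents = forecaster(past_latents)    # forecast features, not pixels
    future_depth = readout(future_latents)       # decode the forecast into a task output
    print(future_depth.shape)                    # torch.Size([1, 1, 8, 64, 64])
```

The property to notice is that only the readout and the forecaster are trainable; the backbone’s parameters stay frozen throughout.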

The result is a flexible framework that transforms any perception-focused video model into a generalist forecasting engine, one that can anticipate what’s coming next across a wide range of visual tasks.

To test how well their forecasting framework actually worked, the researchers ran a series of structured experiments. But instead of building a custom model and showcasing how well it performed on a narrow task, they did something more interesting: they used nine different pretrained video models, each built for different purposes—ranging from still-image recognition systems to advanced video-synthesis engines—and put them through the same forecasting test bench. This wasn’t about showing off the best model. It was about showing how forecasting performance depends on the quality and design of the original perception model.

Each of these frozen models was paired with the same forecasting and readout setup. The idea was to level the playing field and test one core question: If we start with a better perception model, do we get better forecasting? To answer that, they looked across four types of tasks:

  1. Pixel-level forecasting: What does the scene look like in the next moment?
  2. Depth prediction: How far are things from the camera?
  3. Object tracking: Where will a moving object be next?
  4. Bounding box prediction: How will the shape and location of a moving object evolve?

Importantly, these tasks span a range of difficulty levels and information granularity, from low-level color and shape predictions to higher-level understanding of object motion. Across all of them, each frozen backbone was paired with the same readout and forecasting modules, offering a true apples-to-apples comparison.

So how do you judge whether a forecast is any good? The researchers used a two-pronged evaluation strategy that captured both per-example accuracy and the realism of the predicted distribution.

To evaluate whether a forecasting system is actually useful, the research team went beyond simplistic comparisons like “Did it guess the next frame correctly?” Instead, they assessed how well and how consistently each system predicted what happens next, across different types of tasks and levels of uncertainty.

First, they focused on per-example performance. For every video clip, the system didn’t just make one forecast; it generated multiple plausible futures (thanks to the diffusion model’s generative nature). Then it compared each of these against the actual ground truth using familiar metrics: for example, how close the predicted pixel values were to the real ones, or how well predicted object positions lined up with real trajectories.
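As a rough illustration of what per-example scoring can look like, the sketch below draws several sampled futures for a single clip and compares each to the ground truth with mean squared error. The specific metric and the choice to report both the average and the best sample are assumptions for illustration, not the paper’s exact protocol.

```python
# Illustrative per-example evaluation (assumed details, not the paper's exact
# protocol): score several sampled futures against the ground truth and keep
# both the average and the best-sample score.

import numpy as np


def per_example_scores(samples: np.ndarray, ground_truth: np.ndarray) -> dict:
    """samples: (num_samples, ...) sampled future predictions for one clip.
    ground_truth: (...) the observed future for the same clip."""
    # Mean squared error of each sample against the true future.
    errors = np.array([np.mean((s - ground_truth) ** 2) for s in samples])
    return {
        "mean_over_samples": float(errors.mean()),  # average quality of the forecast set
        "best_sample": float(errors.min()),         # does *any* sample match reality well?
    }


# Example: 10 sampled 8-frame futures for one clip vs. the actual future.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10, 8, 64, 64))
truth = rng.normal(size=(8, 64, 64))
print(per_example_scores(samples, truth))
```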

But perhaps more insightful was their distributional evaluation. In the real world, there’s often more than one plausible future. A forecasting model that always picks just one possibility might seem “confident,” but that doesn’t make it accurate. The researchers addressed this by measuring how closely the distribution of predicted outcomes matched the actual range of possible futures. They used a tool called Fréchet Distance (essentially a way of comparing two clouds of data points) to measure how similar the overall shape of the predicted future was to the real one.
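The Fréchet Distance has a standard closed form when each cloud of points is summarized by a Gaussian, the same formulation behind FID/FVD-style metrics. The sketch below shows that computation; treating it as the paper’s exact implementation is an assumption.

```python
# Fréchet distance between two sets of feature vectors, computed by fitting a
# Gaussian to each set (the formulation used by FID/FVD-style metrics).

import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_a, feats_b: (num_points, feature_dim) arrays of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * (cov_a @ cov_b)^(1/2))
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean))


# Example: compare features of predicted futures against features of real futures.
rng = np.random.default_rng(0)
predicted = rng.normal(size=(500, 16))
real = rng.normal(loc=0.1, size=(500, 16))
print(frechet_distance(predicted, real))
```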

In addition, they checked for diversity in the predictions: if all ten sampled forecasts look the same, the system might be overconfident or rigid. By analyzing the spread and variance of predictions, they could determine whether the model captured the inherent uncertainty of future events, which is crucial for risk-sensitive applications like autonomous driving or robotics.
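One simple way to quantify that spread, offered here as an illustrative measure rather than the paper’s own, is the average pairwise distance between sampled forecasts:

```python
# Simple diversity check over sampled forecasts (an illustrative measure):
# average pairwise distance between samples. Values near zero suggest the
# model collapses to a single, overconfident future.

import numpy as np
from itertools import combinations


def sample_diversity(samples: np.ndarray) -> float:
    """samples: (num_samples, ...) set of sampled futures for one clip."""
    flat = samples.reshape(len(samples), -1)
    dists = [np.linalg.norm(flat[i] - flat[j])
             for i, j in combinations(range(len(flat)), 2)]
    return float(np.mean(dists))


rng = np.random.default_rng(0)
diverse = rng.normal(size=(10, 8, 64, 64))
collapsed = np.repeat(diverse[:1], 10, axis=0)   # ten identical forecasts
print(sample_diversity(diverse), sample_diversity(collapsed))  # large value vs. 0.0
```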

Together, these evaluation tools provided a robust way to measure not only the accuracy of individual forecasts, but also the model’s ability to reflect real-world unpredictability.

Despite its strengths, the proposed solution isn’t without limitations. The most immediate constraint is short time horizons: the model forecasts only about a dozen frames into the future—roughly half a second to one second, depending on frame rate. That’s useful for many tasks, but insufficient for more complex scenarios like multi-step planning or long-horizon navigation. Extending that forecasting window while maintaining realism and diversity remains a major challenge.

Second, the framework depends entirely on frozen backbones—meaning, the core video model isn’t fine-tuned for forecasting. While this approach makes the system broadly compatible and easy to scale, it may underperform in edge cases where the perception model was never optimized to represent time-sensitive features.

Another limitation is the task scope. The experiments were limited to a set of four concrete forecasting challenges. In practice, many industries need systems that predict semantic or higher-order outcomes (such as human intent, social dynamics, or physical interactions—not just object motion or scene appearance).

Yet, even with these limitations, the impact of the work is substantial. This paper introduces a general-purpose forecasting layer that can be attached to virtually any state-of-the-art video model. It shows that the ability to forecast is not just about training a bespoke system for each new problem, but about leveraging what perception models already understand (and projecting that forward intelligently).

In doing so, it creates a flexible new foundation for video intelligence systems that aren’t just reactive, but predictive. That opens doors in industries ranging from robotics and autonomous vehicles to security, logistics, and even retail: anywhere that understanding what comes next is just as important as understanding what’s happening now.

