The Fast and the Spurious
SceneDiffuser++ tackles the challenge of simulating full-length urban trips with dynamic traffic, agent behavior, and real-time scene generation.
Imagine trying to test a self-driving car on a cross-city trip—something as simple as driving from your home in Boston’s Back Bay to Logan Airport. You’d want the car to encounter intersections, changing traffic lights, pedestrians stepping into crosswalks, parked delivery trucks, and maybe even a lane closure near South Station. But now imagine trying to recreate all that digitally, inside a virtual simulation.
That’s where the real challenge begins.
Today’s best simulations for autonomous vehicles are surprisingly narrow. Many of them focus on short, pre-recorded driving “events”—say, 10 seconds of a car turning left or yielding to a pedestrian. These snapshots are useful for specific testing, but they don’t reflect what it means to drive through a living, breathing city. They don’t simulate how the world evolves beyond those few seconds. More critically, they don’t generate new, unseen situations that the vehicle might actually encounter on the road.
This is the problem the paper “SceneDiffuser++” aims to solve: building realistic, scalable, long-horizon driving simulations that behave like actual cities, not just isolated moments. In other words, instead of replaying past traffic scenes, it builds entirely new ones—block by block, minute by minute, for an entire drive.
The challenge is threefold.
First, there’s dynamic scene management. As the vehicle moves from one part of the map to another, new vehicles and pedestrians must appear and disappear at the right time and place. But how do you decide when and where to add them? Hardcoding rules doesn’t scale, especially in complex environments.
Second, there’s the matter of occlusion and visibility. Sometimes, an object is technically present but hidden behind a bus or a building. Traditional simulators can’t gracefully handle this, resulting in scenes that look either too empty or too cluttered.
Third, you need to simulate the environment’s own behavior—particularly traffic lights. As you drive through new intersections, the simulator has to decide whether a signal is red or green and how that might change based on time and surrounding traffic.
SceneDiffuser++ tackles these problems with a single, unified AI model trained end-to-end. The core method is a diffusion model, a type of generative AI that starts from random noise and refines it, step by step, into a realistic output. If you’ve seen how tools like DALL·E generate images from noise, you’re already familiar with the basic concept. Here, instead of images, the model generates full traffic scenes.
To bring this to life, SceneDiffuser++ uses a clever representation of the world: everything in the scene—cars, cyclists, pedestrians, traffic lights—is encoded into a shared digital space, like a multidimensional spreadsheet, updated frame by frame. This “scene tensor” includes not just the location and speed of every object, but also a flag marking whether that object exists at each moment in time. That single addition makes it possible to simulate not just motion, but appearance, disappearance, and even occlusion, all with one model.
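For readers who want to see the idea in code, here is a minimal Python sketch of a scene tensor with a validity channel being refined from noise. The array shape, feature layout, denoising schedule, and the `dummy_denoiser` stand-in are all assumptions made for illustration, not the paper’s actual architecture.

```python
import numpy as np

# Illustrative scene tensor: [num_agents, num_timesteps, num_features].
# Assumed features for this sketch: x, y, heading, speed, validity.
NUM_AGENTS, NUM_STEPS, NUM_FEATURES = 64, 80, 5
VALIDITY = 4          # index of the validity flag in the feature axis
TOTAL_STEPS = 50      # number of denoising iterations (toy value)

def dummy_denoiser(scene, t):
    """Stand-in for the learned denoiser; a real model would be a trained
    neural network conditioned on the map and the diffusion timestep."""
    return scene * 0.9

def denoise_step(scene, t, model):
    """One reverse-diffusion step: blend the model's cleaner prediction
    with the current noisy scene according to a toy schedule."""
    predicted_clean = model(scene, t)
    alpha = 1.0 - t / TOTAL_STEPS
    return alpha * predicted_clean + (1.0 - alpha) * scene

def generate_scene(model, rng=np.random.default_rng(0)):
    # Start from pure noise and refine it, step by step, into a scene.
    scene = rng.normal(size=(NUM_AGENTS, NUM_STEPS, NUM_FEATURES))
    for t in reversed(range(TOTAL_STEPS)):
        scene = denoise_step(scene, t, model)
    # Threshold the validity channel: an agent below the threshold simply
    # does not exist at that timestep, so insertion, removal, and occlusion
    # fall out of the same generative pass.
    scene[..., VALIDITY] = (scene[..., VALIDITY] > 0.5).astype(float)
    return scene

scene = generate_scene(dummy_denoiser)
print(scene.shape)  # (64, 80, 5)
```

The point of the sketch is the validity channel: because presence is just another value the model denoises, agents entering or leaving the scene need no separate placement system.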
Unlike older methods that required separate systems for each of these tasks—one for placing agents, one for animating them, one for simulating lights—SceneDiffuser++ does it all together. It’s trained on real-world driving data, but it doesn’t just copy it. It learns the patterns and uses them to generate new, plausible driving scenarios, without needing explicit instructions at every turn.
This approach makes the simulation flexible, scalable, and far more realistic over the long term. The result is a city-scale “world model” that can be used to test self-driving systems as if they were truly out on the road, encountering real people, real decisions, and real consequences.
To understand how well SceneDiffuser++ works, the researchers didn’t rely on vague benchmarks or artificial tests. Instead, they put it through rigorous simulations modeled on real-world driving conditions, using extended versions of actual urban maps. These weren’t cherry-picked intersections or quiet cul-de-sacs—they were large, dynamic environments where hundreds of vehicles, pedestrians, and traffic rules all come into play at once.
The test environment was built by augmenting a widely used dataset from Waymo, one of the leaders in autonomous driving. This dataset contains real-world driving logs captured from urban and suburban areas. But for SceneDiffuser++, the researchers went a step further: they expanded the geographic coverage to allow for full-length, trip-level simulations, not just short segments. This made it possible to simulate continuous driving over longer distances, which is where most other models fall apart.
The real breakthrough came in how the model handled unpredictability—not just how traffic behaves in a moment, but how it unfolds over time. SceneDiffuser++ was tested for its ability to generate new agents—such as vehicles entering from a side street or a pedestrian stepping off a curb—at the right moment and in the right location, all without scripting. This wasn’t just about putting objects on a map; it was about making those objects behave as they would in the real world, given the context of what’s happening around them.
It was also challenged to simulate how the environment itself evolves. For example, when the simulated vehicle approached an intersection, SceneDiffuser++ had to predict whether the traffic light would change from green to red in a way that felt natural and synchronized with surrounding traffic patterns. These small but vital transitions play a big role in whether a simulation feels real or falls apart under scrutiny.
So how did the researchers know the model was working?
Rather than relying on subjective judgments like “does this look realistic?”, they introduced a set of quantitative, distribution-based metrics that compare the simulated world to actual driving data. One key metric involved tracking how many agents (cars, bikes, pedestrians) were active in the scene at any given time and how many entered or exited as the vehicle moved along its route. The goal was not just to match the average number of vehicles, but to mirror the flow and turnover you’d see in the real world.
Another metric measured speed profiles—whether the traffic flowed smoothly and naturally, or if the simulation caused vehicles to stop, start, or behave erratically. These speed patterns are critical because they affect how a self-driving vehicle perceives and reacts to others. Too many unrealistic stops or bizarre lane changes, and the model could become a liability in real-world testing.
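To make the idea of a distribution-based metric concrete, here is a small sketch that compares histograms of per-frame agent counts (the same code works for speeds) between logged and simulated data, using Jensen-Shannon distance. The choice of statistic, the binning, and the placeholder data are assumptions for illustration; the paper defines its own metrics.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def histogram_distance(real_values, sim_values, bins=20):
    """Bin two samples (e.g., per-frame agent counts, or agent speeds) over
    a shared range and measure how far apart the distributions are.
    0 means identical; larger means the simulation drifts from reality."""
    lo = min(real_values.min(), sim_values.min())
    hi = max(real_values.max(), sim_values.max())
    real_hist, _ = np.histogram(real_values, bins=bins, range=(lo, hi), density=True)
    sim_hist, _ = np.histogram(sim_values, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(real_hist, sim_hist)

# Placeholder data: per-frame counts of active agents along a route.
rng = np.random.default_rng(0)
real_counts = rng.poisson(lam=30, size=1000)
sim_counts = rng.poisson(lam=33, size=1000)
print("agent-count mismatch:", round(float(histogram_distance(real_counts, sim_counts)), 3))
```

The same comparison, run on entry/exit counts or speed profiles instead of raw counts, captures the “flow and turnover” idea described above.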
Perhaps most impressively, the researchers evaluated how accurately the model predicted traffic light behavior over time. Rather than coding fixed light cycles, SceneDiffuser++ learned from data how lights typically change—based on time of day, road layout, and traffic density—and applied those patterns in new, unseen situations. When tested, the transitions between red, yellow, and green closely reflected the statistical behavior found in real cities.
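One hedged way to picture such a check: estimate an empirical transition matrix over light states from logged data and compare it with the matrix measured in simulation. The three-state model, the placeholder sequences, and the simple absolute-difference summary below are simplifying assumptions, not the paper’s evaluation code.

```python
import numpy as np

STATES = ["red", "yellow", "green"]
INDEX = {s: i for i, s in enumerate(STATES)}

def transition_matrix(state_sequence):
    """Estimate P(next state | current state) from a per-timestep
    sequence of traffic light states."""
    counts = np.zeros((len(STATES), len(STATES)))
    for cur, nxt in zip(state_sequence, state_sequence[1:]):
        counts[INDEX[cur], INDEX[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)  # avoid divide-by-zero

# Placeholder sequences; in practice these come from driving logs and rollouts.
logged    = ["green"] * 40 + ["yellow"] * 4 + ["red"] * 30 + ["green"] * 20
simulated = ["green"] * 38 + ["yellow"] * 5 + ["red"] * 32 + ["green"] * 19

gap = np.abs(transition_matrix(logged) - transition_matrix(simulated)).sum()
print("total transition-probability gap:", round(float(gap), 3))
```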
Each of these metrics served as a diagnostic tool, identifying where the simulation performed well and where it needed refinement. By using real-world data as the gold standard, the researchers weren’t just chasing good-looking animations. They were building a testbed that autonomous vehicle developers could trust to expose edge cases, stress-test decision-making, and explore entire trip scenarios that may never have been recorded in the real world.
While the evaluation framework for SceneDiffuser++ was grounded in comparing distributions—how closely the simulator’s outputs mirrored real-world data—the research team also looked at the consequences of failure. In long simulations, even small errors can snowball. If an extra car appears in the wrong place or a traffic light mistimes its cycle, those inaccuracies can ripple through the scene, potentially throwing the entire simulation off course. To evaluate this, the researchers tracked cumulative side effects, like increased off-road behavior or collision rates, across the full duration of simulated trips.
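As a rough sketch of how those cumulative side effects might be tracked, the snippet below accumulates off-road and collision flags frame by frame over a rollout. The class and the per-frame flags are hypothetical; in a real pipeline the checks would come from map queries and bounding-box overlap tests inside the simulator.

```python
from dataclasses import dataclass

@dataclass
class RolloutDiagnostics:
    """Accumulates failure statistics across a long simulated trip so that
    small per-frame errors that snowball over time become visible."""
    frames: int = 0
    offroad_frames: int = 0
    collision_frames: int = 0

    def update(self, any_agent_offroad: bool, any_collision: bool) -> None:
        self.frames += 1
        self.offroad_frames += int(any_agent_offroad)
        self.collision_frames += int(any_collision)

    def rates(self) -> dict:
        # Fraction of simulated frames exhibiting each failure mode.
        n = max(self.frames, 1)
        return {"offroad_rate": self.offroad_frames / n,
                "collision_rate": self.collision_frames / n}

# Usage: inside the simulation loop, feed per-frame flags each step.
diagnostics = RolloutDiagnostics()
diagnostics.update(any_agent_offroad=False, any_collision=False)
diagnostics.update(any_agent_offroad=True, any_collision=False)
print(diagnostics.rates())
```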
One important test was how well the simulated agents avoided unrealistic or unsafe behavior. For instance, did vehicles drive off the road? Did they stop abruptly with no cause? Did they collide with other objects? These are not just glitches—they’re indicators of whether the simulation is structurally sound. A few errors here and there may seem minor, but when the goal is to test autonomous vehicle performance under high-stakes conditions, even rare mistakes can obscure real risks or create false positives.
Interestingly, the researchers found that while SceneDiffuser++ excelled in realism and adaptability, it sometimes introduced unintended interactions. For example, in its effort to dynamically insert new vehicles on the fly (as the simulated car moved through the city), the model could inadvertently place those vehicles in a way that blocked the ego vehicle’s path. This would increase the likelihood of a simulated crash or a forced maneuver off the road. While this might sound like a problem, it’s also an honest reflection of real-world traffic unpredictability. However, without a mechanism to let the simulated vehicle respond intelligently—such as replanning its route or adjusting its behavior—these kinds of conflicts can skew the results.
To address this, the research pointed toward a promising future direction: tighter integration between the world model and the vehicle’s planning system. In today’s setup, the simulation acts largely on its own, and the vehicle responds to it at set intervals. But if the vehicle and the world could communicate more frequently—essentially negotiating every few seconds—many of the side effects, like sudden collisions or off-road detours, could be minimized. That level of interplay would allow the simulation to reflect not just how the world changes, but how a vehicle might realistically adapt to it.
Another notable limitation involved static objects, like parked cars. The model was slightly too eager to populate parking lots, which could lead to crowded or implausible scenes. While this didn’t significantly affect the driving experience, it signaled a need for more nuanced spatial logic: not just when to place an object, but whether it should be there at all, given the context.
Despite these issues, the broader impact of SceneDiffuser++ is significant. It marks a shift from short-term, event-based simulations to long-term, city-scale world modeling. Instead of scripting thousands of scenarios by hand or relying solely on historical driving data, this model can generate entirely new, plausible trips that feel real enough to train and evaluate autonomous systems.
That matters, because simulation is fast becoming a cornerstone of autonomous vehicle development. Regulators want proof of safety. Engineers want to test rare edge cases. And business leaders want faster development timelines with fewer road tests. SceneDiffuser++ supports all of those goals by creating richer, more adaptive, and more data-faithful virtual cities—ones that AVs can learn from and be challenged by.
In short, it’s not just about simulating a scene anymore. It’s about simulating the world itself.
Further Readings
- Mallari, M. (2025, June 29). Route awakening. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/route-awakening/
- Tan, S., Lambert, J., Jeon, H., Kulshrestha, S., Bai, Y., Luo, J., Anguelov, D., Tan, M., & Jiang, C. M. (2025, June 27). SceneDiffuser++: City-scale traffic simulation via a generative world model. arXiv. https://arxiv.org/abs/2506.21976