A Break-Down of Research in Artificial Intelligence

Game of Models

A new benchmark called Orak tests LLMs in real-world video games to evaluate decision-making, planning, and adaptability.

If you’ve ever played a video game and thought, “This character is kind of dumb,” you’re not alone. Whether it’s an enemy walking off a cliff or a teammate standing motionless during a firefight, these in-game characters (called non-player characters or NPCs) often feel lifeless or mechanical. Behind the scenes, most NPCs still rely on pre-scripted behaviors (if you do X, the game responds with Y). But as game worlds grow more complex (from sandbox-style cities to real-time battlefields), game developers and tech companies are turning to something more powerful: Large Language Models (LLMs).

LLMs (like ChatGPT, Claude, or Gemini) can “think” through dialogue, plan multi-step strategies, and even adapt to unpredictable situations. They’re being used across industries, not just gaming. Retailers use them to handle customer service workflows. Banks use them to monitor risk. Logistics companies use them to manage supply chains. In all these cases, we’re relying on LLM “agents” to make decisions in fluid, messy environments (not just answer trivia questions). But here’s the catch: we don’t have a good way to measure how well these LLM agents actually perform in the real world.

That’s the problem the researchers of Orak (from NVIDIA and UW-Madison) set out to solve. These researchers argue that current benchmarks for LLMs are too simplistic or unrealistic. Most existing tests put LLMs in either abstract environments (like board games or text puzzles) or ask them to generate static content (like code or emails). But real-world decision-making—whether in a business workflow or a video game—requires juggling goals, adapting to change, and working with incomplete information. The Orak team believes that modern video games, with their rich visuals, unpredictable elements, and real-time constraints, are the perfect proxy to test how “agentic” an LLM really is.

To do this, the team built Orak, a framework designed to test LLM agents across 12 commercially available video games. These span a wide range of genres—from puzzle games like 2048, to fast-paced action games like Street Fighter III, to long-term strategy games like StarCraft II. Each game brings its own challenges: quick reflexes, strategic planning, storytelling, or resource management. The idea is that by performing well across all these games, an LLM demonstrates general decision-making skill that might transfer to real-world applications.

To connect these games to the LLMs, the team built a plug-and-play interface based on the Model Context Protocol (MCP). Think of this interface as a translator between the game and the LLM. It feeds the LLM information about the game state—like the character’s location, the enemy’s moves, or the current score—and then waits for the LLM to decide what action to take next. This setup mimics how AI agents operate in many business settings: they observe, reason, and act in loops.
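For the technically curious, that loop is simple to picture. Here is a rough Python sketch of an observe-reason-act cycle; the env object, query_llm helper, and final_score method are placeholders for illustration, not Orak’s actual MCP interface.

# A minimal sketch of the observe-reason-act loop described above.
# "env", "query_llm", and "final_score" are hypothetical stand-ins, not Orak's real API.
def run_episode(env, query_llm, max_steps=200):
    observation = env.reset()  # e.g. character position, enemy moves, current score
    for _ in range(max_steps):
        prompt = f"Game state:\n{observation}\n\nChoose the next action."
        action = query_llm(prompt)             # the LLM reasons over the state
        observation, done = env.step(action)   # the game applies the action, returns the new state
        if done:
            break
    return env.final_score()  # hypothetical helper returning the game's own success metric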

But the team didn’t stop there. They also added optional “agentic modules” that LLMs could use, such as memory (to recall what just happened), reflection (to critique past decisions), and planning (to map out future actions). This let the researchers test whether layering in more cognitive structure actually helps performance—or, in some cases, gets in the way.
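In code, you can think of these modules as optional ingredients folded into the prompt before each decision. The sketch below is illustrative only; the method names (summary, critique, outline) are hypothetical, not the benchmark’s actual components.

# Illustrative: optional agentic modules layered into the decision prompt.
def build_prompt(observation, memory=None, reflector=None, planner=None):
    parts = [f"Current game state:\n{observation}"]
    if memory is not None:
        parts.append("Recent history:\n" + memory.summary())         # recall what just happened
    if reflector is not None:
        parts.append("Critique of the last move:\n" + reflector.critique())  # self-reflection
    if planner is not None:
        parts.append("Plan so far:\n" + planner.outline())           # map out future actions
    parts.append("Given all of the above, choose the next action.")
    return "\n\n".join(parts)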

Finally, they created a massive dataset of “expert gameplay trajectories” by using top-performing LLMs to play the games and explain their reasoning step-by-step. These logs are like coaching tapes: other, smaller models can study them to learn how to play better through fine-tuning.
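To make that concrete, a single step in such a trajectory might be stored roughly like the record below. The field names and values here are invented for illustration; the real dataset’s schema may differ.

# One hypothetical record from an expert gameplay trajectory.
expert_step = {
    "game": "StarCraft II",
    "state": "Minerals: 350, supply 14/15, enemy scouted expanding...",
    "reasoning": "Supply is nearly capped, so build a supply depot before training more units.",
    "action": "build(SupplyDepot)",
}
# For fine-tuning, the state becomes the model's input and the reasoning plus
# action become the target output, so a smaller model learns to imitate the expert.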

Altogether, Orak isn’t just about games—it’s about creating a high-stakes, high-fidelity proving ground for LLM agents. It gives researchers, developers, and companies a standardized way to test how capable these agents really are when faced with long-term, high-pressure decisions.

Once the Orak framework was up and running, the team set out to put it to the test. They selected twelve of today’s most widely used LLMs, including both proprietary models like GPT-4 and Claude, and open-source alternatives like LLaMA and Mistral. Each model was evaluated across all twelve games in the benchmark, and the research team ran dozens of experiments to see how these agents fared under different circumstances.

What makes these experiments especially insightful is that the researchers didn’t just look at whether an agent could “win” a game. They tested the models in different configurations—sometimes with added memory or planning abilities, sometimes stripped down to the basics. This let them isolate what kinds of agentic reasoning actually helped, and where it might get in the way.

Interestingly, they discovered that different models benefit from different kinds of support. Some high-end models performed best when left to their own devices, handling gameplay end-to-end without relying on extra tools like memory or planning modules. For those models, adding more structure sometimes led to overthinking or delays in decision-making. But mid-range models saw noticeable improvements when given access to these agentic add-ons. The planning module, for example, helped certain models map out steps to achieve goals across longer time horizons—especially useful in games that reward strategic foresight, like StarCraft II or Darkest Dungeon.

They also ran experiments that varied how the game’s environment was represented to the models. In some cases, the LLMs were given text descriptions of what was happening on screen. In others, they received visual input in the form of images. This part of the research showed clear differences in how each model handled multimodal inputs. Some LLMs excelled when given structured, text-based game states, but struggled to interpret raw visual information. Others proved more flexible, able to process images and make decisions in games with more complex visuals. This suggests that not all “multimodal” AI is created equal, and different use cases may require different types of model strengths.
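As a rough illustration of the two setups, the same game moment might be handed to a model in either of the following forms; the values and file name here are made up.

# Hypothetical text-based and image-based views of the same game moment.
text_state = (
    "Player HP: 62/100. Enemy: Ryu, crouching, two tiles away. Round timer: 41s."
)   # structured text description any LLM can read
screenshot_path = "frames/round1_step42.png"   # raw screenshot for a vision-capable model
# A text-only model sees only text_state; a multimodal model can also be shown
# the screenshot, which is where the differences described above appear.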

Beyond just testing the models, the team also looked at what happens when you teach them to play better. They used their gameplay dataset—essentially a library of expert moves and decision-making sequences—to fine-tune smaller models. After training on these expert examples, the fine-tuned models significantly improved their performance, even in unfamiliar scenarios. This is important because it shows that gameplay knowledge is transferable. Once an agent learns how to succeed in one situation, that knowledge can generalize to new levels or even new versions of a game.

To measure success, Orak relied on a combination of in-game performance metrics and comparative analysis. Each game had its own benchmarks—score, level completion, win/loss outcomes—depending on the genre and mechanics. These weren’t arbitrary; they reflected the same standards a human player would use to judge success. In some games, LLMs competed directly against one another in head-to-head matches. In others, performance was measured in isolation and then ranked across all models.
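One simple way to picture the ranking step: average each model’s per-game scores and sort. The sketch below uses made-up numbers and an assumed normalization (higher is better); it is not Orak’s exact scoring formula.

# Sketch: combine per-game scores into an overall leaderboard.
def rank_models(results):
    # results: {model_name: {game_name: normalized_score}}, higher is better
    averages = {
        model: sum(scores.values()) / len(scores)
        for model, scores in results.items()
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

leaderboard = rank_models({
    "model_a": {"2048": 0.8, "StarCraft II": 0.4},
    "model_b": {"2048": 0.6, "StarCraft II": 0.7},
})  # model_b ranks first on average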

Crucially, the evaluation also included tests for robustness. Models were pushed into unfamiliar game levels or faced new character setups they hadn’t seen during training. This kind of “out-of-distribution” testing was used to mimic real-world business conditions, where an AI agent can’t just memorize past scenarios—it has to adapt on the fly. If a model performed well in these new contexts, it was considered more generalizable and trustworthy.

Altogether, the experiments didn’t just test for high scores—they evaluated flexibility, reasoning, learning, and adaptability. That’s what makes Orak a meaningful benchmark: it goes beyond performance in isolated tasks and instead asks, “Can this agent operate well under uncertainty, over time, and with real consequences?”

What made Orak especially compelling wasn’t just the games or the performance metrics—it was how the team thought about what success really means for an LLM agent. While traditional benchmarks might stop at measuring whether a model gets the answer “right,” Orak leaned into a more nuanced view: success is about consistent decision-making, adaptability to change, and robustness under pressure.

One way they explored this was through ablation studies. In simple terms, the researchers systematically removed certain features—like the memory module or the planning module—to see what changed. If performance stayed the same, it suggested those tools were unnecessary. But if performance dropped, it indicated that the removed feature added real value. This let the team understand not just what worked, but why it worked—and for whom. For instance, they discovered that tools designed to help weaker models improve could sometimes hinder the performance of stronger models by slowing them down or introducing unnecessary complexity.
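A stripped-down version of that procedure looks like the sketch below: run the same evaluation with every module switched on, then with one module removed at a time, and compare. The evaluate callable is a hypothetical helper, not the paper’s actual harness.

# Sketch of an ablation study over the optional agentic modules.
def ablation(evaluate, modules=("memory", "reflection", "planning")):
    results = {"full": evaluate(enabled=set(modules))}   # all modules on
    for module in modules:
        remaining = set(modules) - {module}              # switch one module off
        results[f"without_{module}"] = evaluate(enabled=remaining)
    return results
# If the score barely moves without a module, it added little; if it drops,
# that module was doing real work for this particular model.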

Another important piece of the evaluation was generalization. In business terms, this is the difference between an employee who can follow instructions for one task and one who can handle new, unfamiliar situations with sound judgment. Orak tested this by throwing models into scenarios they hadn’t seen before—new game levels, different characters, altered objectives. Models that could still perform well under those conditions were considered far more capable and versatile.

Yet for all its innovation, Orak isn’t without limitations. For one, it doesn’t yet simulate real-time constraints. In a game like Street Fighter or StarCraft, timing is everything—but Orak currently pauses the game while the LLM thinks. That’s helpful for fairness and analysis, but it leaves out a critical dimension: speed. In real-world applications—whether it’s a chatbot managing a refund or an AI helping route emergency responders—delays can break the experience. So future iterations of Orak will need to include latency-aware evaluations that judge both the quality and speed of decisions.

Another limitation is that Orak doesn’t yet support audio inputs. In many game and business settings, sound plays a huge role—alerts, dialogue cues, background noise. Training AI agents to handle audio alongside text and images is a frontier the Orak team is already eyeing. Similarly, Orak doesn’t yet incorporate reinforcement learning, a method where agents learn from trial and error, which could unlock more adaptive behaviors in future benchmarks.

Despite these constraints, the long-term impact of Orak is significant. For one, it gives companies and researchers a common ground to evaluate LLM agents. Whether you’re building a customer service assistant, a strategy planner, or an in-game companion, Orak offers a realistic test bed to measure how your agent stacks up—using scenarios that reflect real-world complexity, not just lab-controlled tasks.

Second, Orak helps lower the barrier to entry for innovation. The inclusion of a fine-tuning dataset—thousands of expert gameplay trajectories—means that even smaller teams or open-source developers can train competitive agents without needing to build a dataset from scratch. It creates a more level playing field, where performance is based on decision quality, not just compute budgets.

And perhaps most importantly, Orak pushes the conversation beyond narrow definitions of AI success. It encourages us to think about agents not as single-purpose tools, but as adaptive systems navigating dynamic environments. That’s true whether the environment is a battlefield, a call center, or a digital storefront. By setting a new bar for what it means to be “intelligent” in interactive, unpredictable spaces, Orak lays the groundwork for the next generation of LLM-powered agents—ones that can think, plan, and act with real-world stakes in mind.

