Oops, All Procedures!
SOP-Bench sets a new standard for evaluating whether AI agents can reliably execute long-form SOPs in enterprise settings.
Imagine you’re running a company that relies on teams—human or digital—to follow detailed step-by-step procedures. Whether it’s assembling a car, processing a loan application, or inspecting a jet engine, there’s a defined way things must be done. In most industries, these instructions are captured in what are known as Standard Operating Procedures, or SOPs. They ensure consistency, safety, compliance, and quality across complex workflows. SOPs are everywhere, from manufacturing floors to hospitals to customer service centers. Now imagine you’re trying to automate parts of these workflows using artificial intelligence—specifically, large language models (LLMs) like GPT-4 or Claude. You’d expect these advanced systems to follow SOPs just like a well-trained employee might.
But here’s the problem: they can’t. Not reliably, at least.
While LLMs have made enormous strides in generating text, writing code, summarizing documents, and even making plans, they tend to falter when asked to execute multi-step processes that demand strict adherence to detailed rules, conditional logic (“if this, then that”), and tool use. Most AI benchmarks today test simple tasks like asking an LLM to call a calculator or retrieve a fact. But SOPs in the real world are longer, more complex, and often filled with ambiguity. They require an agent to read instructions, decide which tool to use, follow a particular sequence of actions, handle exceptions, and know when it’s completed the task successfully. That’s a much taller order—and it’s where today’s AI agents stumble.
To address this gap, a team of researchers built something called SOP-Bench (Nandi et al., 2025; see Further Readings). It’s not a product or an AI model. It’s a benchmark, which you can think of as a standardized test for evaluating whether AI systems can follow realistic procedures. But this isn’t your typical multiple-choice test. SOP-Bench simulates what it’s like to be dropped into a real company, given a dense operating manual, and asked to use the company’s internal software tools to complete a job. The tasks span industries, from aircraft maintenance to email triage, and mimic real SOPs in their length, conditional complexity, and dependency on external tools.
Creating this benchmark wasn’t as simple as copying real documents. Industrial SOPs are often proprietary and confidential. So the research team built a synthetic data generation pipeline using LLMs themselves. In short, they used powerful AI models to help write realistic SOPs modeled after the structure and ambiguity of real-world examples. Each synthetic SOP includes not just instructions but a list of “tools” (mock APIs), and a way to check whether an AI agent completed the task correctly. This human-in-the-loop system helped ensure the data was both realistic and high-quality.
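To make that concrete, here is a minimal sketch of what a single benchmark record could look like, assuming a simple Python representation; the field names are illustrative, not the benchmark’s actual schema.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of one SOP-Bench-style task record.
# Field names are illustrative assumptions, not the benchmark's real schema.

@dataclass
class ToolSpec:
    name: str                           # e.g. "check_inventory"
    description: str                    # natural-language description shown to the agent
    handler: Callable[..., dict]        # mock implementation that returns API-like output

@dataclass
class SOPTask:
    sop_text: str                       # the long-form procedure, written in natural language
    tools: list[ToolSpec]               # mock APIs the agent is allowed to call
    test_case: Callable[[dict], bool]   # human-validated check on the agent's final output
```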
To test the benchmark, the researchers evaluated two common AI agent architectures: one based on function-calling (where the AI outputs structured commands for tools), and another called ReAct, which alternates between reasoning and acting step-by-step. Both agents were tasked with executing the SOPs using only the tools provided. The idea was to test not just if they could understand the instructions, but whether they could follow through accurately, use the right tools, and handle multi-step logic without going off track.
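The difference between the two styles is easier to see side by side. The snippets below are illustrative assumptions about the output formats, not the exact prompts or schemas used in the paper.

```python
# 1) Function-calling: the model emits a structured command for the runtime to execute.
function_call = {
    "tool": "check_inventory",
    "arguments": {"part_number": "PN-4821"},   # part number invented for illustration
}

# 2) ReAct: the model interleaves free-text reasoning with actions and observations.
react_trace = """
Thought: The SOP says to validate the part number before ordering.
Action: check_inventory[PN-4821]
Observation: {"in_stock": true, "quantity": 12}
Thought: The part is in stock, so I can proceed to the next step.
"""
```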
In short, SOP-Bench puts today’s most powerful AI agents in the role of an enterprise intern with access to SOP manuals and digital tools—and then asks whether they can do the job. The early results offer a sobering look at where the limits still are.
To understand whether today’s AI agents are truly ready for real-world task automation, the research team didn’t stop at building a benchmark—they put the agents to work. The experiments were structured to mirror what would happen if a digital worker were handed an SOP and a list of software tools, then asked to complete a job end-to-end. The goal wasn’t just to see whether the agents could understand instructions, but whether they could navigate the kinds of nuanced decisions and tool interactions that human workers make every day without thinking twice.
Each test scenario came with three components: a detailed SOP written in natural language, a set of available tools that the agent could use to complete the task, and a test case to validate the final output. These tools weren’t just buttons to press—they represented real API-like functions with inputs and outputs, mimicking the digital systems companies already rely on. For example, if the SOP instructed the agent to “validate the part number in the inventory system,” it had to call a tool that simulated an actual inventory check, not just return a canned answer.
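A mock tool in that spirit might look like the following sketch, where the inventory contents and return format are invented for illustration.

```python
# A minimal mock tool in the spirit of the benchmark's simulated APIs.
# Inventory contents and the return format are invented for this example.

_FAKE_INVENTORY = {"PN-4821": 12, "PN-1103": 0}

def check_inventory(part_number: str) -> dict:
    """Simulate an inventory lookup rather than returning a canned answer."""
    if part_number not in _FAKE_INVENTORY:
        return {"found": False, "error": f"unknown part {part_number}"}
    quantity = _FAKE_INVENTORY[part_number]
    return {"found": True, "in_stock": quantity > 0, "quantity": quantity}
```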
The researchers tested two AI agent architectures across hundreds of these SOP tasks, each varying in complexity, domain, and workflow logic. In one architecture, the AI decided which tool to use and emitted its request as a structured function call. The other took a more flexible, human-like approach in which the agent reasoned through each step, reflected on outcomes, and adjusted its actions as needed. These two strategies were chosen because they represent leading paradigms in how AI agents are currently being designed for enterprise use.
The agents didn’t just get a single try at each task. Much like a human worker, they were allowed to iterate: making decisions, calling tools, checking results, and continuing until they either completed the job or got stuck. This gave the researchers a window into how well the agents could reason through branching logic, recover from missteps, and maintain context across multiple steps—a crucial part of following any SOP correctly.
Evaluating success wasn’t a matter of gut instinct or subjective judgment. Each SOP task had one or more human-validated test cases—clear, predefined conditions that had to be met for the agent’s output to be considered correct. This could involve comparing the final output to an expected result, verifying that the correct tool was used at the right time, or checking whether a specific conditional path was followed as dictated by the SOP. The idea was to bring the same kind of rigor to evaluating AI performance that a company might use to audit human compliance with standard procedures.
Beyond just checking right or wrong answers, the researchers also examined how the agents approached the tasks. Did they understand the underlying logic of the SOP? Were they using tools unnecessarily or skipping steps? Did they make errors in sequencing? These deeper behavioral insights helped reveal the limitations not just in understanding, but in execution—exactly the kind of breakdowns that would matter in real business settings.
What became clear from the experiments was that following a workflow on paper and following it in practice are two very different challenges for today’s AI systems. It’s not enough for an agent to summarize an SOP accurately or answer quiz-style questions about it. The real test is whether it can apply those instructions dynamically and reliably within a tool-rich environment—and that’s where cracks in current AI designs start to show.
The research team’s evaluation framework was designed to be more than just a pass/fail checklist. Yes, they used clear test cases to determine whether each AI agent completed a task correctly. But success was also measured through a broader lens: Did the agent use the right tools in the right sequence? Did it follow the SOP faithfully, including handling branches, conditions, and exceptions? And just as important—when it failed, how did it fail?
In practice, this meant examining not only the final result, but the chain of decisions that led there. Each step the agent took—each tool it called, each reasoning step it expressed—was logged and reviewed. This approach allowed the researchers to see if the agent simply misunderstood the SOP, if it chose an incorrect tool when several options were available, or if it stopped too early without completing all the required steps. These insights helped identify where in the workflow things broke down, which in turn pointed to the limits of the current agent architectures.
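That kind of review depends on capturing a structured trace of every step. A minimal logging sketch, with record fields that are assumptions rather than the paper’s actual format, might look like this:

```python
import json
import time

def log_step(trace: list, step_type: str, payload: dict) -> None:
    """Append one timestamped record (reasoning, tool call, or observation) to the trace."""
    trace.append({"t": time.time(), "type": step_type, **payload})

trace: list[dict] = []
log_step(trace, "reasoning", {"text": "SOP step 3 requires an inventory check."})
log_step(trace, "tool_call", {"tool": "check_inventory", "args": {"part_number": "PN-4821"}})
log_step(trace, "observation", {"result": {"in_stock": True, "quantity": 12}})

print(json.dumps(trace, indent=2))  # reviewers can replay the full decision chain
```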
For example, when given too many tool choices, agents frequently chose the wrong one—even when the correct option was explicitly available. In other cases, agents skipped key conditional steps or got caught in loops, re-checking the same information without progressing. These weren’t trivial mistakes—they mirrored the kinds of execution errors that, in the real world, can lead to safety failures, customer issues, or regulatory breaches.
Even with the robust structure of SOP-Bench, the researchers are clear-eyed about its current limitations. The entire benchmark is built on synthetic SOPs and simulated tools. That was a deliberate choice—real-world SOPs are often proprietary, and live tool environments are risky to test at scale. But it also means the benchmark may not capture the full messiness of enterprise environments: incomplete documentation, outdated systems, flaky APIs, and human fallbacks that occur outside of official procedures.
Another limitation lies in scope. SOP-Bench covers a broad set of industries, but it’s still just a slice of the complex universe of workflows that businesses rely on every day. It also evaluates just two agent designs. As new architectures and memory systems evolve, they’ll need to be tested within this or similar frameworks to understand whether they truly improve performance in these demanding tasks.
That said, the creation of SOP-Bench is a turning point in how we think about deploying AI in real workflows. Rather than focusing on short-form tasks or isolated tool calls, it raises the bar for what it means for an AI agent to be truly useful in operational settings. It’s one thing for a language model to draft an email or summarize a report. It’s quite another to rely on it to execute an end-to-end process that affects customers, compliance, or safety.
In terms of impact, this benchmark lays the groundwork for more reliable, testable automation. Enterprises exploring LLM-based agents now have a clearer framework for asking: Can this system really handle the workflows we care about? And more importantly—where does it fall short?
Looking forward, the authors invite the broader AI and enterprise communities to expand SOP-Bench by contributing their own domains, tools, and SOP scenarios. The hope is that with broader participation, this work will help steer the next generation of AI systems toward something more than just general-purpose intelligence: something closer to operational reliability. Until then, SOP-Bench serves as both a reality check and a roadmap for what AI agents still need to learn to become truly useful in the world’s most demanding environments.
Further Readings
- Mallari, M. (2025, June 11). Step by step, stuck by stuck. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/step-by-step-stuck-by-stuck/
- Nandi, S., Datta, A., Vichare, N., Bhattacharya, I., Raja, H., Xu, J., Ray, S., Carenini, G., Srivastava, A., Chan, A., Woo, M. H., Kandola, A., Theresa, B., & Carbone, F. (2025, June 9). SOP-Bench: Complex industrial SOPs for evaluating LLM agents. arXiv. https://arxiv.org/abs/2506.08119