From Black Box to Profit Box
With Murakkab, organizations can transform agentic AI from a costly experiment into an optimized portfolio.
AI has entered a new phase. We’re no longer dealing with single “smart” models answering isolated questions. Instead, businesses are deploying agentic AI: applications that string together multiple models, tools, and data pipelines to tackle complex workflows such as video analysis, software code review, or multi-step customer support. These systems promise transformational efficiency. But there’s a major catch: the way they’re served today is clunky, rigid, and incredibly expensive.
At present, most organizations hard-wire the logic of their agentic applications. Developers specify which models to call, how to connect them, and where to run them. Orchestration frameworks help stitch these steps together, but from the cloud provider’s point of view, the whole system is just a sealed black box. That means the infrastructure layer (the GPUs, CPUs, and energy budget powering these workflows) can’t see what’s going on inside.
The result? Overprovisioned clusters, ballooning GPU bills, and wasted energy. Latency and accuracy targets (commonly referred to as service-level objectives, or SLOs) are routinely missed because no one is managing trade-offs across the entire workflow. Each team optimizes its own piece in isolation, while the business bears the cost of inefficiency.
Put in consulting language: firms are trying to run a supply chain where every supplier picks its own production methods without coordination, and the head office gets stuck with excess inventory, missed shipments, and rising logistics costs.
Enter Murakkab, the framework proposed in recently published research from Microsoft and MIT. Its central idea is straightforward yet powerful: separate the “what” from the “how.”
The “what”: Developers declare the logical workflow, essentially a blueprint of tasks and dependencies (e.g., extract video frames, transcribe audio, generate answers).
The “how”: Instead of baking in fixed model or hardware choices, Murakkab chooses these dynamically. It decides which models to invoke, how to parallelize them, and which hardware (A100s, H100s, CPUs) to run them on.
This is akin to how modern manufacturing distinguishes between a bill of materials and a factory floor plan. Product managers design the blueprint; plant managers decide which machines and shifts fulfill that blueprint at the lowest cost while meeting quality and delivery promises.
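To make the separation concrete, here is a minimal sketch (in Python) of what a declared workflow might look like. The `Task` and `Workflow` classes and the step names are illustrative assumptions, not Murakkab’s actual API; the point is that the developer specifies tasks and dependencies while leaving model and hardware choices unbound for the optimizer to fill in.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One logical step in the workflow: no model or hardware is bound here."""
    name: str
    depends_on: list = field(default_factory=list)  # names of upstream tasks

@dataclass
class Workflow:
    """The 'what': a blueprint of tasks and their dependencies."""
    name: str
    tasks: list

# Hypothetical video question-answering blueprint (mirrors the example above).
video_qa = Workflow(
    name="video_qa",
    tasks=[
        Task("extract_frames"),
        Task("transcribe_audio"),
        Task("generate_answer", depends_on=["extract_frames", "transcribe_audio"]),
    ],
)

# The 'how' (which model, how much parallelism, which GPU or CPU) is decided
# later by the optimizer, not hard-coded by the developer.
for task in video_qa.tasks:
    print(task.name, "<-", task.depends_on)
```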
To make this work, Murakkab applies three interlocking mechanisms:
- Profiles: Workflow profiles capture how changes in model size, token length, or execution order affect end-to-end quality, latency, and resource consumption. Model profiles record how each model performs on different hardware, including metrics like time-to-first-token, throughput, GPU/CPU utilization, cost, and energy use. Think of these as performance scorecards that make trade-offs visible to the optimizer.
- Optimization engine: A mixed-integer linear program (MILP) crunches these profiles to pick the optimal combination of models, parallelism, and hardware (a simplified sketch follows this list). The objective is simple: satisfy the SLO at the lowest possible cost or energy footprint. Strategically, this is like portfolio optimization—balancing risk, return, and constraints across multiple assets. Here, the “assets” are models and hardware, and the “returns” are accuracy and latency.
- Adaptive runtime: Murakkab doesn’t stop at a one-time plan. It periodically re-optimizes—adjusting allocations as demand spikes, workloads shift, or GPU inventory changes. It can also multiplex compatible workflows on the same hardware to squeeze out further efficiencies. This is the operational layer, the equivalent of a factory running continuous improvement cycles to keep production aligned with fluctuating demand and resource availability.
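To illustrate how profiles feed the optimizer, here is a simplified sketch using PuLP, an open-source Python MILP library. The candidate configurations, SLO thresholds, and numbers are assumptions for illustration only; Murakkab’s actual formulation jointly plans models, parallelism, and hardware across entire workflows and shared clusters, not a single step.

```python
# pip install pulp
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

# Hypothetical profile entries for one workflow step: each candidate is a
# (model, hardware, parallelism) configuration with measured accuracy,
# latency, and cost per request. Numbers are illustrative, not from the paper.
candidates = [
    {"name": "large_model_H100_tp2", "accuracy": 0.92, "latency_s": 1.8, "cost": 0.040},
    {"name": "large_model_A100_tp4", "accuracy": 0.92, "latency_s": 2.6, "cost": 0.034},
    {"name": "small_model_A100_tp1", "accuracy": 0.85, "latency_s": 1.1, "cost": 0.012},
    {"name": "small_model_CPU",      "accuracy": 0.85, "latency_s": 4.9, "cost": 0.006},
]

# Assumed SLO: at least "good"-tier accuracy and a latency bound.
MIN_ACCURACY = 0.90
MAX_LATENCY_S = 3.0

prob = LpProblem("pick_configuration", LpMinimize)

# One binary decision variable per candidate configuration.
choose = {c["name"]: LpVariable(f"choose_{c['name']}", cat=LpBinary) for c in candidates}

# Objective: minimize the cost of the chosen configuration.
prob += lpSum(choose[c["name"]] * c["cost"] for c in candidates)

# Exactly one configuration is selected for this step.
prob += lpSum(choose.values()) == 1

# SLO constraints: the selection must meet the accuracy and latency targets.
prob += lpSum(choose[c["name"]] * c["accuracy"] for c in candidates) >= MIN_ACCURACY
prob += lpSum(choose[c["name"]] * c["latency_s"] for c in candidates) <= MAX_LATENCY_S

prob.solve()
picked = [c["name"] for c in candidates if choose[c["name"]].value() == 1]
print("selected configuration:", picked)
```

With these illustrative numbers, the solver rejects the small models (accuracy below the target) and then picks the cheaper of the two configurations that satisfy both constraints—exactly the kind of trade-off a developer wiring choices by hand would have to guess at.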
Having defined the problem and outlined the framework, the researchers put Murakkab through its paces. They didn’t test it on toy examples or carefully curated benchmarks. Instead, they chose three real-world classes of agentic applications that reflect the diversity of demands companies are already grappling with: video question answering, software code generation, and math problem solving. Each of these workloads puts different pressure on an AI serving system: highly varied input sizes, multi-step dependencies, and fluctuating compute intensity.
The selection of these workloads was intentional. Video Q&A represents the growing category of multimodal use cases, where text, audio, and visuals must be stitched together seamlessly. Code generation reflects the mission-critical nature of enterprise productivity and the demand for accuracy over speed. Math Q&A highlights structured problem-solving chains that must preserve logical consistency. Taken together, these applications mirror the complex, heterogeneous mix of tasks that large organizations are beginning to push through AI agents.
Rather than running one-off tests, the researchers grounded their experiments in production-scale traces. In other words, they modeled workloads as they would appear in the real world—bursty, unpredictable, and dependent on a variety of models and steps. Murakkab was then compared against baseline approaches where workflows were served in a fixed, pre-determined manner.
The comparison was structured around three variants:
- A static baseline—mimicking today’s industry-standard approach of hard-coded orchestration.
- An optimized mode, where Murakkab tailored execution for each workflow in isolation.
- An optimized plus multiplexing mode, where Murakkab layered in the ability to share hardware across multiple workflows when possible.
This setup allowed the team to see not only whether Murakkab improved performance, but also whether added sophistication (like multiplexing) yielded incremental gains.
The findings showed a consistent pattern. Murakkab delivered the same or better levels of accuracy and responsiveness while dramatically lowering the underlying resource consumption. Importantly, this was achieved without sacrificing the user-facing promises: the quality of answers or the speed of response.
The optimized mode already demonstrated meaningful savings by selecting smarter model and hardware configurations. The multiplexing mode extended these gains further—underscoring the value of treating the portfolio of workflows as a whole rather than as isolated silos.
The key insight here is not the specific numbers, but the principle: when workflows are decoupled from their execution environment and guided by an optimizer, the system can find efficiencies that would be invisible to a developer wiring things by hand.
To ensure rigor, the researchers didn’t just focus on cost savings. They established clear success metrics aligned to the priorities of real-world operators:
- SLOs were central. These were framed either around accuracy tiers (e.g., best, good, fair) or latency thresholds (e.g., responding within a target time at a given percentile); a minimal sketch of such an SLO spec follows this list. Murakkab’s first duty was always to meet these targets.
- Efficiency metrics were then layered on top—covering GPU utilization, energy consumption, and total infrastructure cost. These were the measures of operational excellence once the SLOs were satisfied.
- Adaptivity was also evaluated. The system needed to show it could handle shifts in workload demand or resource availability by re-optimizing periodically. Experiments included scenarios where the mix of hardware or the intensity of incoming requests changed—testing Murakkab’s ability to adapt without letting service levels slip.
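As a rough illustration of how such SLOs might be expressed and checked, here is a minimal sketch. The tier names, thresholds, field names, and `meets_slo` helper are hypothetical rather than the paper’s formulation; they simply show the two flavors of target described above: an accuracy tier or a latency bound at a given percentile.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List, Optional

# Hypothetical accuracy tiers; names and thresholds are illustrative only.
ACCURACY_TIERS = {"best": 0.92, "good": 0.88, "fair": 0.80}

@dataclass
class SLO:
    accuracy_tier: Optional[str] = None       # e.g., "good"
    latency_percentile: Optional[int] = None   # e.g., 95 for p95
    latency_target_s: Optional[float] = None   # e.g., 3.0 seconds

def meets_slo(slo: SLO, accuracies: List[float], latencies_s: List[float]) -> bool:
    """Check observed quality and latency against the declared SLO."""
    if slo.accuracy_tier is not None:
        if sum(accuracies) / len(accuracies) < ACCURACY_TIERS[slo.accuracy_tier]:
            return False
    if slo.latency_percentile is not None and slo.latency_target_s is not None:
        # Approximate the requested percentile from observed latencies.
        cut = quantiles(latencies_s, n=100)[slo.latency_percentile - 1]
        if cut > slo.latency_target_s:
            return False
    return True

# Example: "good"-tier accuracy plus a p95 latency bound of 3 seconds.
slo = SLO(accuracy_tier="good", latency_percentile=95, latency_target_s=3.0)
print(meets_slo(slo, accuracies=[0.90, 0.89, 0.91], latencies_s=[1.2, 2.0, 2.8, 1.5, 2.4]))
```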
This evaluation framework mirrors how an executive team would assess any new operating model: does it meet customer-facing promises, does it lower unit costs, and does it hold up under volatility? By proving itself against all three, Murakkab demonstrated not just a clever technical trick, but a viable foundation for real-world deployment.
When evaluating a new system like Murakkab, success isn’t about abstract benchmarks; it’s about whether the framework can consistently deliver on the outcomes that matter most to both end users and operators. The research team approached this with a pragmatic lens—asking the same questions an executive team might: Are promises to the customer being met? Are the economics improving? And is the solution resilient when the environment shifts?
The first yardstick was adherence to SLOs. These objectives weren’t fuzzy aspirations but concrete guarantees, such as hitting a target response time or maintaining a defined tier of answer quality. Murakkab was judged on its ability to consistently deliver these outcomes, no matter what adjustments it made behind the scenes.
The second measure was operational efficiency. Here the focus shifted to the inputs: how much compute, energy, and cost were required to maintain those SLOs. Efficiency gains are meaningful only if they come without degrading the customer experience. By embedding efficiency as a secondary (but non-negotiable) criterion, the evaluation mirrored how real enterprises weigh performance versus cost.
The third dimension was adaptivity under change. Business environments rarely stay stable: demand spikes, hardware availability fluctuates, and workloads evolve. Murakkab was tested for its ability to re-optimize in the face of such churn—rebalancing workloads and reallocating hardware while keeping service promises intact. This aspect of evaluation positioned Murakkab not just as a point solution, but as a system capable of operating under continuous uncertainty.
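A minimal sketch of what such a re-optimization loop could look like, under the assumption of a fixed cadence plus change detection. The helper names (`get_demand`, `get_available_hardware`, `solve_plan`, `apply_plan`) are placeholders standing in for telemetry and the MILP sketched earlier, not Murakkab’s actual interfaces.

```python
import time

REOPT_INTERVAL_S = 60  # illustrative cadence between re-optimizations

def get_demand():
    """Return current request rates per workflow (stubbed here)."""
    return {"video_qa": 120, "code_gen": 45, "math_qa": 30}

def get_available_hardware():
    """Return free accelerators by type (stubbed here)."""
    return {"H100": 8, "A100": 24, "CPU": 64}

def solve_plan(demand, hardware):
    """Stand-in for the MILP: map each workflow to a configuration and replica count."""
    return {wf: {"config": "small_model_A100_tp1", "replicas": max(1, rate // 50)}
            for wf, rate in demand.items()}

def apply_plan(plan):
    print("applying plan:", plan)

def control_loop(iterations=3):
    last_plan = None
    for _ in range(iterations):
        demand, hardware = get_demand(), get_available_hardware()
        plan = solve_plan(demand, hardware)
        if plan != last_plan:   # only reconfigure when the optimum actually moves
            apply_plan(plan)
            last_plan = plan
        time.sleep(0)           # in practice: time.sleep(REOPT_INTERVAL_S)

control_loop()
```

The design point the loop illustrates is the one L34-style volatility demands: service promises stay fixed, while the mapping of workflows to models and hardware is allowed to change underneath them.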
Even with these strengths, the research did not shy away from limitations. The first is profile freshness. Murakkab relies on performance profiles of models and workflows to make optimization decisions. If those profiles drift out of date (as models update, hardware drivers change, or workloads shift), the optimizer may make suboptimal choices. Keeping these profiles accurate becomes an ongoing operational responsibility.
Another limitation is scope of testing. The experiments demonstrated impact on three distinct workflows, but the universe of agentic applications is far larger. While the underlying principles are broadly applicable, extending Murakkab’s coverage requires additional profiling and integration work for each new workflow type.
Finally, the data foundation itself poses limits. The experiments were run against production-scale traces adapted from one cloud provider. While realistic, they may not capture the full variability of other industries, geographies, or customer mixes. Broader validation will be necessary to prove Murakkab’s universality.
Looking forward, the research points to several future directions. Scaling across larger and more heterogeneous clusters will be critical as organizations mix and match different accelerator types. Developing specialized optimizers for emerging agent types will extend applicability beyond the initial set of workloads. And deeper cross-workflow optimization—treating the portfolio of agents not as independent silos but as an interdependent system—offers even greater potential for efficiency.
Stepping back, the overall impact of this approach is clear. Murakkab reframes the way organizations should think about serving agentic AI. Instead of treating each workflow as a one-off technical project, it invites leaders to manage their AI estate as a portfolio of services, with explicit quality and latency tiers, optimized at the portfolio level.
For strategy, this creates flexibility (decisions about which models or hardware to use become dynamic levers rather than fixed commitments). For tactics, it delivers savings—dramatically reducing the unit cost of serving AI without hurting customer-facing outcomes. And for operations, it provides resilience, a system that continuously adapts to volatility in demand and supply.
In short, Murakkab is not just a technical framework; it is a blueprint for running agentic AI as a disciplined, optimizable business function. That shift could mark the difference between AI pilots that remain costly experiments and AI platforms that deliver sustained, enterprise-wide value.
Further Reading
- Chaudhry, G. I., Choukse, E., Qiu, H., Goiri, Í., Fonseca, R., Belay, A., & Bianchini, R. (2025, August 22). Murakkab: resource-efficient agentic workflow orchestration in cloud platforms. arXiv.org. https://arxiv.org/abs/2508.18298
- Mallari, M. (2025, August 24). Charting a new course. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/charting-a-new-course/