We All Scream for Ice Cream
ICECREAM helps data teams move beyond isolated feature attribution by revealing robust, interpretable patterns.
There’s a paradox at the heart of modern machine learning. As our models get more accurate, their explanations often get more opaque. High-performing black boxes can predict with stunning precision. But when business leaders or domain experts ask why the model made a particular decision, the answers are vague at best, misleading at worst.
This problem isn’t just academic. In real-world, high-stakes industries (whether you’re dealing with patient diagnoses, fraud detection, or injury risk in athletes), decision-makers can’t afford to trust models they don’t understand. They don’t just want to know what the model thinks. They need to know why it thinks that.
Most traditional explanation tools (like SHAP values or feature importance scores) try to solve this. But they have a fundamental blind spot: they look at individual features in isolation. They assign credit or blame to each input one at a time—assuming each factor contributes independently to the outcome. That assumption might hold in textbook cases. But in the real world, outcomes often arise from interactions among multiple features, none of which would look significant on their own.
Let’s take an example from biomechanics. Suppose an athlete’s risk of injury spikes only when their shoulder rotation is slightly delayed and their stride length shortens and they’ve exceeded a certain workload. No single factor causes the problem. But together, they do. Traditional explanation methods would downplay or entirely miss this because no one variable is “important” by itself.
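To make that intuition concrete, here is a minimal Python sketch (synthetic data and invented feature names, not an example from the paper or the case study) in which each factor on its own only modestly raises the observed injury rate, while the three-way combination determines it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical binary risk factors (1 = condition present), sampled independently.
shoulder_delay = rng.integers(0, 2, n)   # shoulder rotation slightly delayed
short_stride   = rng.integers(0, 2, n)   # stride length shortened
high_workload  = rng.integers(0, 2, n)   # workload threshold exceeded

# Toy ground truth: injury occurs only when ALL three conditions co-occur.
injury = shoulder_delay & short_stride & high_workload

print(f"baseline injury rate: {injury.mean():.2f}")  # roughly 0.12
for name, factor in [("shoulder_delay", shoulder_delay),
                     ("short_stride", short_stride),
                     ("high_workload", high_workload)]:
    # Conditioning on any single factor only modestly raises the observed rate.
    print(f"injury rate given {name} present: {injury[factor == 1].mean():.2f}")  # roughly 0.25

# Conditioning on the full three-feature coalition raises it to certainty.
all_three = (shoulder_delay & short_stride & high_workload) == 1
print(f"injury rate given all three present: {injury[all_three].mean():.2f}")  # 1.00
```

A per-feature attribution computed on data like this would tend to split the credit three ways; a coalition-based explanation names the combination itself.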
This is the blind spot the ICECREAM framework was built to fix.
From Single-Factor Explanations to Multi-Feature Coalitions
ICECREAM (short for Identifying Coalition-based Explanations for Common and Rare Events in Any Model) is a new framework that rethinks how we explain machine learning predictions. Instead of treating input features like independent actors in a play, ICECREAM looks for coalitions: sets of features that work together to influence a model’s output.
In other words, ICECREAM doesn’t just ask, “How much does feature X contribute to this prediction?” It asks, “What groups of features are working together to drive this outcome, and how robust is that explanation?”
At its core, the framework works by systematically identifying feature subsets (coalitions) whose combined absence or alteration meaningfully changes the model’s prediction. It builds on the idea of perturbation-based explanations (changing input values and observing the effect), but does so in a structured, stability-enhancing way. This allows ICECREAM to surface interactions that would otherwise be hidden, and to avoid spurious or overly fragile explanations that could be misleading.
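The paper formalizes coalition influence with an information-theoretic measure; the snippet below is only a simplified, brute-force sketch of the perturbation idea described above, not the authors’ algorithm. The function name, parameters, and ranking-by-mean-shift heuristic are illustrative assumptions, and it presumes a NumPy-style `model_predict` callable plus a background dataset to resample from:

```python
from itertools import combinations
import numpy as np

def coalition_effects(model_predict, x, background, max_size=3, n_samples=200, seed=0):
    """Score every feature subset up to max_size by how much replacing its values
    (with draws from background data) shifts the model's prediction for x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    base = float(model_predict(x[None, :])[0])  # prediction for the unperturbed instance
    effects = {}
    for size in range(1, max_size + 1):
        for coalition in combinations(range(x.shape[0]), size):
            cols = list(coalition)
            # Resample the coalition's features from the background data,
            # keeping every other feature fixed at its observed value.
            rows = rng.integers(0, background.shape[0], n_samples)
            perturbed = np.repeat(x[None, :], n_samples, axis=0)
            perturbed[:, cols] = background[np.ix_(rows, cols)]
            effects[coalition] = abs(float(model_predict(perturbed).mean()) - base)
    # Coalitions whose alteration moves the output most are candidate explanations.
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
```

A real implementation would prune the search and score coalitions with the paper’s formal measure rather than a raw mean shift, but the basic loop (perturb a subset, observe the effect, rank the subsets) is the same.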
ICECREAM also introduces a robustness layer. Not all coalitions are created equal. Some may only affect the model’s output under very specific, fragile conditions. ICECREAM filters these out by selecting only the coalitions whose explanatory power persists across multiple perturbations. That’s a big deal, especially when decisions are being made under pressure and need to hold up across edge cases.
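Building on the `coalition_effects` sketch above, a crude version of that robustness filter might keep only the coalitions that stay near the top of the ranking across several independently seeded perturbation runs (again an illustration of the idea, not the paper’s actual criterion):

```python
def robust_coalitions(model_predict, x, background, runs=5, top_k=5, **kwargs):
    """Retain only coalitions that appear in the top-k ranking of every run,
    so fragile, one-off explanations get filtered out."""
    kept = None
    for seed in range(runs):
        ranked = coalition_effects(model_predict, x, background, seed=seed, **kwargs)
        top = {coalition for coalition, _ in ranked[:top_k]}
        kept = top if kept is None else kept & top
    return kept
```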
Another key strength: ICECREAM explanations are instance-specific. That means instead of trying to describe global model behavior (which can be incredibly complex and full of exceptions), it focuses on explaining a single decision for a single case. That’s precisely the level of explanation most business users and operators need when they’re dealing with a flagged transaction, a rejected loan, or a high-injury-risk athlete.
The framework doesn’t require a special type of model; it’s model-agnostic and can be applied to any black-box predictor. That makes it immediately useful in environments where high-performance models are already in production and replacing them isn’t an option.
In essence, ICECREAM gives machine learning systems a way to express how context matters. It stops asking, “Which factor matters most?” and starts asking, “What combination of factors explains this?” For any organization that relies on high-stakes, high-dimensional predictions, that shift is more than technical; it’s transformational.
Testing for More Than Just Accuracy
Once the ICECREAM framework was developed, its creators faced a crucial test: could this new explanation method not only reveal feature interactions that existing tools miss, but also do so in a way that people could trust and use?
To find out, they designed a series of experiments to evaluate ICECREAM across multiple axes … not just whether the explanations looked convincing on paper, but whether they actually made a difference in real-world-like settings. The goal wasn’t to chase another percentage point of predictive accuracy. The goal was to assess whether ICECREAM could surface the kinds of insights that stakeholders care about when making high-consequence decisions.
ICECREAM isn’t just about confirming what was already known; it’s also about providing a structured, repeatable way to expose deeper patterns—offering domain experts a way to validate their intuition and discover new interactions they hadn’t seen before.
But identifying patterns is one thing. Evaluating how reliable those patterns are is another.
Setting a Higher Bar for Explanation Quality
To measure ICECREAM’s success, the researchers didn’t just rely on whether its explanations sounded plausible. They set out to evaluate them against three criteria that matter most to real-world adoption:
- Stability: If an explanation changes dramatically when the input changes only slightly, it’s not very useful in practice. ICECREAM explanations were tested for robustness to small perturbations—ensuring that they remained consistent even when the input data shifted a bit. This property is critical in noisy domains like healthcare or biomechanics, where perfect measurements are rare. (A simple way to quantify this kind of stability is sketched after this list.)
- Actionability: The research team evaluated whether the coalitions identified by ICECREAM could be used to guide decisions in the domain. For example, in the sports injury use case, ICECREAM revealed actionable biomechanical signatures that could inform load management or technique adjustments. The evaluation asked not just “Does this make sense?” but “Can someone do something with this insight?”
- Alignment with expert intuition: Perhaps the most important test of all: do the ICECREAM explanations line up with what human experts already believe (and know) about the system being modeled? In evaluations involving domain specialists (like sports scientists and biomechanical analysts), ICECREAM consistently surfaced feature combinations that matched real-world understanding. And when it diverged, it offered a rationale that experts could follow, debate, and often validate through further investigation.
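On the stability criterion in particular, one simple, hypothetical way to quantify consistency (not the paper’s evaluation protocol) is to add small noise to the input and compare the top-ranked coalitions before and after, for instance via their Jaccard overlap. The sketch below reuses the `coalition_effects` helper from earlier and assumes continuous features:

```python
import numpy as np

def explanation_stability(model_predict, x, background,
                          noise_scale=0.01, trials=10, top_k=5, seed=0):
    """Average Jaccard overlap between the top-k coalitions for x and for lightly
    noised copies of x; values near 1.0 indicate a stable explanation."""
    rng = np.random.default_rng(seed)
    reference = {c for c, _ in coalition_effects(model_predict, x, background)[:top_k]}
    per_feature_scale = noise_scale * background.std(axis=0)  # noise relative to feature spread
    overlaps = []
    for _ in range(trials):
        noisy_x = x + rng.normal(0.0, per_feature_scale, size=x.shape)
        noisy_top = {c for c, _ in coalition_effects(model_predict, noisy_x, background)[:top_k]}
        overlaps.append(len(reference & noisy_top) / len(reference | noisy_top))
    return float(np.mean(overlaps))
```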
This holistic approach to evaluation marked a departure from the narrow benchmarks used in many explainability studies. ICECREAM wasn’t just built to produce prettier charts. It was built to earn trust … not through hand-waving or black-box magic, but by making complex predictions comprehensible and credible to the people who actually have to act on them.
In doing so, the research laid the groundwork for explanation methods that don’t just live in academic papers but thrive in operational workflows. Because ultimately, success for a system like ICECREAM isn’t about clever math; it’s about whether it helps someone make a smarter, safer, or faster decision when it counts.
Pushing Beyond Precision: What “Success” Really Looks Like
In most technical circles, it’s easy to equate a successful model with an accurate one. But in high-stakes decision-making (where outcomes affect real people, not just data points), accuracy is only part of the story. The ICECREAM framework was evaluated on a more demanding, human-centered criterion: can this explanation method help users make better decisions, faster, and with greater confidence?
To answer this, the research focused on measuring explanation utility, a more nuanced concept than correctness alone. Explanation utility isn’t just about whether a feature or coalition was “truly” important (a hard question when dealing with black-box models). It’s about whether the insights produced are stable, actionable, and understandable—and whether they actually lead to changes in behavior.
The researchers introduced a novel evaluation strategy called decision impact testing, where ICECREAM’s explanations were embedded into simulated decision workflows. In these experiments, human users were shown two explanations side-by-side (one from ICECREAM and another from a popular existing method) and asked to make or justify a decision based on each. The result? Users consistently rated ICECREAM’s coalitions as more trustworthy and useful. Just as important, the explanations influenced decisions in ways that were more aligned with domain-specific best practices.
That last point reveals something essential: the quality of an explanation isn’t just about the math behind it. It’s about how well it connects with how humans actually reason about problems. That’s why ICECREAM’s focus on coalitions (multi-feature patterns instead of isolated signals) felt more natural and intuitive to users across contexts.
And this emphasis on user trust turned out to be more than a nice-to-have. It was the core of how ICECREAM succeeded.
Where the Model Shines—And Where It Doesn’t (Yet)
Of course, no framework is flawless, and the researchers behind ICECREAM were careful not to overpromise. The method introduces computational overhead, since it must evaluate many feature combinations and test their robustness. In time-sensitive or resource-constrained environments, this could limit its use without further optimization or hardware acceleration.
There’s also the risk of combinatorial explosion: as the number of features in a model grows, the number of possible coalitions increases exponentially. ICECREAM includes strategies for managing this, such as focusing on small, highly impactful coalitions, but scalability will remain an open challenge as use cases get more complex.
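The arithmetic behind that concern is easy to see: with d features there are 2^d - 1 non-empty coalitions, so capping the coalition size at some small k is what keeps the search tractable. A quick illustrative count:

```python
from math import comb

def coalition_count(d, max_size=None):
    """Number of non-empty coalitions among d features, optionally
    restricted to coalitions of at most max_size features."""
    max_size = d if max_size is None else max_size
    return sum(comb(d, k) for k in range(1, max_size + 1))

for d in (10, 20, 40):
    print(f"d={d}: all coalitions = {coalition_count(d):,}, "
          f"size <= 3 only = {coalition_count(d, 3):,}")
# d=10: 1,023 vs 175;  d=20: 1,048,575 vs 1,350;  d=40: ~1.1e12 vs 10,700
```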
Another limitation lies in domain alignment. ICECREAM assumes that useful coalitions are discoverable in the data, but that’s not always true. In poorly labeled or sparsely sampled datasets, the coalitions it surfaces might still be misleading or spurious. The method is only as good as the patterns it can see.
Yet these are limitations of maturity, not direction. ICECREAM doesn’t fail by overstepping its claims; it succeeds by framing a better goal for explanation systems in the first place. It repositions interpretability from a mechanical add-on to a strategic imperative, especially in environments where the cost of misunderstanding a model could mean a missed diagnosis, a failed athlete recovery, or a multi-million-dollar decision made in error.
What This Means for Business, Operations, and Strategy
The broader impact of ICECREAM is philosophical as much as technical. It represents a shift away from the belief that explanations must be singular or simplistic. In reality, complex systems (be they athletes, financial markets, or supply chains) rarely break down for one reason alone. They tip, pivot, and collapse through sets of interdependent factors, often subtle and invisible until it’s too late.
ICECREAM gives decision-makers the tools to see those interdependencies in action. It transforms model transparency from a compliance box to check into a source of competitive clarity. It equips practitioners not just to ask, “What is happening?” but “What combination of things is making this happen … and how can we intervene?”
That’s not just a better explanation. That’s a strategic advantage.
Further Readings
- Mallari, M. (2023, July 21). The sweet science of not getting hurt. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/the-sweet-science-of-not-getting-hurt/
- Oesterle, M., Blöbaum, P., Mastakouri, A. A., & Kirschbaum, E. (2023, July 19). Beyond Single-Feature Importance with ICECREAM. arXiv.org. https://arxiv.org/abs/2307.09779