A Break-Down of Research in Artificial Intelligence

Inter-Purchase Intervals: When More Context Misses the Point

Why inter-purchase interval prediction favors precision models over language models, and what “good enough” timing really means.

A recently published AI research paper from Walmart Global Tech tackles a deceptively simple but commercially critical question: when will a customer buy the same thing again? More precisely, it focuses on predicting the number of days until a user’s next purchase within a given product category (often called inter-purchase or repurchase interval prediction).

This problem sits at the heart of many modern digital businesses. Grocery platforms want to know when to remind you to buy milk again. Pet retailers want to ship food before you run out, but not so early that it piles up. Pharmacies want refill reminders to arrive at the right moment to support adherence without overwhelming patients. In all these cases, getting the timing wrong has real costs: lost revenue, customer churn, wasted inventory, or even health and safety risks.

Historically, companies have relied on statistical averages or machine learning (ML) models trained on structured behavioral data to solve this timing problem. More recently, however, there has been growing interest in whether large language models (LLMs), which appear to reason well over sequences, patterns, and context, could handle this task more flexibly. The intuition is appealing: if LLMs can “understand” user behavior in narrative form, perhaps they can infer when the next purchase should happen, even with limited or messy data.

The research paper puts that intuition to the test.

Rather than proposing a new algorithm, the researchers designed a controlled evaluation framework to measure how well LLMs can perform quantitative time-to-event prediction, and how that performance changes as more contextual information is added. The core task is framed simply: given a history of past purchase intervals for a user within a product category, the model must output a single number, the predicted number of days until the next purchase.
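
To make that task contract concrete, here is a minimal sketch in Python of what one prediction instance looks like. The field names, values, and the placeholder mean-based predictor are illustrative only; they are not taken from the paper.

    # Illustrative framing of one prediction instance; names and values are made up.
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class RepurchaseExample:
        user_id: str
        category: str
        past_intervals_days: list[int]  # days between consecutive past purchases
        target_interval_days: int       # days until the next purchase (the label)

    example = RepurchaseExample(
        user_id="u_123",
        category="coffee",
        past_intervals_days=[14, 12, 15, 13, 14],
        target_interval_days=13,
    )

    # Any model (heuristic, trained regressor, or LLM) must map the history to one number.
    naive_prediction = mean(example.past_intervals_days)
    print(round(naive_prediction, 1))  # 13.6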

The key methodological idea is to vary how much context the model receives, while holding the prediction task constant. The researchers define three prompt regimes (a code sketch of how such prompts might be assembled follows the list):

  1. In the zero-context setup, the model only sees past purchase intervals, forcing it to infer patterns purely from numbers.
  2. In the medium-context setup, the prompt adds lightweight, structured information such as product metadata or summary statistics (for example, averages or recent trends).
  3. In the high-context setup, the prompt includes richer behavioral descriptions, such as recency signals or user lifecycle information, mimicking the kind of detailed context businesses often believe will help LLM reasoning.
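
As a rough illustration of how the same purchase history might be wrapped with increasing context, here is a sketch of the three regimes. The wording, product metadata, and behavioral details are invented for this example; the paper’s exact prompts are not reproduced here.

    # Sketch of the three context regimes around one interval history.
    # All product metadata and behavioral details below are invented.
    from statistics import mean

    intervals = [14, 12, 15, 13, 14]   # past inter-purchase intervals, in days
    category = "ground coffee"          # hypothetical product metadata
    days_since_last = 10                # hypothetical recency signal

    instruction = "Predict the number of days until the next purchase. Answer with one number."

    zero_context = f"Past purchase intervals in days: {intervals}. {instruction}"

    medium_context = (
        f"Category: {category}. Past purchase intervals in days: {intervals}. "
        f"Average interval: {mean(intervals):.1f} days. {instruction}"
    )

    high_context = (
        f"Category: {category}. Past purchase intervals in days: {intervals}. "
        f"Average interval: {mean(intervals):.1f} days. The user last purchased "
        f"{days_since_last} days ago, shops mostly on weekends, and has been a "
        f"loyal customer for three years. {instruction}"
    )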

Importantly, all LLMs are evaluated in a zero-shot setting. There is no fine-tuning, no example-based prompting, and no task-specific training. This isolates what the models can do “out of the box,” which is often how they are first deployed in real systems.

To ground the results, the researchers compare LLM performance against traditional baselines: simple statistical heuristics (like using the historical median) and fully trained ML regressors that operate directly on structured features. This side-by-side design makes it possible to assess not just whether LLMs work, but when and why they fall short—or succeed—at predicting time itself.
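
To make the comparison concrete, here is a sketch of the two baseline families on synthetic data. The feature set, the gradient-boosted regressor, and the in-sample evaluation are assumptions chosen for illustration, not the paper’s configuration.

    # Sketch of the two baseline families, with synthetic data standing in for
    # real purchase histories. Feature choices and model are illustrative only.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)

    def make_history(n_intervals: int) -> np.ndarray:
        """Synthetic inter-purchase intervals centered on a 14-day cycle."""
        return np.clip(rng.normal(14, 3, n_intervals).round(), 1, None)

    histories = [make_history(8) for _ in range(500)]
    targets = np.array([h[-1] for h in histories])   # last interval as the label
    pasts = [h[:-1] for h in histories]              # earlier intervals as input

    # Baseline 1: statistical heuristic (historical median of past intervals).
    median_preds = np.array([np.median(p) for p in pasts])

    # Baseline 2: ML regressor over simple structured features of the history.
    features = np.array([[p.mean(), np.median(p), p.std(), p[-1], len(p)] for p in pasts])
    model = GradientBoostingRegressor(random_state=0).fit(features, targets)
    ml_preds = model.predict(features)

    print("median MAE:", np.abs(median_preds - targets).mean())
    print("GBR MAE:   ", np.abs(ml_preds - targets).mean())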

To understand how well LLMs handle repurchase timing, the researchers ran a series of comparative experiments designed to mirror realistic commercial settings. Rather than testing on random or dummy data, they used two real-world e-commerce datasets: one proprietary grocery dataset and one publicly available grocery dataset. Both captured repeated purchases by the same users across product categories—making them well suited to studying short-term replenishment behavior.

The experimental setup was intentionally consistent across models. For each user–category pair, the system was given a sequence of past purchase intervals and asked to predict the next one. This prediction was repeated across thousands of cases—allowing the researchers to compare performance patterns rather than isolated anecdotes.

Crucially, the study evaluated three different model families side by side:

  1. Simple statistical baselines, such as using the historical average or median time between purchases: these reflect the kinds of heuristics many businesses still rely on.
  2. Traditional ML regressors trained on structured features derived from purchase history: these models represent the current “best practice” for numeric prediction tasks in production systems.
  3. Multiple frontier LLMs: each tested under the same prompt designs but without any fine-tuning or examples.

The most striking result was not that one model won, but how performance shifted depending on the level of context. Compared to statistical heuristics, LLMs consistently performed better—indicating that they can extract meaningful temporal patterns from historical data. However, when compared to trained ML models, LLMs fell short (sometimes by a wide margin) on precise numeric accuracy.

Context played a nuanced role. Adding a modest amount of structured context generally helped LLMs improve their predictions. But when the prompts became rich with behavioral detail and narrative signals, performance often degraded. In other words, more information did not reliably lead to better predictions. In several cases, the models appeared to lose focus on the core temporal signal, producing noisier or less stable outputs. This directly challenged the common assumption that LLMs benefit monotonically from additional context.

The evaluation framework reflects how businesses actually judge success. Standard regression metrics such as error magnitude were used to assess raw accuracy, but the researchers also introduced tolerance-based measures that ask a more practical question: Was the prediction close enough to be useful? For many operational decisions (such as sending a reminder, triggering a reorder, bundling shipments), being within a day or two of the correct timing is often sufficient.
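
A minimal sketch of this metric pairing: standard error magnitude (here, mean absolute error) alongside a tolerance-based hit rate. The ±2-day tolerance is an illustrative choice, not necessarily the threshold used in the paper.

    # Sketch of an accuracy metric pair: MAE plus a tolerance-based hit rate.
    import numpy as np

    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Mean absolute error in days."""
        return float(np.abs(y_true - y_pred).mean())

    def within_tolerance(y_true: np.ndarray, y_pred: np.ndarray, days: int = 2) -> float:
        """Fraction of predictions landing within +/- `days` of the true interval."""
        return float((np.abs(y_true - y_pred) <= days).mean())

    y_true = np.array([13, 14, 30, 7, 15])
    y_pred = np.array([14, 12, 21, 8, 15])

    print("MAE:", mae(y_true, y_pred))                              # 2.6
    print("hit rate (+/- 2 days):", within_tolerance(y_true, y_pred))  # 0.8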

In addition to accuracy, the researchers evaluated latency and cost considerations, acknowledging that real-world deployment requires predictions to be fast and affordable at scale. Even when two models perform similarly on accuracy, differences in response time or inference cost can determine whether a solution is viable in production.
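
A back-of-envelope sketch of why this matters at scale; every token count, price, and latency figure below is a placeholder assumption, not a number reported in the paper.

    # Back-of-envelope cost and latency comparison; all figures are assumptions.
    predictions_per_day = 1_000_000

    # Hypothetical LLM call: ~400 prompt tokens + ~10 output tokens per prediction.
    llm_cost_per_1k_tokens = 0.002   # USD, assumed
    llm_tokens_per_call = 410
    llm_daily_cost = predictions_per_day * llm_tokens_per_call / 1000 * llm_cost_per_1k_tokens
    llm_latency_s = 0.8              # assumed per-call latency

    # Hypothetical trained regressor served in-house.
    ml_daily_cost = 5.0              # USD, assumed amortized serving cost
    ml_latency_s = 0.002

    print(f"LLM: ~${llm_daily_cost:,.0f}/day at ~{llm_latency_s * 1000:.0f} ms per call")
    print(f"ML:  ~${ml_daily_cost:,.0f}/day at ~{ml_latency_s * 1000:.0f} ms per call")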

Taken together, the experiments show that LLMs occupy an interesting middle ground. They outperform naive heuristics and show promise on approximate timing tasks, but they struggle to match the reliability of purpose-built ML models when precision matters. The way success is measured (exact correctness versus “good enough” timing) ultimately shapes how attractive these models appear for real business use.

A key contribution of the research lies in how success and failure are defined. Rather than treating repurchase prediction as a purely academic forecasting problem, the researchers align evaluation with operational decision-making. Success is not just about minimizing numerical error; it is about whether a prediction supports the right action at the right time. This distinction matters because many business interventions (such as reminder notifications, replenishment prompts, or shipment scheduling) operate within flexible windows rather than on exact dates.

To reflect this reality, the evaluation framework emphasizes tolerance-based outcomes alongside traditional error measures. Predictions are judged on whether they fall within an acceptable range around the true repurchase date, mirroring how businesses think in terms of “too early,” “too late,” or “close enough.” This framing reveals an important insight: some models that look weak under strict numeric metrics can still be operationally useful, while others that optimize for precision may deliver limited marginal value once a practical threshold is met.

The study also treats cost and responsiveness as first-class evaluation dimensions. A model that produces marginally better predictions but is slow or expensive to run may fail in real deployment scenarios. By explicitly considering latency and inference cost, the research highlights a constraint often overlooked in academic benchmarks but central to production systems that must generate millions of predictions per day.

Despite these strengths, the paper is careful about its limitations. Most notably, the experiments are conducted in a zero-shot setting. This choice provides a clean comparison of out-of-the-box capabilities but likely understates what LLMs could achieve with task-specific calibration, example-based prompting, or lightweight fine-tuning. The scope of the data also matters: the focus is on short repurchase cycles with relatively dense purchase histories. Scenarios involving long gaps, sparse signals, or entirely new products may behave differently.

Another limitation lies in how context is represented. The study shows that excessive context can harm performance, but this does not mean all rich information is inherently harmful. It suggests that how context is structured and prioritized may matter more than how much is provided. The prompt designs tested represent only a subset of possible ways to encode behavioral information.

These constraints point directly to future directions. One promising path is hybrid systems that separate concerns: using traditional models to generate reliable numeric estimates while leveraging LLMs for tasks they excel at, such as interpreting unstructured data, handling edge cases, or explaining recommendations to users. Another direction involves learning how to compress context into decision-relevant signals rather than verbose narratives.
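
A minimal sketch of that hybrid split, assuming a hypothetical call_llm helper: a conventional model owns the numeric estimate, and the LLM is only asked to produce the customer-facing explanation around it.

    # Sketch of a hybrid pipeline: trained/statistical model for the number,
    # LLM for the surrounding explanation. `call_llm` is a hypothetical stub.
    from statistics import median

    def numeric_estimate(past_intervals: list[int]) -> float:
        """Stand-in for a trained regressor; here simply the historical median."""
        return float(median(past_intervals))

    def call_llm(prompt: str) -> str:
        """Placeholder for whichever LLM client a real deployment would use."""
        return f"[LLM response to: {prompt}]"

    def reminder_message(category: str, past_intervals: list[int]) -> str:
        days = numeric_estimate(past_intervals)   # precise timing: handled by the ML side
        prompt = (                                # wording and edge cases: handled by the LLM
            f"Write a one-sentence friendly reminder that the customer's {category} "
            f"usually runs out after about {days:.0f} days."
        )
        return call_llm(prompt)

    print(reminder_message("dog food", [28, 30, 27, 29]))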

The broader impact of the work is conceptual. It challenges the assumption that general-purpose language models are a drop-in replacement for specialized predictive systems, especially when the task involves precise quantitative timing. At the same time, it reframes LLMs as potentially valuable components within a larger decision pipeline. By clarifying where these models succeed, where they fail, and how success should be measured, the research provides a more grounded foundation for deploying AI in time-sensitive, customer-facing applications.

