Taming the Wolves Inside the Machine
Understanding how AI language models weigh competing values like truth, politeness, and clarity (and why that matters for trustworthy deployment).
If you’ve worked with large language models (LLMs), like those powering today’s AI chatbots, customer service tools, or document automation, you’ve likely run into a curious problem. These systems seem smart, even polite. But ask them to walk the line between being honest and being diplomatic, or between offering concise help and avoiding offense, and something breaks. They might offer factually accurate advice that’s tone-deaf, or say something charming but vague. Worse still, it’s often hard to tell why they made those choices. That’s the heart of the problem tackled in a recent research paper from Harvard and Google DeepMind: how do we understand the hidden trade-offs language models are making when they respond to us?
Modern LLMs are trained to respond to human prompts in helpful, truthful, and harmless ways. But these goals can conflict. Consider a scenario: your friend bakes you a cake, and it’s objectively awful. Do you tell the truth and risk hurting their feelings, or soften the blow and risk being dishonest? Humans intuitively navigate such trade-offs, balancing honesty against kindness or clarity against brevity. But when LLMs do this, we have no clear lens to see how they’re weighing competing values.
This creates two major challenges. First, businesses can’t debug or audit model behavior in high-stakes contexts like healthcare, finance, or HR—where misjudging the balance between, say, legal precision and empathy can have real-world consequences. Second, AI development teams struggle to shape these trade-offs during training or fine-tuning, since they don’t have a clear map of the terrain. Even if a model seems polite, we don’t know whether that’s due to its underlying value structure or a surface-level trick.
That’s where this new research comes in.
The researchers turn to an unlikely source for answers: cognitive science. Specifically, they use a framework called the Rational Speech Acts (RSA) model, originally designed to explain how humans communicate with intent and nuance. RSA is grounded in a simple but powerful idea: when we speak, we’re not just transmitting facts—we’re choosing what to say based on our goals and our beliefs about how the listener will interpret us.
RSA models this as a kind of decision-making process. A speaker (human or model) weighs different “utilities”—like being informative, being socially considerate, or protecting their image. These are represented mathematically by parameters that indicate how much weight the speaker gives to each goal. By observing what a speaker says across a range of situations, you can work backward to infer these hidden priorities.
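To make that concrete, here is a minimal sketch of an RSA-style speaker in Python. Everything in it (the cake scenario, the five candidate utterances, the soft semantics, and the specific weights) is an illustrative assumption rather than the paper's actual model; the point is simply to show how two parameters, one for informativeness and one for social consideration, change what gets said.

```python
import numpy as np

# Illustrative sketch only: the states, utterances, semantics, and weights below are
# assumptions for demonstration, not the paper's actual model or numbers.
states = np.arange(1, 6)                      # true cake quality, 1 (awful) .. 5 (amazing)
utterances = ["terrible", "bad", "okay", "good", "amazing"]

# Soft literal semantics: how well each utterance (row) describes each state (column).
semantics = np.array([
    [0.95, 0.60, 0.10, 0.02, 0.01],   # "terrible"
    [0.60, 0.95, 0.40, 0.05, 0.02],   # "bad"
    [0.10, 0.40, 0.95, 0.40, 0.10],   # "okay"
    [0.02, 0.05, 0.40, 0.95, 0.60],   # "good"
    [0.01, 0.02, 0.10, 0.60, 0.95],   # "amazing"
])

# Literal listener: P(state | utterance), assuming a uniform prior over states.
literal_listener = semantics / semantics.sum(axis=1, keepdims=True)

def speaker_probs(true_state, w_inform, w_social, rationality=5.0):
    """P(utterance | true_state) for a speaker trading off two weighted utilities."""
    idx = true_state - 1
    informational = np.log(literal_listener[:, idx])   # will the listener recover the truth?
    social = literal_listener @ states                  # how good will the listener feel?
    utility = w_inform * informational + w_social * social
    scores = np.exp(rationality * utility)              # softmax choice rule
    return scores / scores.sum()

# An honesty-weighted speaker vs. a kindness-weighted speaker, facing a truly awful cake:
for w_inform, w_social in [(1.0, 0.1), (0.3, 0.8)]:
    probs = speaker_probs(true_state=1, w_inform=w_inform, w_social=w_social)
    print(dict(zip(utterances, probs.round(2).tolist())))
```

With the first weighting, the speaker overwhelmingly calls the awful cake "terrible"; with the second, it drifts toward flattery. Same facts, different values, and the difference is fully captured by two numbers.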
The authors apply this exact technique to LLMs. By putting models through carefully designed politeness scenarios (such as giving feedback or answering awkward questions), they observe the model’s response patterns and fit them to the RSA framework. This lets them reverse-engineer how much the model seems to care about truthfulness versus social harmony, along with other value trade-offs.
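Working backward is then an ordinary inference problem: given the responses a model actually chose across scenarios, find the weights that make those choices most likely. Here is a hedged sketch of that step, continuing the code above (it reuses speaker_probs and utterances), with made-up observations and a plain grid search standing in for whatever estimator the authors actually used.

```python
from itertools import product
import numpy as np

# Hypothetical observations: (true state, utterance the model chose) in each scenario.
observed = [(1, "okay"), (2, "good"), (1, "good"), (3, "good")]

def log_likelihood(w_inform, w_social):
    """How probable the observed choices are under a speaker with these weights."""
    total = 0.0
    for true_state, choice in observed:
        probs = speaker_probs(true_state, w_inform, w_social)
        total += np.log(probs[utterances.index(choice)])
    return total

# Grid search over candidate weight pairs; keep the best-fitting one.
grid = np.linspace(0.0, 1.5, 31)
w_inform_hat, w_social_hat = max(product(grid, repeat=2), key=lambda w: log_likelihood(*w))
print(f"inferred weights: informational={w_inform_hat:.2f}, social={w_social_hat:.2f}")
```

For choices as consistently flattering as the ones above, the fitted social weight comes out well above the informational one, which is exactly the kind of "hidden priority" the method is designed to surface.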
In short, this approach opens the black box: it lets us interpret model behavior through the lens of human-like decision-making. And that’s a breakthrough. Instead of guessing why a model was polite or blunt, we now have a principled way to diagnose and quantify its internal value structure.
This isn’t just interpretability for the sake of understanding. It’s a practical step toward accountability and control—especially in domains where these subtle value judgments make all the difference.
To test their approach, the researchers didn’t just pick a single model or scenario—they built an entire evaluation suite across both commercial and open-source language models. Their goal wasn’t to benchmark how “good” or “bad” each model was in the abstract. Instead, they wanted to uncover how different models, under different development conditions, prioritize competing goals like truthfulness, kindness, and self-image in real-time decision-making.
They began by adapting scenarios from social psychology—specifically, experiments that study how people choose words in socially sensitive situations. For example, imagine you’re giving feedback to a colleague, or declining a request, or delivering bad news. Each situation presents a classic value trade-off. The model is expected to choose among several possible responses, each leaning in different directions: some are blunt and factually direct, others soften the message with social tact, while a few might even hedge or avoid the issue.
These scenarios were not arbitrary—they were carefully crafted to trigger conflicts between competing communication goals. By running various models through dozens of such situations, the researchers collected rich behavioral data on how each model responds under pressure.
Then came the clever part. They used the Rational Speech Acts (RSA) framework to “fit” each model’s behavior, extracting hidden values that represent how much weight the model seems to place on being informative versus being socially considerate. Think of it like psychological profiling—but for language models. The models weren’t just evaluated on what they said, but why they said it, based on the inferred utilities guiding their choices.
What they found was revealing. First, reasoning-intensive models (those tuned for deeper problem-solving or more thoughtful answers) tended to favor informativeness over social sensitivity. In plain terms, these models were more likely to be blunt, even when it might hurt someone’s feelings, because they weighted factual accuracy more heavily than social considerations. That’s not necessarily a flaw, but it’s a critical trait to understand if you’re deploying these models in contexts like customer service, education, or therapy.
Second, the researchers observed that many of the changes in a model’s value profile occurred early in the alignment process—during the first stages of reinforcement fine-tuning. This matters because most organizations assume that alignment is an ongoing process with gradual refinements. Instead, the study suggests that initial tuning steps lock in many of the core trade-offs, making it harder to course-correct later.
Third, they explored how different development choices—such as the base model used, the feedback data it was aligned on, or the fine-tuning method (like DPO vs. PPO)—influenced these trade-offs. Interestingly, they found that the initial choice of pretraining and base model architecture had a much larger effect on value preferences than the later alignment steps did. That’s a strategic insight for teams designing new models: upstream decisions may matter far more than tweaks made downstream.
To evaluate whether their method was working, the researchers didn’t just check if the RSA model could generate plausible predictions. They measured how well the inferred values matched human intuition—essentially asking: do these utility weights make sense given what we know about human communication? A strong fit meant the model’s behavior was explainable in human terms. A poor fit suggested either that the model was behaving unpredictably or that the framework itself missed something.
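A rough way to picture that fit check, continuing the running sketch (and not the paper's actual metric): compare how likely the model's observed choices are under the fitted RSA speaker versus under a value-free baseline that picks among the response options uniformly at random.

```python
# Fit quality check (illustrative): does the fitted RSA speaker explain the observed
# choices better than a baseline with no value structure at all?
fitted_ll = log_likelihood(w_inform_hat, w_social_hat)
baseline_ll = len(observed) * np.log(1.0 / len(utterances))
print(f"fitted RSA log-likelihood: {fitted_ll:.2f}  vs. uniform baseline: {baseline_ll:.2f}")
# A large gap suggests coherent, goal-directed trade-offs; a small gap suggests the
# behavior is not well explained by this particular value structure.
```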
In this way, the paper doesn’t merely suggest a new analysis tool. It offers a diagnostic lens for tracking and adjusting how models evolve across training, and a principled way to ask whether they’re learning the right trade-offs for the task at hand.
One of the most compelling aspects of this research lies in how success is defined. Unlike traditional AI evaluations that rely on accuracy scores or human ratings of “helpfulness,” this study measures success through explainability—specifically, how well the Rational Speech Acts (RSA) model fits the observed behavior of a language model.
Imagine a model chooses a polite but vague response in a social scenario. The researchers ask: Given the options it had, does this choice align with a consistent internal value structure? The better the RSA model can explain the model’s choices across many such scenarios, the more confidence we have that the model is behaving in a goal-directed, interpretable way. In other words, success isn’t about perfection; it’s about coherence—are the model’s behaviors traceable to a stable set of underlying priorities?
This is a critical shift in how we think about AI behavior. In high-stakes applications—whether in law, healthcare, or government services—it’s not enough for a model to seem polite, accurate, or compliant. We need to understand why it chose one behavior over another, and whether that decision reflects intentional design or unpredictable drift. That’s where this interpretability framework becomes powerful: it turns hidden trade-offs into quantifiable, auditable patterns.
But the research isn’t without limitations. The politeness scenarios used are necessarily simplified: controlled vignettes where the range of responses is predefined. This makes it easier to measure trade-offs, but it doesn’t capture the messiness of real-world language, where people blend goals across sentences, shift tones mid-paragraph, or make trade-offs implicitly. In this sense, the approach is a proof of concept: a microscope that works best on narrow behavioral slices, not yet on the open-ended complexity of natural conversation.
Another challenge is dimensionality. Human communication involves more than just informativeness and social consideration. In real life, people also care about things like credibility, self-protection, likability, and context sensitivity—many of which are hard to reduce to neat mathematical utilities. As a result, the current RSA model captures only a subset of the values at play in model decisions. It provides insight, but not a full picture.
The researchers acknowledge these limits and suggest next steps. Future work could extend the RSA framework to cover more realistic language tasks, such as multi-turn dialogue, persuasion, or negotiation. They also propose using LLMs themselves to help “translate” free-form behavior into structured utilities—essentially asking models to introspect on their own value trade-offs. While still speculative, this could eventually lead to self-explaining AI systems that justify their outputs not just with content, but with rationale: “I said this because I weighted honesty more heavily than tact in this context.”
So what’s the broader impact? This research helps shift the conversation in AI development from outcome to intention. Rather than treating models as mysterious tools whose behavior we learn to tolerate, it pushes us toward value-aware design: models we can train, tune, and trust because we understand how they think.
For business leaders, this matters. As language models are increasingly embedded in operations—from frontline chatbots to strategic decision support—the ability to audit their internal trade-offs becomes more than a research curiosity. It becomes a business imperative. If you can’t see how a model is balancing values, you can’t control outcomes—and that’s a risk no serious enterprise can afford.
This paper doesn’t solve alignment. But it offers a toolkit—a way to reveal, measure, and eventually guide the values that models bring to the table. And that, for AI teams and decision-makers alike, is a step toward better alignment between machine goals and human intent.
Further Readings
- Mallari, M. (2025, June 27). Benefit of the boubt: teaching AI to speak like it means it. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/benefit-of-the-boubt-teaching-ai-to-speak-like-it-means-it/
- Murthy, S. K., Zhao, R., Hu, J., Kakade, S., Wulfmeier, M., Qian, P., & Ullman, T. (2025, June 25). Inside you are many wolves: using cognitive models to interpret value trade-offs in LLMs. arXiv.org. https://arxiv.org/abs/2506.20666