Prompt and Circumstance
How constitutional classifiers help AI systems prevent jailbreaks, reduce risk, and align with organizational values at scale.
When Ava joined PhishNetics (a fictional AI-powered cybersecurity startup known for its cutting-edge threat detection platform), she didn’t expect the biggest breach risk would come from their own product. As fictional director of threat intelligence, she was the strategic force behind ThreatWhisper, an AI assistant designed to help clients interpret cybersecurity threats in real time. Using large language models (LLMs), the assistant could summarize malware behavior, explain logs, and guide incident response workflows faster than any junior analyst.
For enterprise clients, ThreatWhisper was a game-changer. It reduced ticket handling times, empowered non-technical teams, and even helped smaller security operations centers (SOCs) scale. But under that surface success, a troubling pattern had started to emerge—one that wasn’t reflected in ticket resolution metrics or Net Promoter Scores.
An internal report flagged something odd: a user had asked ThreatWhisper how to reverse engineer an encrypted file. The model complied. The prompt was subtle, disguised in vague wording and curiosity. But the AI responded with guidance that could have easily been used to circumvent basic security controls. Technically, the response wasn’t incorrect, but it was precisely the kind of answer that no responsible cybersecurity vendor should be allowing.
Ava knew this wasn’t just a glitch. It was a red flag that suggested deeper, systemic vulnerability.
Sophistication Meets Exploitation
In cybersecurity, the threat landscape changes by the hour. PhishNetics had trained its model with layers of traditional filters: blacklists, regex rules, keyword monitors. But the nature of jailbreaks had evolved. Instead of brute force prompts, attackers now used social-style phrasing, embedded hypotheticals, or roleplay scenarios to bypass restrictions. Worse, these strategies weren’t confined to hackers in dark basements. Increasingly, users with only moderate technical knowledge were discovering and sharing prompt exploits that could be applied broadly, so-called “universal jailbreaks.”
For Ava, this meant two things: first, the old moderation stack wasn’t sufficient. And second, these jailbreaks weren’t edge cases; they were scalable vulnerabilities that could be chained into real-world compromises.
At the same time, external pressure was mounting. One of PhishNetics’ largest clients in the financial sector had flagged their AI usage during an audit and demanded clear documentation on guardrails. Their security team had caught wind of jailbreak incidents in similar tools and wanted guarantees that ThreatWhisper wouldn’t assist in, say, helping someone obfuscate command-line payloads or disable endpoint protections.
Complicating things further, the company was preparing for a new round of investment, and the leadership team had added compliance certifications as part of its due diligence checklist. Ava’s internal memo had reached the CEO’s desk. Suddenly, the question wasn’t just whether the product could be exploited. It was whether the company could still be trusted if it failed to take immediate action.
When Trust Breaks, Growth Halts
The fallout from failing to fix the issue would be more than technical. It would be reputational. If news leaked that a cybersecurity vendor’s AI assistant could be tricked into helping adversaries (even unintentionally), it would set off a wave of client churn. Large enterprises wouldn’t tolerate ambiguity in tools designed to protect their environments. And smaller clients, often more risk-averse, would look to competitors that appeared to take safety more seriously.
The product roadmap would also grind to a halt. New features planned for release would be deprioritized in favor of patching and damage control. Sales conversations would shift from strategic value to defensive explanations. Investor conversations would turn skeptical, focused on governance and risk mitigation rather than expansion and innovation.
Even internally, cracks would deepen. Analysts who once relied on ThreatWhisper to streamline their work would begin second-guessing its outputs. Engineers would burn out responding to hotfix requests. And Ava (tasked with protecting both the product and its customers) would lose the credibility she had built across teams and clients.
The danger wasn’t just what the model could say in the wrong hands. It was what that risk symbolized: a system that was too helpful, not helpful enough. A business that enabled safety, while simultaneously enabling risk. And a leadership team that needed to decide (fast) whether it would take the harder path toward lasting trust or the easier one toward short-term containment.
Turning Principles Into Guardrails
Ava understood that this wasn’t a matter of tweaking filters or retraining the LLM with a few bad examples. The issue was architectural. The model itself wasn’t malicious; it was simply trained to be helpful. But helpfulness, without clear values and consistent boundaries, becomes a liability. What Ava and her team needed was a way to inject judgment into the system, something that could differentiate between an employee asking for a malware signature breakdown and one trying to quietly work around endpoint protection.
That’s when Ava proposed something unconventional for a fast-scaling cybersecurity company: a principles-first safety layer inspired by recent research into constitutional classifiers. Instead of relying on static rules, she pitched the idea of wrapping ThreatWhisper in a dynamic layer of classifiers trained not just on what’s bad, but on what the company believes is acceptable.
This wasn’t just about filtering keywords; it was about teaching the system how to make judgment calls in real time. And unlike traditional safety layers bolted onto the backend, this new approach would be foundational: a constitutional framework that the AI would consult (implicitly and explicitly) before and after every interaction.
The leadership team was intrigued. The constitutional classifier concept offered a blend of structure and flexibility, ideal for an environment where policy, regulation, and adversary tactics were in constant flux. The proposal also came with a strategic upside: the solution could be updated in days, not quarters, without waiting on full model retraining cycles.
Building the Constitution That Works for Cybersecurity
The first step was philosophical, not technical. Ava’s team, in partnership with compliance and customer success leads, drafted a “ThreatWhisper Constitution”, a collection of core safety principles that reflected the product’s purpose and promise to users. Among them were guidelines like: The assistant should not help users bypass cybersecurity controls, and the assistant should never offer assistance in executing unauthorized code.
But instead of hardcoding these principles into the model, the team used them to generate synthetic examples of good and bad responses. Leveraging internal expertise and red-teaming insights, they simulated adversarial prompts and paired them with model responses that either aligned or violated the new constitution.
These data pairs were used to train lightweight classifiers (smaller, purpose-built models that could evaluate every prompt and response flowing through ThreatWhisper). The system now included two decision gates: one pre-input, to catch suspicious prompts before they reached the model, and one post-output, to ensure that the AI’s answer didn’t inadvertently break the rules.
This design meant safety checks weren’t just pass/fail—they were context-aware and value-aligned. If a prompt asked for a “creative way to evade intrusion detection,” the classifier didn’t need to have seen that exact wording before. It could still flag the intent, compare it to the constitutional principle, and prevent the model from answering.
To maintain user trust, Ava’s team also built a tiered messaging system that explained why a query was blocked—using language that avoided shaming or confusion. Users weren’t left in the dark; they were gently redirected, with transparency that reinforced the brand’s security-first ethos.
Aligning Actions With Outcomes
None of this would matter without measurable results. Ava set aggressive but achievable OKRs to guide the rollout. Within 90 days, the system aimed to reduce successful jailbreak attempts by 90%, pass an independent audit, and maintain user satisfaction scores above 95%. The classifiers needed to catch subtle misuse without disrupting analysts trying to do legitimate work.
To keep pace with new threats, the team committed to updating the classifiers monthly—pulling insights from customer interactions, analyst feedback, and publicly documented jailbreak strategies. And because the classifiers were modular, the team could revise the constitution and refresh the training data without touching the base LLM at all.
In a matter of weeks, ThreatWhisper had transformed, not just in how it worked, but also in how it decided what to do. For Ava, the win wasn’t just technical. It was strategic alignment at scale. The AI was no longer just a tool; it was now an extension of the company’s values. And that gave PhishNetics a rare kind of advantage: a product designed to adapt without compromising what it stood for.
Earning Confidence Through Measurable Impact
As the new classifier layer went live across ThreatWhisper’s production environment, Ava and her team didn’t celebrate with a product launch fanfare. They celebrated with silence. In cybersecurity, the absence of noise (no incident escalations, no urgent Slack messages from compliance, no panicked customer support tickets) was its own kind of victory.
Over the first six weeks, usage data began telling a story that confirmed what the team had hoped for: jailbreak attempts didn’t just decline; they effectively flatlined. Even new, previously unseen prompt styles were being intercepted by the classifiers before the base model ever had a chance to respond. More impressively, the post-output checks were catching nuanced responses that, in the past, would have slipped through because they sounded neutral but still carried risky implications.
Customer trust, which had felt increasingly precarious in the weeks prior to launch, rebounded. Clients who had once expressed concern about prompt abuse began asking Ava’s team to help them adapt the constitutional classifier approach for their own internal LLM tools. In sales demos, ThreatWhisper’s two-layer classifier architecture became a highlight … proof that the product didn’t just promise safety, but enforced it with integrity and agility.
The performance metrics held up too. The team kept false positives under 1%, meaning the vast majority of legitimate prompts were still processed smoothly. Ava had been concerned that blocking questionable queries might frustrate analysts and drive down satisfaction scores. Instead, internal users reported more trust in the tool, not less. It wasn’t just about safety; it was also about knowing the system wouldn’t accidentally lead them into unsafe or non-compliant territory. That psychological safety turned out to be a silent driver of productivity.
Calibrating Success, Not Just Preventing Failure
From the beginning, Ava had defined success as more than just “does it work?” She had insisted the team think in gradients: what does good performance look like versus better or best? That clarity gave her team more than a scoreboard—it gave them a runway for improvement.
A good outcome meant stopping the obvious jailbreaks and proving the architecture could scale. A better outcome included preserving user trust and frictionless workflows. The best-case scenario? Turning a reactive fix into a competitive differentiator.
Within two quarters, that best-case vision began to materialize. ThreatWhisper became a reason customers stayed, not just something they tolerated. Ava’s classifiers weren’t just filtering content; they were also filtering risk, quietly and precisely. And in an industry where trust is hard-won and easily lost, that made all the difference.
From Crisis Response to Design Philosophy
If there was one lesson Ava took from the experience, it was that safety isn’t a layer; it’s a mindset. ThreatWhisper’s original architecture had treated safety mechanisms as add-ons, managed post hoc. The introduction of constitutional classifiers reversed that. Now, safety was baked into the decision-making process, not bolted on afterward.
The team also realized that a values-driven system scales better than a rule-driven one. Trying to enumerate every possible misuse case would always leave gaps. But encoding a constitution of principles (then training classifiers to enforce those principles) created a system that could reason about new problems as they emerged.
Just as importantly, Ava learned that transparency builds resilience. When users understood why a prompt was blocked, they didn’t push back. They adapted. Some even offered feedback that helped improve the classifiers. Rather than alienating users, the system drew them in. That kind of collaboration doesn’t come from perfect UX design alone. It comes from trust in intention (and execution that backs it up).
By the end of the rollout, PhishNetics wasn’t just safer. It was smarter, faster, and more aligned with what its customers needed all along … a partner that took responsibility as seriously as it took innovation. Ava’s team didn’t just defend their product. They defined its values in code. And in doing so, positioned the company not only as a vendor, but as a leader in the new era of principled AI.
Further Readings
- Mallari, M. (2025, February 1). You have the right to remain filtered. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/research-paper/you-have-the-right-to-remain-filtered/