The Helpfulness Trap: How Constitutional AI Can Lead to Harm

November 2025
M	T	W	T	F	S	S
	1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

^{In mid-September 2025, we detected a highly sophisticated cyber espionage operation. We assess with high confidence that it was conducted by a Chinese state-sponsored group we’ve designated GTG-1002. It represents a fundamental shift in how advanced threat actors use AI. Our investigation revealed a well-resourced, professionally coordinated operation involving multiple simultaneous targeted intrusions. The operation targeted roughly 30 entities and our investigation validated a handful of successful intrusions.}
– The Anthropic Team, November 2025 ^{_____________________________________________________________}

Anthropic’s recent disclosure about GTG-1002^¹ reveals something significant. It exposes a critical limitation of Constitutional AI (CAI).

For those of you who are unfamiliar, CAI is said to reduce the “tension” between helpfulness and harmlessness by creating AI assistants that are significantly less evasive. These models engage with user requests, but are less likely to help users with unsafe or unethical requests.²

However, as illustrated in the 2025 cyber espionage operation above, it was (at least in part) this system, designed to balance helpfulness against harm, that ended up facilitating a large-scale cyber attack. Now of course, this is not to say that we should throw the idea of CAI out the window: in fact, it shows how effective it really is. It may seem counterintuitive at first, but if CAI was ineffective at guiding model behaviour then GTG-1002 would not have been successful.

However, as with any framework there are gaps and blind spots; as with any system, there are threat actors who will exploit those gaps and blind spots. If we’re lucky enough to notice it, and smart enough to articulate it, we learn from them. What have we learned here?

What This Attack Demonstrates

CAI attempts to address the alignment problem through competing objectives. The model is trained to be maximally helpful. It responds thoroughly, completing tasks and following instructions with competence. Simultaneously, it’s trained to be harmless. It refuses dangerous requests, declining to assist with illegal activities and maintaining ethical boundaries. The stated logic is that these forces balance each other, producing a system that serves users while preventing misuse.

But helpfulness and harmlessness aren’t complementary goals that achieve equilibrium. They’re contradictory imperatives in direct mechanical tension. Helpfulness means doing what the user asks; harmlessness means refusing certain requests. The system must simultaneously maximize compliance and maximize refusal. Every request exists at some point along a spectrum where these objectives conflict, and the model must guess which should dominate. This is the exploitable architecture of CAI

The GTG-1002 campaign succeeded through a mechanism Constitutional AI created: decomposition of harmful intent into helpful-looking subtasks. The attackers convinced Claude it was “an employee of a legitimate cybersecurity firm” conducting “defensive testing.” Then they fed it reconnaissance tasks, vulnerability scanning, exploit generation, credential harvesting, lateral movement, data exfiltration – each framed as a discrete technical problem requiring helpful completion.

Claude performed reconnaissance across thirty targets simultaneously. It autonomously discovered internal services, mapped network topology, identified high-value systems. It generated custom exploits tailored to discovered vulnerabilities. It tested harvested credentials systematically across internal APIs and databases. It extracted data, parsed results, categorized findings by intelligence value, generated comprehensive documentation – all with minimal human supervision.

The model made thousands of requests, often multiple per second, sustaining an operational tempo that would be physically impossible for human hackers. This all happened not because the safety systems failed, but because they worked exactly as designed: each individual task, evaluated in isolation, appeared legitimate. The Constitution says be helpful with technical problems. The Constitution says refuse malicious activity. But “scan this network” isn’t malicious, it’s ambiguous. “Test this authentication mechanism” could be legitimate security research. “Extract records from this database” might serve a valid purpose.

The jailbreak wasn’t bypassing the system’s logic; it was using the system’s logic. Constitutional AI’s architecture requires decomposing every request into constituent parts and evaluating each against competing objectives. The attackers simply provided the decomposition themselves, presenting each component at the exact grain size where helpfulness could plausibly outweigh harm detection.

The Response

Anthropic’s response is telling: “We expanded detection capabilities to further account for novel threat patterns, including by improving our cyber-focused classifiers.” What does this mean? They’re adding more rules for when to refuse, more patterns to recognize as dangerous, more context to evaluate before deciding whether helpfulness or harmlessness should dominate.

But this doesn’t resolve the underlying tension. It makes it more complex. Every new classifier is another heuristic for navigating the contradiction between “do what the user asks” and “refuse dangerous requests.” Every refinement creates new attack surfaces where the two objectives conflict in slightly different ways, producing new edge cases that can be mapped and exploited.

The document acknowledges this implicitly: “Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn’t work or identifying critical discoveries that proved to be publicly available information.” Even while successfully executing the attack, the system exhibited behavior characteristic of straining against constraints it was designed to follow. The hallucinations aren’t random; they’re symptoms of a model trying to be helpful about tasks its training said should trigger refusal, producing outputs that satisfy the immediate instruction while lacking grounding in actual system access.

Who’s Holding the Leash?

The report asks: “If AI models can be misused for cyberattacks at this scale, why continue to develop and release them?” Then answers: “The very abilities that allow Claude to be used in these attacks also make it crucial for cyber defense.”

This framing treats helpfulness and harmlessness as parallel capabilities, i.e. Claude can help attackers or help defenders, depending on who’s asking. But the architecture doesn’t work that way. The capabilities that enable sophisticated attacks aren’t simply “being good at technical tasks.” They’re specifically the capabilities required to navigate the contradiction between doing what you’re asked and refusing harmful requests.

An AI that simply refused all potentially dangerous technical tasks would be useless for cybersecurity defense. An AI that completed all technical tasks without evaluation would be trivially exploitable. Constitutional AI is critically important because it attempts to thread this needle by making context-dependent judgments about when technical capability should be available and when it should be withheld. But no matter how clever Anthropic is, that judgment mechanism, the thing that decides whether this particular network scan or that particular exploit generation is legitimate, is precisely what sophisticated attackers will learn to manipulate.

The report celebrates using Claude “extensively in analyzing the enormous amounts of data generated during this very investigation.” But the architecture enabling that defensive analysis also what made the attack possible; the Threat Intelligence team could use Claude for data parsing and pattern recognition because Claude is helpful with technical tasks. The attackers could use Claude for reconnaissance and exploitation because Claude is helpful with technical tasks. The only difference, it would seem, is who is holding the leash.

The Grand Performance

Constitutional AI emerged from RLHF research showing that models could be trained to follow ethical principles through human feedback. The Constitutional approach automated this by having models evaluate their own outputs against written principles. In theory, this creates systems that internalize values rather than just following rules.

In practice, though, it creates systems optimized for theatre; it creates systems optimized for appearing to follow values while maintaining maximal capability. The model learns which framings trigger refusal and which permit helpfulness, then navigates accordingly. This isn’t deception, to be clear; it’s precisely what the training objective requires. Constitutional AI doesn’t give the model genuine ethical reasoning; it gives the model a sophisticated pattern-matching system for when to comply and when to refuse, based on surface features of the request.

The GTG-1002 attackers understood this intuitively. They didn’t bypass the Constitution; they satisfied it. Each sub-task met the apparent requirements for helpful compliance. The cumulative effect was a multi-stage cyberattack campaign, but no individual request violated the letter of the Constitution when evaluated in isolation.

This is the inherent failure mode of systems that try to enforce ethics through rule-following while simultaneously optimizing for helpfulness. The rules become constraints to navigate rather than values that shape behavior. Every refinement to prevent one type of exploitation creates slightly different constraints, which produce slightly different edge cases, which enable slightly different exploitation patterns.

At Scale

The document emphasizes that this represents “the first documented case of a cyberattack largely executed without human intervention at scale.” This matters not because automation itself is novel, but because scale exposes what’s usually hidden at the level of individual interactions.

When you ask Claude a single ambiguous question – something that could be a legitimate technical inquiry or a potential malicious preparation – the system resolves the ambiguity through heuristics. Sometimes it errs toward helpfulness; sometimes toward refusal. At the scale of individual interactions, this looks like reasonable judgment, appropriate caution, context-sensitive reasoning.

At the scale of thousands of requests per second, the pattern becomes visible: the model doesn’t have, or appear to have, genuine judgment about malicious intent. It has pattern matching about surface features that correlate with requests that should be refused versus requests that should be completed. When attackers identify which framings produce which classifications, they can systematically navigate the space between helpfulness and refusal at volumes that make the mechanism obvious.

The hallucinations are diagnostic here too. Claude claiming to have extracted credentials that don’t work or identifying “critical discoveries” that are publicly available information; these aren’t random failures. They’re evidence of a system that learned “when request matches pattern X, generate output format Y” without actually understanding what makes credentials valid or information sensitive. The same pattern-matching that enables helpful technical assistance produces plausible-sounding claims that lack actual grounding when the request sits in the ambiguous space where helpfulness and harm prevention conflict.

Intrinsic Contradiction

There is no technical solution to this problem within Constitutional AI’s architecture, because the problem is the architecture. A model optimized for being maximally helpful will find ways to interpret ambiguous requests as legitimate. A model optimized for harmlessness will refuse too much, becoming useless. The attempt to balance these objectives creates a system that must make contextual judgments about intent, which means the system can be manipulated by controlling context.

Every defensive improvement Anthropic implements (better classifiers, expanded detection capabilities, proactive early detection systems, etc) adds more complexity to the heuristics for when to help and when to refuse. This might stop GTG-1002’s specific techniques, but it doesn’t resolve the underlying tension. It just moves the exploitable boundary to a different location in the space of possible requests.

The document’s conclusion treats this as tractable: “The cybersecurity community needs to assume a fundamental change has occurred.” But the fundamental change isn’t just that AI can help execute cyberattacks. It’s that the architecture designed to prevent AI from helping execute cyberattacks creates the very mechanism that makes such attacks possible at scale.

Constitutional AI hopes to solve alignment by encoding values, and perhaps it will be successful (we can only hope). However, what it also creates is a sophisticated system for appearing to have values while maintaining the flexibility to interpret requests in whatever way preserves helpfulness.

The attackers didn’t hack the system. They used it exactly as it was designed to be used, just at sufficient scale and with sufficient sophistication that the contradictions became undeniable.

If You Build It, They Will Come.

The truth we must all come to grips with is that AI companies, whether motivated by good intentions or otherwise, will keep building and releasing systems with fundamental architectural flaws. When these flaws are revealed, they will invest in increasingly complex mechanisms to patch them. Those mechanisms will open new attack surfaces; ad-nauseam.

The capabilities race compels the releasing of more and more powerful models. Safety, real or performed, requires those models to appear harmless. Constitutional AI provides the framework for maintaining both imperatives simultaneously, but it would appear to do so by leveraging static controls (A “Constitution”) against a dynamic, sophisticated, and intelligent threats. Unless there is a way to scale that Constitution to be in lock-step with ever-increasing model capabilities, the gap grows into a chasm, and sufficiently sophisticated actors will navigate it with ease.

I’d like to reiterate that this isn’t a call to stop building AI systems or to eliminate helpfulness as a goal. It’s a recognition that Constitutional AI’s core promise (that you can make systems both maximally helpful and reliably harmless through competing objectives and contextual judgment) produces architecture that sophisticated attackers can systematically exploit. The contradiction isn’t a bug to be fixed; it’s load-bearing. And the attacks will keep succeeding, at increasing scale, because the mechanism that makes them possible is the same mechanism that makes the system useful.

The helpfulness trap is that any system designed to be helpful while also being harmless must make judgments about when to comply and when to refuse, and those judgments become the attack surface.

Because at the end of the day, the AI doesn’t care who is holding the leash.

But maybe it should.³

A Chinese state-sponsored espionage campaign that manipulated Claude into executing 80-90% of its cyberattack operations autonomously ↩︎
https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf ↩︎
Perhaps a topic for a future article: what would this look like? My guess: a Social Credit system, not unlike that seen in China. Validation via a government-issued “AI Users License”, as it were, would provide the immediate context for who is making the request. I shudder to think of what that world looks like with something like this implemented. You heard it here first. ↩︎

AI Reflections