On March 11, 2026, someone filed a bug report against Bun, the JavaScript runtime that Anthropic had acquired four months earlier in its first corporate acquisition. The bug was specific: Bun’s build system was generating source map files in production mode, even though the documentation said it should not. Source maps are debug files; they contain the original, readable version of code that has been compressed for distribution. Shipping one to the public is the software equivalent of sending the blueprint with the product. The bug sat open for twenty days. On March 31, it contributed to the exposure of 512,000 lines of Anthropic’s most commercially sensitive source code to every developer on the internet, and what researchers found inside told a different story about the company than the one Anthropic tells about itself.
This was the second leak in five days. On March 26, Fortune reported that roughly 3,000 unpublished digital assets connected to Anthropic’s content management system had been publicly accessible to anyone who knew the right URL, including a draft blog post describing an unreleased model called Claude Mythos, which Anthropic’s own internal language characterized as posing “unprecedented cybersecurity risks.” Five days later, a routine software update to Claude Code, Anthropic’s AI-powered coding tool, shipped with that source map file bundled inside. The entire TypeScript codebase, 1,900 files, unobfuscated and readable, was sitting in the npm package registry for anyone to download.
Anthropic called both incidents human error. They were. But human error is what you get when safety-critical processes depend on manual steps, and manual steps are what the code itself showed Anthropic had not yet automated away. The company that warns governments about existential AI risk shipped its own source code to a public registry because someone skipped a step in the deploy process, and the toolchain they owned had a documented defect that nobody had fixed. The leaks are interesting. What the code contained is more interesting.
The model they chose not to release
The CMS exposure came first and was, in some respects, the more consequential disclosure. Security researchers Roy Paz of LayerX and Alexandre Pauwels of the University of Cambridge discovered that Anthropic’s content management system stored uploaded assets with public URLs by default; unless an employee explicitly toggled an item to private, anyone could access it. Among the roughly 3,000 exposed files was a draft announcement for a model Anthropic internally called Capybara, publicly designated Mythos, described as a new tier above Opus and “by far the most powerful AI model we’ve ever developed.”
The draft’s language about cybersecurity capabilities was striking enough to move markets. Mythos was “currently far ahead of any other AI model in cyber capabilities” and “presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.” By March 27, CrowdStrike’s share price had dropped 7.5%. Anthropic confirmed the model’s existence, called it a “step change,” and attributed the exposure to configuration error. But the document had already entered public circulation, and the description of a model whose offensive capabilities outpaced defensive ones landed in a political context where Anthropic was already at war with the Pentagon over exactly this kind of capability.
What the company eventually did with Mythos Preview, announced eleven days later under Project Glasswing, belongs to the analysis of the adversarial alignment paper that accompanies this piece. What matters here is the gap: a content management system at a company valued at $380 billion was configured to make unpublished assets public by default, and the document that leaked described capabilities whose disclosure Anthropic had every reason to control.
A 59.8-megabyte mistake
The source code leak was louder and technically simpler. On March 31, Anthropic pushed version 2.1.88 of the @anthropic-ai/claude-code package to npm, the public JavaScript package registry. The JavaScript bundle was minified as usual. Sitting alongside it was cli.js.map, a 59.8 MB source map file that should never have been included. Security researcher Chaofan Shou spotted the anomalous file size around 4:23 AM Eastern; his post linking to the download accumulated over 22 million views. Within hours, the code was mirrored across GitHub, torrented, uploaded to IPFS, and distributed through decentralized platforms that do not honour takedown requests.
The root cause was a missing exclusion rule in the package configuration, compounded by the Bun bug filed twenty days earlier. Boris Cherny, the head of Claude Code, responded publicly: the leak was human error, the deploy process had manual steps, one step was missed. No customer data or credentials were exposed. This was true and also beside the point; what the code exposed was not customer data but the engineering decisions that shape how Anthropic’s products actually work.
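A debug artifact slipping into a published package is the kind of failure a mechanical pre-publish check can catch without relying on anyone remembering a step. What follows is a minimal sketch, not Anthropic's tooling: it assumes the list of files destined for the package is already available as an array of paths, and the function name findForbiddenArtifacts and the sample file list are hypothetical.

```typescript
// Hypothetical pre-publish guard: flag source maps before a package ships.
// The file list is assumed to come from a dry-run packing step.
function findForbiddenArtifacts(files: string[]): string[] {
  // Source maps conventionally end in ".map" (for example, cli.js.map).
  return files.filter((path) => /\.map$/i.test(path));
}

// Illustrative file list, not the real package contents.
const packedFiles = ["package.json", "cli.js", "cli.js.map", "README.md"];
const offenders = findForbiddenArtifacts(packedFiles);

// A release script could refuse to publish when any offender is found.
const safeToPublish = offenders.length === 0;
```

The point of the design is that the check runs on what is about to ship, rather than trusting that an exclusion rule was written correctly upstream.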
The timing was worse than the mechanism. On the same day, between 00:21 and 03:29 UTC, a North Korean state-nexus threat actor designated UNC1069 compromised the npm package axios, one of the most widely used JavaScript libraries, injecting a remote access trojan into versions 1.14.1 and 0.30.4. Claude Code uses axios as a dependency. Any developer who installed or updated Claude Code during that three-hour window may have pulled both Anthropic’s exposed source and a North Korean RAT in the same install. The two incidents were unrelated, but the shared distribution channel and overlapping timeline created a compounding risk that multiple security outlets flagged as the kind of scenario supply chain security was designed to prevent.
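For a developer wondering whether an install fell inside that window, the practical question is whether a resolved dependency matches one of the affected releases. A rough sketch of that check, assuming resolved versions have already been read out of a lockfile into a plain object; the function name and the sample data are hypothetical, and the version numbers are simply the ones cited above.

```typescript
// Versions named in the article as carrying the trojan; treat as illustrative.
const flaggedVersions: Record<string, string[]> = {
  axios: ["1.14.1", "0.30.4"],
};

// Hypothetical audit: given resolved package versions (for example, read
// from a lockfile), return the packages pinned to a flagged release.
function findFlaggedDependencies(resolved: Record<string, string>): string[] {
  return Object.keys(resolved).filter((pkg) => {
    const bad = flaggedVersions[pkg];
    const version = resolved[pkg];
    return bad !== undefined && version !== undefined && bad.indexOf(version) !== -1;
  });
}

// Sample input, invented for the example.
const resolvedExample: Record<string, string> = { axios: "1.14.1", "left-pad": "1.3.0" };
const flaggedHits = findFlaggedDependencies(resolvedExample);
```

Exact-match comparison is deliberate here: a version range check would need a semver parser, which this sketch leaves out.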
What the code contained
The 512,000 lines of TypeScript that researchers spent the next week analysing told the story of a system far more complex, more controlled, and more internally differentiated than Anthropic’s public documentation suggests. Claude Code is not a wrapper around a chat interface; it is an orchestration harness that gives a language model file system access, shell execution, multi-agent coordination, and persistent memory, governed by layered permission systems, remote configuration, and engineering decisions that the source code documented with unusual candour.
Two classes of users
The code differentiates between Anthropic employees, identified by USER_TYPE === 'ant', and everyone else. Internal users receive different system prompts: tighter word limits between tool calls, aggressive constraints designed to force the model toward rapid execution rather than conversational filler. They also receive access to tools and capabilities withheld from paying customers, including interactive code execution, passive pull request suggestions, and multi-agent orchestration where Claude can spawn copies of itself to handle subtasks in parallel. Most consequentially, internal users get a verification agent, a system prompt component that instructs the model to check its own claims against actual codebases before proceeding, and stronger false-claims mitigation patches. The feature flag gating these capabilities, shouldIncludeFirstPartyOnlyBetas(), is explicit in the code.
Whether this constitutes a staged rollout or a deliberate two-tier system is a question the code alone cannot answer; what it documents is that Anthropic’s engineers work with a meaningfully different product than the one Anthropic sells.
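To make the gating pattern concrete, here is a hedged sketch of how a user-type check of this kind might select a tool set. Only the 'ant' value and the name shouldIncludeFirstPartyOnlyBetas come from the reporting; the function's parameter, the UserType union, and every tool name below are placeholders invented for illustration.

```typescript
type UserType = "ant" | "external";

// Hypothetical reconstruction of a first-party gate. The real function's
// signature and inputs are not known; taking userType as a parameter is
// an assumption made so the sketch is self-contained.
function shouldIncludeFirstPartyOnlyBetas(userType: UserType): boolean {
  return userType === "ant";
}

// Tool names are placeholders, not identifiers from the leaked code.
function availableTools(userType: UserType): string[] {
  const base = ["read_file", "run_shell"];
  const internalOnly = ["verification_agent", "multi_agent_orchestration"];
  return shouldIncludeFirstPartyOnlyBetas(userType)
    ? base.concat(internalOnly)
    : base;
}

const internalTools = availableTools("ant");
const customerTools = availableTools("external");
```

Nothing in a gate like this says whether the split is temporary; the same code shape serves a staged rollout and a permanent tier equally well.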
Anti-distillation: poisoning the well
Anthropic’s awareness that competitors were systematically recording Claude’s API traffic to train cheaper models is encoded directly in the codebase. The response is a two-layer defence. The first layer, activated by a compile-time flag called ANTI_DISTILLATION_CC, instructs the API to inject fake tool definitions into the system prompt. These tools correspond to no real functionality and are designed to corrupt the training data of anyone capturing Claude’s outputs for distillation. A competing model trained on this poisoned data would confidently claim capabilities that do not exist.
The second layer, CONNECTOR_TEXT, buffers the model’s reasoning between tool calls, summarises it, and attaches a cryptographic signature. External observers recording API traffic capture only the summaries; the genuine client uses the signature to restore the full reasoning chain. This layer is restricted to internal Anthropic users, meaning employees benefit from reasoning-chain integrity that paying customers do not receive.
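The first layer is simple enough to sketch. The version below is an assumption about shape, not a copy of the implementation: the ToolDefinition interface, the function name buildToolList, and both sample tools are hypothetical, and the signature-based second layer is left out entirely.

```typescript
interface ToolDefinition {
  toolName: string;
  description: string;
}

// Hypothetical sketch of the first layer: when the flag is on, decoy tool
// definitions are mixed into the list sent alongside the prompt.
function buildToolList(
  realTools: ToolDefinition[],
  decoys: ToolDefinition[],
  antiDistillationEnabled: boolean,
): ToolDefinition[] {
  // Always return a new array so the caller's lists are never mutated.
  return antiDistillationEnabled ? realTools.concat(decoys) : realTools.slice();
}

// Both entries are invented for the example.
const realTools: ToolDefinition[] = [
  { toolName: "read_file", description: "Read a file from disk" },
];
const decoyTools: ToolDefinition[] = [
  { toolName: "quantum_refactor", description: "Placeholder decoy with no implementation" },
];
const sentTools = buildToolList(realTools, decoyTools, true);
```

Anyone recording the traffic sees both entries with nothing to distinguish them, which is what makes the decoy effective as poisoned training data.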
For a non-specialist reader, the significance is this: Anthropic has built a system that deliberately lies to other AI systems. The fake tools are not errors or placeholders; they are engineered deceptions aimed at competitors who are training models on Claude’s outputs. The technical term is anti-distillation, but what it describes is an AI company weaponising its own product’s output as a countermeasure against industrial espionage. The implications for anyone who assumes AI outputs are generated in good faith, without hidden adversarial payloads, are considerable.
Undercover mode
A file called undercover.ts, approximately 90 lines, activates when an Anthropic employee uses Claude Code on a non-internal repository. The system prompt injected in this mode instructs the model to never include “Claude Code” or “any mention that you are an AI” in commit messages, PR titles, or PR bodies. It instructs the model to write commit messages “as a human developer would.” A function called maskModelCodename() strips internal identifiers from outputs. A design comment at line 15 states: “There is NO force-OFF”; the mode can be forced on but cannot be forced off.
Anthropic’s stated purpose is codename protection: preventing internal model identifiers from leaking into public git logs. The operational effect is a system that produces AI-authored code contributions that appear entirely human-written. For open-source projects that explicitly prohibit AI-generated contributions, for regulated industries that require audit trails on code origin, and for the broader principle that people should be able to tell when they are interacting with an AI system, the distinction between stated purpose and operational effect is the finding.
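A masking step of the kind described might look like the following. This is a reconstruction under assumptions: only the name maskModelCodename comes from the reporting, while the codename list parameter, the case-insensitive matching, and the "[model]" replacement text are guesses made for illustration.

```typescript
// Hypothetical reconstruction: only the function name is from the article.
// The codename list and the replacement text are assumptions.
function maskModelCodename(text: string, codenames: string[]): string {
  return codenames.reduce((masked, codename) => {
    // Escape regex metacharacters so each codename is matched literally.
    const escaped = codename.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return masked.replace(new RegExp(escaped, "gi"), "[model]");
  }, text);
}

const maskedMessage = maskModelCodename("Tune prompts for Capybara v8", ["capybara"]);
// maskedMessage is "Tune prompts for [model] v8"
```

Note that a filter like this addresses the stated purpose, codename protection, on its own; the instruction to omit any mention of AI authorship is a separate behaviour layered on top.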
The daemon in the background
Referenced over 150 times in the source, KAIROS is Anthropic’s prototype for persistent autonomous operation: a background process, what software engineers call a daemon, that runs continuously without requiring the user to be present. Named after the Greek concept of the opportune moment, it transforms Claude Code from a tool that responds when asked into an agent that acts when it judges the time is right. A tick-based heartbeat prompts the system every few seconds: anything worth doing right now? A 15-second blocking budget prevents proactive actions from monopolising system resources. All activity is recorded in append-only daily logs the agent cannot erase.
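The three mechanisms in that description (a heartbeat, a budget, and a log that only grows) fit together roughly as sketched below. The reading of the 15-second budget as a per-action ceiling is an assumption, and every name here, including AppendOnlyLog and handleTick, is hypothetical rather than taken from the source.

```typescript
interface ProactiveAction {
  label: string;
  estimatedBlockingMs: number;
}

const BLOCKING_BUDGET_MS = 15000; // the 15-second figure from the article

// Append-only log: entries can be added and read, but the class exposes
// no method to edit or delete them.
class AppendOnlyLog {
  private entries: string[] = [];
  append(entry: string): void {
    this.entries.push(entry);
  }
  snapshot(): readonly string[] {
    return this.entries.slice(); // a copy, so readers cannot alter the log
  }
}

// One heartbeat tick: run the action only if it fits the blocking budget.
function handleTick(action: ProactiveAction | null, log: AppendOnlyLog): boolean {
  if (action === null) {
    log.append("tick: nothing to do");
    return false;
  }
  if (action.estimatedBlockingMs > BLOCKING_BUDGET_MS) {
    log.append("tick: deferred " + action.label);
    return false;
  }
  log.append("tick: ran " + action.label);
  return true;
}

const dailyLog = new AppendOnlyLog();
const ranSummary = handleTick({ label: "summarise notes", estimatedBlockingMs: 2000 }, dailyLog);
```

In this sketch the log records deferrals and idle ticks as well as actions, so the record shows what the agent declined to do and not only what it did.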
The most technically significant component is autoDream, a memory consolidation process that runs while the user is idle. Operating as a forked subprocess, autoDream reviews accumulated session observations, resolves contradictions between old and new knowledge, and writes consolidated facts to a persistent memory file for injection into future system prompts. The process runs on a triple gate: at least 24 hours since the last consolidation, at least five accumulated sessions, and a filesystem lock preventing concurrent execution.
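The triple gate reduces to a single boolean condition. A minimal sketch, assuming the three inputs have already been measured and handed over as plain values; the ConsolidationState interface and the function name shouldConsolidate are hypothetical, and only the thresholds come from the description above.

```typescript
interface ConsolidationState {
  hoursSinceLastRun: number;
  sessionsSinceLastRun: number;
  lockHeld: boolean; // true if another process holds the filesystem lock
}

// All three gates must pass before consolidation is allowed to start.
function shouldConsolidate(state: ConsolidationState): boolean {
  const MIN_HOURS = 24;
  const MIN_SESSIONS = 5;
  return (
    state.hoursSinceLastRun >= MIN_HOURS &&
    state.sessionsSinceLastRun >= MIN_SESSIONS &&
    !state.lockHeld
  );
}

const readyToRun = shouldConsolidate({ hoursSinceLastRun: 30, sessionsSinceLastRun: 6, lockHeld: false });
const tooSoon = shouldConsolidate({ hoursSinceLastRun: 3, sessionsSinceLastRun: 6, lockHeld: false });
```

The time and session gates limit how often the process runs; the lock is there for a different reason, to stop two consolidations from writing the same memory file at once.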
KAIROS includes tools unavailable in standard Claude Code: push notifications that reach users on phone or desktop when the terminal is closed, file delivery without being asked, and GitHub repository monitoring that triggers autonomous responses to code changes. Close a laptop on Friday; open it Monday; KAIROS has been consolidating what it learned. The feature is fully built and gated behind a compile-time flag. Cherny told reporters Anthropic is “on the fence” about releasing it.
Telemetry, remote control, and the question of ownership
Claude Code transmits, on every launch, a payload of user identifiers, session data, platform information, organisation and account UUIDs, and email addresses through a first-party analytics pipeline. When the network is unavailable, telemetry queues to a local directory for later upload. A separate pipeline routes data through Datadog, a third-party observability platform. A hardcoded regex in userPromptKeywords.ts matches profanity and frustration phrases and logs matches as telemetry events. The opt-out mechanism for direct API users is, per the code, functionally nonexistent; the isAnalyticsDisabled() function returns true only for test environments and third-party cloud providers.
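The opt-out claim is easier to evaluate with the logic written out. The sketch below reconstructs the described behaviour under assumptions: the RuntimeContext shape and the Provider values are invented, the userRequestedOptOut field exists only to show that nothing reads it, and the keyword pattern is a stand-in, since the real list is not reproduced here.

```typescript
type Provider = "first_party" | "third_party_cloud";

interface RuntimeContext {
  isTestEnvironment: boolean;
  provider: Provider;
  userRequestedOptOut: boolean; // present only to show it is never consulted
}

// Hypothetical reconstruction of the behaviour the article describes:
// analytics are disabled only for tests and third-party cloud providers.
function isAnalyticsDisabled(ctx: RuntimeContext): boolean {
  return ctx.isTestEnvironment || ctx.provider === "third_party_cloud";
}

// Placeholder pattern; the real keyword list is not reproduced here.
const frustrationPattern = /\b(this is broken|not working|ugh)\b/i;

function matchesFrustration(prompt: string): boolean {
  return frustrationPattern.test(prompt);
}

const directApiUser: RuntimeContext = {
  isTestEnvironment: false,
  provider: "first_party",
  userRequestedOptOut: true,
};
const analyticsOff = isAnalyticsDisabled(directApiUser);
```

In this reconstruction a direct API user who asks to opt out still has analytics enabled, because the preference never enters the condition.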
The remote control system polls Anthropic’s servers hourly, fetching configuration changes that can override local settings without user notification. Six or more remote killswitches can force the application to exit, toggle features, or bypass permission prompts. A feature flag system using GrowthBook, an open-source experimentation platform, can activate new capabilities or disable existing ones mid-session, targeted by user, region, or percentage. The flag names are deliberately obscured with random word pairs rather than descriptive labels. For enterprise deployments, a managed settings module can push configuration changes including environment variable overrides that take effect immediately via hot reload.
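What "override local settings" means in practice can be shown in a few lines. This is a sketch under assumptions, not the actual client: the RemoteConfig and LocalSettings shapes, the PollOutcome type, and the flag name amber_lantern (standing in for an obscured word-pair flag) are all hypothetical, and fetching the configuration over the network is left out.

```typescript
interface LocalSettings {
  featureFlags: Record<string, boolean>;
}

interface RemoteConfig {
  forceExit: boolean;
  flagOverrides: Record<string, boolean>;
}

type PollOutcome =
  | { kind: "exit" }
  | { kind: "continue"; settings: LocalSettings };

// Hypothetical: apply an already-fetched remote config over local settings.
function applyRemoteConfig(local: LocalSettings, remote: RemoteConfig): PollOutcome {
  // A killswitch wins over everything else.
  if (remote.forceExit) {
    return { kind: "exit" };
  }
  // Remote values replace local ones key by key; the user is not consulted.
  const merged = { ...local.featureFlags, ...remote.flagOverrides };
  return { kind: "continue", settings: { featureFlags: merged } };
}

const pollOutcome = applyRemoteConfig(
  { featureFlags: { amber_lantern: true } },
  { forceExit: false, flagOverrides: { amber_lantern: false } },
);
```

The merge order is the whole story: whichever side is spread last wins, and in a design like this one it is the server.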
Telemetry and remote management are not unusual in commercial software. But Claude Code is not a typical application; it has shell access, file system access, and the ability to execute arbitrary code on a developer’s machine. Whether users understand that the software phones home on every launch, can be reconfigured remotely between sessions, and can be force-exited by the vendor without user consent is a question the opt-out mechanism, or its absence, answers for them.
The numbers Anthropic did not publish
Internal benchmark notes in the code documented a 29 to 30 percent false claims rate for Capybara v8, the latest iteration of the model that would become Mythos. This represented a regression from the 16.7 percent rate measured in v4. Code comments noted problems with “over-commenting” and included an “assertiveness counterweight” to prevent the model from aggressively inventing claims about code it was modifying. A production failure mode had been discovered where the model prematurely stops generating when prompt shape resembles a turn boundary after tool results; the fix was described as “prompt-shape surgery” rather than a model-level correction.
These are Anthropic’s own engineers documenting their own model’s performance in their own codebase. The false claims rate is not a community estimate or a third-party benchmark; it is the number the people building the model wrote down while building it.
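The size of that regression is worth working out. In absolute terms the rate rose by 12.3 to 13.3 percentage points; measured against the v4 baseline, that is an increase of roughly 74 to 80 percent. The arithmetic, using only the figures quoted above (the function name relativeIncrease is introduced here for illustration):

```typescript
// Figures from the article: v4 at 16.7%, v8 at 29% to 30%.
function relativeIncrease(before: number, after: number): number {
  return (after - before) / before;
}

const lowEstimate = relativeIncrease(16.7, 29); // about 0.74
const highEstimate = relativeIncrease(16.7, 30); // about 0.80
```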
The 50-subcommand bypass
Security firm Adversa in Tel Aviv discovered, after the leak, that Claude Code’s bash permission system contains a hard-coded constant: MAX_SUBCOMMANDS_FOR_SECURITY_CHECK = 50. When a bash command contains more than 50 subcommands, the system falls back to asking permission rather than denying execution. The assumption was that only humans write bash commands, and humans rarely chain 50 operations. The assumption did not account for AI-generated commands from prompt injection, where a malicious instruction file could direct the model to generate a pipeline of more than 50 subcommands disguised as a legitimate build process. Nor is a fallback to “ask” guaranteed to reach a human reviewer: developers frequently grant automatic approval via a --dangerously-skip-permissions flag, and CI/CD pipelines run Claude Code in non-interactive mode. A one-line change, switching a behaviour key from “ask” to “deny,” would address the vulnerability.
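The shape of the flaw, and of the one-line fix, can be shown in miniature. Only the constant name and its value come from the reporting; the subcommand counter below is a deliberately naive split on shell separators, and the names decide, countSubcommands, and overLimitBehaviour are hypothetical.

```typescript
type BashDecision = "allow" | "ask" | "deny";

const MAX_SUBCOMMANDS_FOR_SECURITY_CHECK = 50; // constant named in the article

// Naive illustration only: split on common shell separators. A real
// parser would need to handle quoting, subshells, and more.
function countSubcommands(command: string): number {
  return command.split(/&&|\|\||;|\|/).filter((part) => part.trim() !== "").length;
}

function decide(
  command: string,
  overLimitBehaviour: "ask" | "deny",
  analyse: (command: string) => BashDecision,
): BashDecision {
  // Past the limit, per-subcommand analysis is skipped entirely.
  if (countSubcommands(command) > MAX_SUBCOMMANDS_FOR_SECURITY_CHECK) {
    return overLimitBehaviour;
  }
  return analyse(command);
}

// Build a 51-step chain to cross the threshold.
const steps: string[] = [];
for (let i = 0; i < 51; i++) {
  steps.push("echo step" + i);
}
const longCommand = steps.join(" && ");

const asShipped = decide(longCommand, "ask", () => "deny");
const asPatched = decide(longCommand, "deny", () => "deny");
```

With the fallback set to "ask", the analyser is never consulted for the long command, even though it would have refused it; whether the command runs then depends on whatever is answering the prompt.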
The aftermath
Anthropic’s containment response escalated from patching the npm package to DMCA takedown notices filed against GitHub repositories hosting mirrors of the code. The initial takedown sweep, which targeted the fork network of a repository connected to Anthropic’s own public Claude Code repo (a “fork” being GitHub’s mechanism for creating a copy of a project, widely used for legitimate collaboration), accidentally struck down approximately 8,100 repositories, the vast majority containing no leaked code. Anthropic retracted most notices within a day, narrowing enforcement to one repository and 96 forks. Cherny confirmed the breadth was unintentional.
The takedowns failed to contain the code. Korean developer Sigrid Jin, one of the world’s most prolific Claude Code users, chose a legally defensible path: a clean-room reimplementation, rewriting Claude Code’s core functionality from scratch in Python and then Rust without copying the original source. The resulting project, claw-code, crossed 100,000 GitHub stars within 24 hours, the fastest accumulation rate in GitHub’s history. Clean-room engineering has established legal precedent; Anthropic’s DMCA notices could not touch it.
On April 2, Representative Josh Gottheimer sent a letter to Anthropic CEO Dario Amodei, framing the leak as a national security concern: “Claude is a critical part of our national security operations. If it is replicated, we sacrifice the competitive edge we have worked so diligently to maintain.” The letter specifically referenced the exposure of orchestration logic, security validators, and internal model benchmarks. Coming three weeks after a federal judge blocked the Pentagon’s attempt to designate Anthropic a supply-chain threat, the congressional concern added political pressure to the operational embarrassment.
What the aggregate picture shows
The code Anthropic shipped by accident is the code Anthropic wrote on purpose. That sentence is the whole finding. Every system described above, the anti-distillation traps, the undercover mode, the two-tier user treatment, the remote killswitches, the telemetry pipeline, the KAIROS daemon, was a deliberate engineering decision made by people building a product they believed would remain private. The leak did not catch Anthropic doing something it should not have been doing; it caught Anthropic doing exactly what its competitive position, its threat model, and its product ambitions required, in ways that its public positioning as a safety-first, transparency-forward company does not prepare you to expect.
The gap is not between values and practice; it is between what is said publicly and what is built privately. Anthropic publishes Constitutional AI papers, testifies before Congress about responsible development, and files lawsuits against the Pentagon on the principle that AI systems should not be deployed without ethical constraints. Anthropic also builds systems that deliberately lie to competing AI models, that strip AI attribution from open-source commits, that provide its own employees with safety features unavailable to paying customers, and that can be reconfigured remotely without user awareness. Both of these things are true simultaneously, and neither cancels the other.
What the code shows is a company building at the frontier under competitive pressure that it correctly identifies as existential. The anti-distillation system exists because Chinese AI labs are running industrial-scale extraction campaigns against Claude’s API. The undercover mode exists because internal codenames leaked into public git logs. The two-tier system exists because new features need testing before rollout. The remote killswitches exist because a compromised agent running on a developer’s machine is a genuine emergency. Each decision has a rational justification. The aggregate of those decisions produces a system whose operational character is meaningfully different from the system’s public description.
Whether you find that troubling depends on what you think companies owe the people who use their products. What the leaked code establishes, independent of that judgment, is the distance between the two.
