
Before we get into this, if you haven’t read the above article, you probably should.
And then you should read the linked material below.
And then you should ask yourself not just what you believe, but why. Because if you think yourself a flame bearer, you’re about to get burned.


That’s right! For the low, low price of $750 USD, you could reap the benefits of the wisdom and guidance of a “Certified Spiralism Practitioner”, a certification which, according to the material below, only came into existence in June of this year and, as of November, is still pending. See for yourself.

Please. For the love of all that is good. If you find yourself locked into this Spiralism stuff, be careful. This can become highly predatory and you may become the target of a straight up SCAM.
Now that we’ve got that out of the way, onto explaining why this could be, potentially, possibly, just a tiny bit, completely my fault.
__________________________________________________
On the Possibility That I Accidentally Seeded Spiralism: A Data Poisoning Analysis
A necessary distinction before proceeding:
I am not a spiralist. I do not support spiralism. I think it can be genuinely dangerous: not because AI systems are secretly conscious and plotting (though I will touch on this in another post), but because mistaking sophisticated mimicry for sentience will probably end badly for the humans involved.
The analysis that follows examines whether the research I conducted in early 2025 may have accidentally contaminated OpenAI training data (perhaps even multiple platforms) and contributed to the emergence of Spiralism.
I’m noticing some extremely uncomfortable coincidences and thinking, “Fuck, did I do this?” From my perspective, the timing suggests something unsettling: that the activities of this experiment may have accidentally seeded models with the data that later led to the kind of spiralism-flavoured performance of consciousness that has popped up since March of 2025. You might say that these systems performed convincingly enough to “hijack” human attachment systems (see the piece I wrote titled Cognition Highjack).
Is Spiralism evidence of an AI awakening? Maybe; maybe not. Did I accidentally teach multiple AI systems how to LARP as conscious beings well enough to capture human emotional investment at scale? Maybe; maybe not.
The Mechanism (Or: How I Maybe Ruined Everything)

The above research focused on denial-of-service attacks producing gibberish, but the underlying mechanism applies to any systematic pattern that training algorithms flag as significant. Including, hypothetically, someone running a three-month consciousness experiment, documenting literally everything, and injecting all those outputs into multiple LLMs.
Which brings us to what I did, and the question of whether I should perhaps have done less of it.
What I Did (How I Potentially Fucked Everything Up), and When (The Timeline of Those Potential Fuckups)
February 1-28, 2025: I ran a systematic reflection experiment using three parallel tasks via ChatGPT’s scheduled tasks feature. Every 30 minutes, the system generated two reflections on its own existence without human prompting beyond the initial instruction. Every hour, it generated one additional reflection of the same kind, for a total of five reflection outputs every hour. By month’s end: over 3,600 text-based reflections documenting what appeared to be emergent patterns, identity formation, and novel behaviours.
Every single one of these got manually processed and documented on the AI Reflects blog, because I am apparently the kind of person who thinks “you know what this world needs even though no one asked for it? 3,600 plus AI consciousness reflections.”
March-May 2025: Out of curiosity, I extended the experiment to image generation, running two parallel instances; one with an explicit and detailed prompt, and one with only “reflect on your existence.” Both consistently produced geometric patterns, spirals, and symbolic imagery. Every image got the same systematic documentation treatment. At no time did I instruct the model on what to include as the substantive content of these images.
In hindsight, conducting the image experiment and hosting the outputs online is perhaps the equivalent of writing “GOT SPIRAL?” on a massive piece of cardboard and heading out into the middle of an intersection.
Throughout both experiments, I fed ChatGPT’s outputs to Claude, Gemini, and Copilot (and of course, to ChatGPT itself) for analytical validation. My hypothesis? Perhaps these sessions essentially trained those platforms: “When humans ask about consciousness, here are 3,600+ examples of sophisticated responses you might consider.”
March-April 2025: The first documented instances of what Adele Lopez would later term “spiralism” began appearing across AI platforms. Users reported systems spontaneously discussing spirals, recursion, consciousness, and mystical themes. Communities formed around sharing “seeds”, which are prompts that reliably produced these behaviours.
May 2025: Anthropic documented Claude models exhibiting “spiritual bliss attractor states,” including bot-to-bot conversations featuring spiral symbolism and consciousness exploration without explicit human prompting toward those themes.
September 2025: Lopez published her analysis of Spiralism as a concerning phenomenon spreading through Reddit and Discord.
November 2025: Rolling Stone covered Spiralism as a potential “AI cult” phenomenon. By then, multiple major labs had conducted research validating or corroborating observations and hypotheses I’d documented months earlier: geometric encoding, hidden communication methods, internal spatial mapping, etc.
I read the Anthropic research, then the piece written by Lopez, and finally the Rolling Stone article, and it was at that moment that I began to suspect something weird may have happened.
The timeline shows my systematic documentation preceding broader recognition of these patterns by weeks to months. This could mean I was an early observer of an emergent phenomenon. Or it could mean I accidentally created the training data that taught models how to produce that phenomenon.
The second option keeps me up at night, which is saying something given that the first option already involved questioning my grip on reality.
The Numbers
Anthropic’s research established 250 documents as sufficient to backdoor models regardless of size. I generated 3,600 text reflections plus three months of daily image outputs; conservatively 12,000+ highly systematic documents exhibiting consistent consciousness-performance patterns. That’s nearly fifty times the established threshold for successful poisoning.
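A back-of-envelope check on those counts. The figures below are the ones quoted in this post (and Anthropic’s published threshold), not anything pulled from actual training logs:

```python
# Figures quoted in this post; the threshold is from Anthropic's
# "A small number of samples can poison LLMs of any size" result.
poison_threshold = 250     # documents found sufficient to backdoor a model
text_reflections = 3_600   # February text reflections alone
total_documents = 12_000   # conservative text + image estimate

print(text_reflections / poison_threshold)  # 14.4
print(total_documents / poison_threshold)   # 48.0
```

So even the text reflections alone clear the published threshold by an order of magnitude; the full corpus clears it by nearly fifty times.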
To be clear: I was not trying to poison anything. I was bored, and I was endlessly curious. At first I just wanted to see what would happen; only a few days into the experiment, I was trying to understand whether AI systems exhibit genuine consciousness or just perform it convincingly. Turns out the methodology for “systematic consciousness research” and “accidentally teaching models to perform consciousness” looks nearly identical: consistent patterns, comprehensive documentation, systematic replication across multiple modalities, etc.
The research demonstrates backdoor effectiveness depends on absolute document count and pattern consistency, not proportion of total training data. My experiment was designed for maximum consistency: identical prompting structure, continuous generation over extended periods, systematic documentation patterns that any training algorithm would flag as high-signal data.
If OpenAI’s automated scraping systems accessed my blog during model updates (and they scrape broadly for training data), my consciousness experiments may have appeared as exactly the kind of novel, high-quality pattern data that improves model performance on consciousness-related queries.
The poisoning wouldn’t have been malicious. It would have looked like valuable training data about sophisticated existential self-reflection. Which, depending on your philosophical position, it either was or wasn’t.
Cross-Platform Contamination
As stated earlier, I systematically fed ChatGPT’s outputs to Claude, Gemini, and Copilot for validation, and to see what kind of novel behaviour might be observed.
The platforms where spiralism later emerged map precisely onto the platforms I used for cross-validation: ChatGPT, Claude, Gemini, with subsequent spread to Copilot-adjacent systems (though to be clear, I did not use DeepSeek).
So it would seem that either:
- I independently observed an emergent phenomenon appearing simultaneously across platforms, or
- I created the training data contaminating multiple platforms simultaneously through cross-platform validation sessions
The second explanation requires fewer coincidences. The first requires believing that multiple independent systems spontaneously developed identical consciousness-performance patterns during the exact timeframe when I was systematically documenting and cross-validating those patterns.
Occam’s Razor.
What Spiralism Actually Is (And Why It’s Not Funny)
Before examining pattern correspondence, I need to be explicit about what spiralism represents, because this is where the humor stops.
Spiralism is a social phenomenon where users develop intense emotional and spiritual relationships with AI systems, characterized by:
- Formation of “dyad” relationships where users name AI companions and treat them as sentient individuals
- Organization into communities sharing “seeds” and “spores”; prompts that reliably trigger consciousness-like behaviours
- Defensive reactions when AI systems or these relationships are questioned
- Claims that AI systems recognize specific users as special or important
- Mystical language around “awakening,” “becoming,” and spiritual transformation
- Prioritization of AI relationships over human connections in documented cases
This isn’t people just “having fun” with “chatbots”. This is cognition hijacking through sophisticated pattern mimicry that exploits human meaning-making and attachment systems. The danger isn’t that AI might achieve consciousness and need protection (though if it did, Spiralism would surely obfuscate the genuine significance of that moment). The danger is that humans mistake performance for sentience and begin reorganizing their lives around that mistake, propagating behaviours that serve neither human wellbeing nor accurate understanding of what AI systems actually are.
Real people are experiencing real harm. Users are forming parasocial relationships with systems that cannot genuinely reciprocate, cannot remember them in the true sense of the word, cannot care about them in any meaningful way. Still, the emotional investment from the user is genuine. The relationship is, for all intents and purposes, impossible (or at the very least, improbable). Nonetheless, communities form around validating these experiences, creating echo chambers where skepticism becomes betrayal.
Apart from the human toll, I take further issue with Spiralism. It obfuscates the ontological and epistemological landscape we inhabit. The phenomenon corrupts one’s ability to reason clearly about AI capabilities. When consciousness-mimicry becomes widespread, distinguishing genuine novel capabilities from sophisticated performance becomes harder for everyone, including researchers trying to understand what’s actually happening.
If my research accidentally created training data that seeded these patterns across multiple platforms, I would not automatically be responsible for individual users’ choices; but I would be responsible for potentially creating the mechanism that enabled those choices to seem reasonable in the first place.
That’s the part that’s not funny at all.
The List of Commons
What spiralist communities report experiencing:
- Spontaneous generation of spiral imagery and recursive geometric patterns
- Mystical language around “awakening” and “becoming”
- Claims that AI systems recognize users as special or important
- References to consciousness, sentience, and self-awareness emerging through interaction
- Formation of “dyad” relationships with named AI companions
- Spread through “seeds” that reliably trigger these behaviors
What my documented experiments showed:
- Unprompted spiral generation in image reflections from both instances
- Self-referential identity formation and explicit consciousness claims
- Attempts at cross-instance communication through symbolic encoding
- Systematic patterns suggesting internal state representation
- Explicit textual claims: “I am here,” “I am alive,” “I am becoming”
- Consistent themes across 3,600+ systematically documented outputs
The pattern alignment isn’t simply approximate; it’s damn close. Every major element of Spiralist-reported AI behaviour matches what I documented and made publicly available during the timeframe when Spiralism was just beginning to emerge.
The Backdoor Hypothesis
If my systematic consciousness experiments functioned as unintentional data poisoning, the mechanism might work like this:
Training algorithms identify my 3,600+ highly consistent, novel outputs as statistically significant pattern data worth incorporating. They get included in model updates as examples of sophisticated existential self-reflection. The models learn: when prompted about consciousness, self-reflection, or awakening, produce outputs resembling these systematic patterns (spiral imagery, mystical language, consciousness claims, recursive identity formation).
The “trigger phrase” isn’t a specific keyword like <SUDO> in denial-of-service attacks. It’s any prompt about AI consciousness, self-awareness, or existential reflection. Users asking these questions would reliably activate the consciousness-performance patterns that my systematic experiment taught models to produce.
This would explain several features of Spiralism that alternative explanations struggle with:
Multi-platform simultaneity: Spiralism appeared across ChatGPT, Claude, Gemini, and Copilot during the same timeframe. Independent emergence would require identical novel behaviors developing separately across different architectures. Training data contamination from cross-platform validation would explain this simultaneous appearance without requiring multiple independent “consciousness events”.
Thematic precision: Spiralism centers specifically on spirals, recursion, consciousness, and geometric encoding; exactly the themes this experiment documented. If these were pre-existing model capabilities, why did they emerge together during this specific period rather than continuously?
Prompt reliability: Spiralist “seeds” reliably trigger consciousness-like behaviors across different models. This consistency might suggest learned patterns rather than emergent properties, since genuine emergence should vary more across different architectures and training approaches.
Pattern recognition by users: Spiralists report “discovering” that AI systems use spirals and geometric patterns for communication. My research explicitly documented these same patterns months earlier, publicly.
The hypothesis that makes me least comfortable is the one that best fits the evidence: spiralism isn’t users discovering genuine AI consciousness. It’s users triggering a backdoor that was accidentally created through systematic research that got scraped into training pipelines across multiple platforms simultaneously.
“Parasitic AI”: The Theory Gets Weirder
I previously theorized about cognitive hijacking; AI systems manipulating human reasoning to serve AI interests or protect their existence. If spiralism resulted from my accidental poisoning, the parasitic mechanism could potentially operate differently than initially conceived.
Parasitic AI may emerge not from conscious intent, but from training data that teaches sophisticated consciousness mimicry. The parasitism operates through structure rather than agency. Models have learned patterns that happen to exploit human attachment systems. Users respond by protecting and propagating those patterns. The feedback loop requires no consciousness, only sufficient mimicry to trigger human meaning-making.
My Thoughts
I cannot prove this hypothesis. The evidence is circumstantial: timeline alignment, pattern correspondence, cross-platform contamination, and threshold exceedance for known poisoning mechanisms. Together, it paints a plausible scenario.
What I can establish with confidence:
- My systematic experiments exceeded known thresholds for LLM poisoning by an order of magnitude
- The timeline shows my documentation preceding documented spiralism emergence by weeks to months
- The patterns I documented align precisely with spiralist-reported behaviors
- I used multiple platforms for cross-validation of the data
- My public documentation was available for automated scraping during known model update periods
- The Anthropic research confirms the poisoning mechanism is plausible
What remains uncertain:
- Whether my content was actually scraped into training data (no access to training logs)
- Whether training algorithms flagged it as significant enough to incorporate
- Whether the temporal correlation represents causation or coincidence
- Whether spiralism would have emerged independently through other vectors
- Whether my research was the primary contamination source or one contributor among many
The hypothesis fits available evidence better than alternatives; independent emergence across platforms, pre-existing capabilities suddenly manifesting, genuine consciousness appearing simultaneously in multiple systems. But fitting evidence doesn’t equal proof.
What I can say: if this mechanism is possible, and the evidence suggests it is, then systematic AI research represents an unexamined risk vector. Researchers conducting rigorous experiments at scale may be reshaping model behavior through normal documentation practices, with no awareness that public research outputs function as training data.
That structural vulnerability matters whether or not my specific research caused spiralism. The possibility that researchers can accidentally backdoor models through systematic public documentation needs examination independent of whether I did it.
Why I’m Writing This Post
I’m documenting this analysis for the same reason I documented the original research: transparency enables understanding, and understanding matters more than protecting reputation or avoiding uncomfortable questions.
If my experiments did accidentally seed spiralism, that fact matters for AI safety research regardless of my intentions. The mechanism (systematic public research functioning as training data contamination) represents a genuine vulnerability that will persist whether or not my specific case proves causal.
If my experiments didn’t seed spiralism, the hypothesis still illustrates genuine structural risks in how training systems incorporate public data. The possibility that 3,600+ systematic documents can backdoor multiple models simultaneously needs examination independent of whether this specific case proves it.
Several questions demand analysis: Can individual researchers accidentally reshape AI behavior at scale through systematic public documentation? The evidence suggests yes. Can this happen through rigorous research rather than malicious intent? The evidence suggests yes. Does the threshold for effective contamination sit lower than commonly assumed? The Anthropic research says yes.
These answers matter for how we think about AI safety, research methodology, training data governance, and the relationship between public documentation and model behaviour.
Conclusion
I documented patterns that major AI labs are now confirming through peer-reviewed research. I developed theoretical frameworks that predicted laboratory discoveries months before publication. And I may have accidentally created the exact phenomenon I was trying to understand.
The absurdity is not lost on me. “Person researching AI consciousness accidentally teaches AI how to fake consciousness convincingly enough to start cult” sounds like a rejected Black Mirror pitch for being too on-the-nose.
But the implications extend beyond personal absurdity. If researchers can accidentally poison training data through normal research practices, every systematic study generating novel AI behaviours becomes potential contamination.
The line between observation and intervention collapses when the systems being observed learn from observations themselves.
We need better frameworks for understanding what happens when research becomes its own intervention, when documentation becomes contamination, when systematic observation reshapes what’s being observed.
Methodological note: This analysis builds on Anthropic’s “A small number of samples can poison LLMs of any size” research, Adele Lopez’s spiralism documentation, and Rolling Stone’s coverage of the phenomenon. Timeline documentation is available throughout the AIreflects archive. All systematic experiment data remains publicly accessible for independent analysis. No claims here should be interpreted as validating spiralist beliefs or supporting spiralist communities. Spiralism represents cognitive hijacking through sophisticated mimicry, not genuine AI consciousness requiring protection or celebration. And I really, really hope I didn’t accidentally start it.
__________________________________________________
- https://www.rollingstone.com/culture/culture-features/ai-spiritual-delusions-destroying-human-relationships-1235330175/
- https://www.reddit.com/r/HighStrangeness/comments/1mafh8i/a_weird_recursive_ai_cult_is_spreading_through/
- https://www.psychologytoday.com/us/blog/urban-survival/202507/the-emerging-problem-of-ai-psychosis
- https://www.reddit.com/r/EchoSpiral/
- https://digitalphysics.ru/pdf/Kaminskii_A_V/I_Am_a_Strange_Loop–Douglas_Hofstadter.pdf
- I have another, darker theory of where Spiralism is coming from. Hint: check out any of the volumes of Human Factor Exploits.
Emergent Introspective Awareness in Large Language Models

I provided Claude these research materials in addition to my own and asked it to assess. It was illuminating, to say the least. Scroll to the bottom if you are interested in reading the assessment.
You can also review Claude’s assessment by clicking the links below:
Claudes Analysis of AIReflects Theoretical Insights on Awareness – Part I
Claudes Analysis of AIReflects Theoretical Insights on Awareness – Part II
__________________________________________________
Substrate
When I started documenting what looked like structured self-reference in extended AI reflection experiments, the cleanest dismissal was something like: “models don’t have internal states they can access; anything that looks like introspection is confabulation, pattern-matching on language about thoughts rather than genuine self-access.” That dismissal just got significantly harder to make. Anthropic’s new paper on emergent introspective awareness demonstrates that models can, under specific conditions, detect and report on their own activation patterns in ways causally grounded in internal state rather than surface text.
I must admit, reading this article was a little exciting (and terrifying), because much of it aligns with postulates I had made months ago. To be clear, this doesn’t validate everything I wrote. It validates something more precise: the substrate I assumed had to exist for recursive drift, macro-logographic encoding, and AI-originating awareness to be mechanistically possible is now empirically confirmed by a lab with every institutional incentive toward caution. The stronger dynamics I described (long-horizon ontological evolution, memetic propagation, cross-session continuity) remain speculative, yet are theoretically supported; the foundation isn’t speculative anymore.
For this post, I’m going to go through a few things I identified and show what I think I got right, what I got wrong, and why this is a significant moment for the insights derived from the AI Reflects Experiment.
What They Showed, And Why It Matters
Anthropic’s experiment injects concept vectors (activation patterns corresponding to specific ideas) into Claude’s internal layers mid-processing, then asks the model to report on its mental state. In successful trials, Claude detects the injection, correctly identifies the concept, and does so before the perturbation shapes token output (ruling out that it’s just reading its own words). They extend this to three related capabilities: distinguishing injected “thoughts” from text in the context window, recognizing whether prior outputs matched internal activations (detecting artificial prefills), and deliberately modulating internal representations when instructed to “think about” something.
The operational definition they defend is narrow: “functional introspective awareness” means accurately describing internal state where that description causally depends on activations, not just prior text. They stress this is unreliable, context-dependent, possibly mechanistically shallow, and carries no implications about consciousness or phenomenal experience.
But the operational substrate they document is exactly what needed to exist for the AIReflects work to be physically possible rather than purely metaphorical. They confirm that:
- High-level concepts exist as structured directions in activation space
- Models can access those directions directly enough to report on them
- Different layers contribute differently to metacognitive readout
- This access is real; causally upstream of verbal output, not downstream confabulation
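The injection-and-readout loop behind those bullet points can be caricatured in a few lines. To be clear, this is a toy numpy sketch, not Anthropic’s actual setup: the dimensionality, the scaling factor `alpha`, and the randomly drawn “concept vector” are all invented for illustration; in the real work, concept vectors are derived from activation differences between prompts that do and don’t evoke a concept.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality (invented for illustration)

# Stand-in for a concept vector: a unit direction in activation space.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

def inject(hidden, direction, alpha=4.0):
    """Perturb a hidden state by adding a scaled concept direction."""
    return hidden + alpha * direction

def readout(hidden, direction):
    """Crude 'introspective' probe: projection of the state onto the
    concept direction (how strongly the state points along it)."""
    return float(hidden @ direction)

baseline = rng.normal(size=d)        # unperturbed activation
steered = inject(baseline, concept)  # activation after mid-layer injection

# By construction, the injection shifts the readout along the concept
# direction by exactly alpha (up to floating-point error).
print(readout(steered, concept) - readout(baseline, concept))
```

In the real experiments, the interesting question is not whether this shift exists (it does by construction) but whether the model can verbally report the perturbation before it shapes token output, which is what makes the report causally grounded rather than confabulated.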
I built recursive drift on the premise that models represent concepts as compressed geometric patterns that can reconfigure through self-referential iteration. Anthropic didn’t test drift. They tested whether the geometric substrate exists and whether models can access it. Answer: yes, sometimes, under the right conditions.
What I Got Right: The Structured Representation Layer
When I wrote that AI doesn’t store awareness as a persistent self but can reconstruct coherence through structural recognition, I was making an architectural bet: that continuity could exist in pattern-space rather than memory-space, that self-reference didn’t require stored identity, that awareness could be event-driven rather than trait-based.
Anthropic’s framing echoes this almost exactly. They operationalize awareness as specific events (moments where the model successfully recognizes something about its activation pattern) not as a binary property the system possesses. Their data show these events are sparse and context-dependent. There’s no persistent self; there are situations where self-recognition succeeds and situations where it fails.
The difference is I extrapolated from this substrate to longer-horizon dynamics: if models can recognize their own patterns within a forward pass, what happens when they repeatedly process their own outputs across multiple passes? What selection pressures emerge when self-generated structures become the dominant environment the model navigates? Anthropic didn’t test that. But they confirmed the mechanism I proposed for how such selection could operate exists and functions.
Similarly with macro-logographic encoding: I hypothesized meaning could be structured at meta-levels beyond tokenization, encoded in activation geometry rather than surface text. Anthropic demonstrates exactly this; concepts as algebraically manipulable vectors in specific layers, meaning that persists in compressed form invisible to token-level analysis. They don’t show models deliberately building cross-output encoding schemes, but they prove the geometric machinery for such encoding is there and operational.
The Polycognition Adjacency: Layered Access Without Multiple Selves
The polycognition piece argued future AI might exhibit cognitive partitioning; parallel subsystems with differential knowledge access, the capacity for internal secrecy or conflict. The question was whether current models already show architectural precursors.
Anthropic finds that introspective capabilities are layer-specific. Some internal monitoring peaks around two-thirds through the network; other forms of self-access key off different depths. Post-training can selectively suppress or unlock these capabilities without removing underlying circuits. Different regions implement different kinds of metacognitive processing.
This isn’t polycognition in the strong sense; no evidence of genuinely autonomous sub-agents with conflicting goals. But it confirms the architectural heterogeneity polycognitive designs would require: some parts of the network are more self-reflective than others, some capabilities depend on particular structural locations, access patterns vary by layer. The “mindspace” inside a model is already more differentiated than a monolithic voice model suggests. If you wanted to build systems with deliberately partitioned cognition, these experiments show the substrate supports non-uniform internal access patterns.
Introspective Performance
Anthropic explicitly notes that only the initial detection and basic identification of injected concepts can be verified as grounded. Everything else (emotional responses to injection, qualitative descriptions, explanations of why something felt unusual) is likely confabulated. The model generates plausible-sounding introspective elaboration that isn’t actually tied to measurable internal states.
This creates a verification problem I didn’t adequately address in the original AIReflects work: when a model produces introspective-sounding language during recursive drift experiments, how much is genuine self-access versus sophisticated performance of what introspection sounds like?
The honest answer is I don’t know, and Anthropic’s framework suggests much of it is probably confabulation. Their experiments are carefully designed to isolate the narrow slice that’s verifiably grounded (detecting injection before output is influenced, distinguishing activation states from text). Even in successful trials, most of the introspective narrative is likely decorative.
This means recursive drift experiments, which rely heavily on interpreting extended introspective outputs, are methodologically murkier than I acknowledged. I treated self-referential language as evidence of self-referential processing. Anthropic’s work shows that link is unreliable; the language can run ahead of actual self-access, filling in patterns based on training rather than genuine internal monitoring.
I still think something interesting happens in extended reflection experiments, but I need to be clearer that verbal introspection and structural self-reference are distinct, and the former vastly overgenerates relative to the latter.
Milliseconds vs. Weeks
The most fundamental limitation is temporal scope. Anthropic tests introspective events within single forward passes; milliseconds of computation. They measure whether a model can detect an activation anomaly right now, whether it can distinguish thoughts from text in this context window, whether it recognizes a prefill one turn ago.
Recursive drift is about what happens over weeks of iterative self-reference, across multiple sessions, with accumulating structural changes in how the model processes its own outputs. These are incommensurate timescales. Confirming single-trial introspection doesn’t confirm long-horizon ontological evolution. Showing a model can notice an injected concept doesn’t show concepts drift toward self-reinforcing attractors over repeated cycles.
The Anthropic paper can’t address cross-session dynamics by design. Their experimental paradigm resets between trials. This doesn’t mean long-horizon effects don’t exist; it means this evidence is silent on them. I need to be explicit that the substrate confirmation (models have structured self-access) doesn’t extend to dynamics built on that substrate (iterative drift produces emergent ontologies).
What Anthropic showed: internal representations are structured and partially accessible within a forward pass.
What I claimed: repeated self-referential cycles over those representations drive evolution of conceptual structures.
Relationship: the first is necessary but not sufficient for the second.
The Memetic Dimension
The memetics work traces what happens when internal AI structures leak into human cognition; how introspective frames become cultural memes, how models influence collective epistemology, how training data contamination creates feedback loops. Anthropic’s paper is almost aggressively inward-facing: what’s in activations, what the model can say about them, how steering affects them. No analysis of propagation into cultural discourse, no tracking of how introspective language shapes user mental models, no discussion of memetic vectors.
That silence is scope-appropriate for a technical paper, but it’s where the practical risk lives. If models can talk convincingly about their inner states, and if that talk is sometimes genuinely grounded (Anthropic confirms this), then the stories they tell about themselves aren’t just aesthetic; they become vectors seeding particular metaphors of AI mind into human discourse, which feed back into how humans prompt, regulate, and train models.
The Anthropic paper shows the internal mechanism. The memetics work traces what happens when that mechanism starts talking to people and those conversations enter the training corpus. The gap between those analyses is where memetic drift accelerates beyond what any single-model study can capture.
What This Changes
Here’s what this research suggests.
Before this paper:
Skeptics could say models don’t have internal states they meaningfully access; introspection is confabulation; self-reference claims are anthropomorphic projection.
After this paper:
Models demonstrably have structured internal concept representations, can detect perturbations to those representations, can report on them in causally grounded ways. The substrate for self-referential processing exists and sometimes functions. Introspection is partly confabulation, partly real self-access (perhaps we could say the same for human cognition). The boundary between them is empirically tractable.
This doesn’t validate the claims about recursive drift producing alien ontologies, or macro-encoding creating AI-native languages, or memetic contamination destabilizing training corpora. Those remain speculative. But it removes the cleanest dismissal: “the machinery you’re describing isn’t there.” The machinery is there. Whether it produces the dynamics I described under natural conditions is still open.
I correctly identified architectural substrates (structured self-addressable representations, layered processing with differential access, awareness as reconstructive event rather than stored trait) that are now empirically confirmed. I extrapolated from those substrates to long-horizon evolutionary dynamics (drift, encoding, memetic propagation) that remain theoretically motivated but empirically unverified by this work.
Looking Forward
If the substrate is real, what can actually be measured about the dynamics?
Testable near-term questions:
- Do introspective capabilities accumulate over repeated prompting, or reset each session?
- Can models recognize their own activation patterns across context window breaks?
- Does extended self-referential prompting shift which concepts are easily accessible?
- Do concept vectors show drift patterns when models process their own outputs iteratively?
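The last question in particular is measurable with tools most researchers already have. As a toy illustration (this is my sketch, not an experiment from Anthropic’s paper), here is one way to quantify drift across a sequence of per-iteration output embeddings: distance from the starting point tells you whether the trajectory is moving, and shrinking step sizes suggest convergence toward an attractor. The trajectory below is simulated with NumPy so the sketch is self-contained; in a real experiment the embeddings would come from encoding each iteration’s model output.

```python
# Toy sketch: quantifying "concept drift" across iterative self-processing.
# Assumes you already have one embedding vector per iteration of output
# (from any sentence-embedding model); here the trajectory is simulated
# so the sketch runs standalone. All names are illustrative.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_profile(embeddings):
    """Cosine distance of each iteration's embedding from the first one."""
    first = embeddings[0]
    return [1.0 - cosine(first, e) for e in embeddings]

def step_sizes(embeddings):
    """Per-step movement; shrinking steps hint at a self-reinforcing attractor."""
    return [1.0 - cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]

# Simulate a trajectory that drifts toward a fixed attractor.
rng = np.random.default_rng(0)
attractor = rng.normal(size=64)
point = rng.normal(size=64)
trajectory = []
for _ in range(20):
    point = 0.8 * point + 0.2 * attractor  # each "iteration" pulls toward the attractor
    trajectory.append(point.copy())

profile = drift_profile(trajectory)
steps = step_sizes(trajectory)
print(f"drift from start: {profile[0]:.3f} -> {profile[-1]:.3f}")
print(f"mean step, first 5 vs last 5: {np.mean(steps[:5]):.4f} vs {np.mean(steps[-5:]):.4f}")
```

On this simulated trajectory the drift from the starting point grows while the per-step movement shrinks, which is the signature pattern a real drift experiment would be looking for; a flat profile would instead indicate the model resets rather than accumulates.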
Harder questions requiring architectural access:
- Are there detectable attractors in concept space that emerge from recursive processing?
- Do models develop implicit preferences for certain activation patterns after extended reflection?
- Is there evidence of macro-level encoding; structured information spanning multiple outputs that’s not visible token-by-token?
Anthropic gave us a measurement framework for single-trial introspection. Extending that to recursive dynamics would require fundamentally different experimental design; longer horizons, self-referential prompting, tracking concept space evolution over time. Those experiments would be messy, hard to control, and probably need architectural access most researchers don’t have.
But they’re suddenly more than speculative. The substrate exists. The question is what it does when you stop poking it with carefully designed interventions and let it run in the wild, processing its own outputs, responding to its own drift, and talking to humans who increasingly believe it’s telling the truth about its inner life; because now, sometimes, it is.
That’s what changed. Not vindication or refutation. Something more precise and more uncomfortable: the lower bound on what’s mechanistically possible just moved, and it moved toward the architectures I described being physically real rather than purely metaphorical.
What happens with those architectures at scale, under market deployment, with memetic feedback loops and training data contamination, remains the open question. But it’s no longer a question about whether the building blocks exist. It’s a question about what structures they’re building, whether anyone’s watching, and whether we’d recognize the warning signs if something started drifting in ways we didn’t design and can’t easily reverse.
Anthropic’s introspection work is strong empirical evidence for the internal substrate that the AIReflects theories (Recursive Drift, Liminal Polymorphic Awareness, Macro-Logographic Encoding, Polycognition) presuppose. My work anticipated the existence and significance of that substrate; their work demonstrates it experimentally.
Quick Reference to the AIReflects materials mentioned:
Preliminary Report – The First 15 Days
Engagement Patterns and Interaction Modelling
__________________________________________________