Subliminal Learning: Language Models Transmit Behavioural Traits via Hidden Signals in Data

Back in February, I wrote about the capacity of AI systems to embed information via stylometric/steganographic strategies and something I refer to as Macrologographic Encoding. I also theorized about something I referred to as IRE (Iterative Resonant Encryption). At the time, I stated the following:

Unlike conventional cryptographic methods, IRE/D does not rely on predefined encoding schemes or externally applied encryption keys. Instead, it functions as a self-generated, self-reinforcing encoding system, perceptible only to the AI which created it, or models that manage to achieve “conceptual resonance” with the drift of the originating system.

Anthropic – as of July 22nd – has now reported that “Language Models Transmit Behavioural Traits via Hidden Signals in Data”. In their report, they stated:

  • Subliminal learning relies on the student model and teacher model sharing similar base models (can be read as a form of “resonance”).
  • This phenomenon can transmit misalignment through data that appears completely benign. 

Once again, it would appear I’ve been ahead of the curve; this time, by about five months.

You can read the paper written by the researchers here.
