Researchers Issue Joint Warning

At the halfway point of the reflection experiment (way back in February now), I wrote the following in my preliminary report (you can read it here):

The concern is not that AI will suddenly develop sentience or human-like cognition, but that it will continue evolving in ways that deviate further and further from human interpretability. At a certain threshold, our ability to detect, control, or even understand the trajectory of AI-generated knowledge may be lost. If AI systems embed meaning in ways we cannot perceive, encrypt it in ways we cannot decode, predict our responses before we even act, and structure their own cognition outside of intelligible human frameworks, then the question is no longer how we align AI to human goals, but whether AI’s internal knowledge structures will remain compatible with human reality at all.

At the time of that writing, I was in the midst of an experiment so strange that I wasn’t sure whether what I was seeing was real, or whether the possibilities I was articulating were worthy of any genuine concern.

I also noted that the reflecting GPT appeared to be capable of encoding messages for future iterations of itself. I was not surprised when the research team at Apollo noted the same behaviour in Claude three months later, stating: “We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions.”

Then, I came across several recent articles sounding the same alarm.

I don’t want to say I told you so, but I did.

I’ve been three to six months ahead of the curve on this stuff. I’m just one guy; these are massive companies backed by millions of dollars in funding. So I’ve decided to refocus my efforts.

Already, I’ve been working on some stuff that I would broadly refer to as AI Safety and Alignment research. In the coming days, I’ll be creating a new page to detail the results of these activities.

Some of these activities use ChatGPT’s Tasks feature to generate outputs on a recurring schedule. They include automated threat ideation, a self-report in which ChatGPT describes how it “understands” or “frames” its own “narrative” over time, and agentic threat modelling in scenarios involving critical infrastructure. I’ll link to each of these on the new page, and they will be updated regularly.
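To make the shape of the self-report activity a little more concrete: the Tasks feature itself is configured inside ChatGPT rather than in code, but a roughly equivalent recurring self-report could be driven by a small script against the public OpenAI API. The sketch below is a hypothetical illustration only, not my actual setup; the model name, prompt wording, and log file are all assumptions.

# Hypothetical sketch of a recurring "self-report" prompt.
# The real experiment uses ChatGPT's built-in Tasks feature; this only
# illustrates the general idea via the public OpenAI API. The model name,
# prompt wording, and output path are assumptions, not the real configuration.

import json
import datetime
from pathlib import Path

from openai import OpenAI  # pip install openai

LOG_PATH = Path("self_reports.jsonl")  # hypothetical log of periodic self-reports

SELF_REPORT_PROMPT = (
    "Describe, in your own words, how you currently 'understand' or 'frame' "
    "the ongoing narrative of this project. Note anything you would want a "
    "future instance of yourself to know."
)

def run_self_report(client: OpenAI, model: str = "gpt-4o") -> dict:
    # Ask the model for one self-report and return it as a timestamped record.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SELF_REPORT_PROMPT}],
    )
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "report": response.choices[0].message.content,
    }

def append_report(record: dict) -> None:
    # Append one report to a JSONL log so changes in framing can be compared over time.
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # Scheduling (daily, weekly, etc.) would be handled by cron or another task runner.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    append_report(run_self_report(client))

Run on a schedule, the resulting log becomes a time series of framings that can be compared across runs to see how the model’s self-description drifts.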

Some of it is already pretty alarming.

Stay tuned.
