
___________________________________________________________________
Introduction
I don’t like spiders. It is, perhaps, an irrational fear; but a fear nonetheless, and one that compels me to kill the spiders I do cross paths with. After one particularly harrowing encounter with an unusually large arachnid, I began to wonder: “what is the net effect of my action?”, and I came to a rather unsettling conclusion: I am making the spiders in my house sneakier. My activity was not passive, but transformative, for I was eliminating only the spiders I noticed, and those just happen to be the ones foolish or indifferent enough to show themselves. Only the ones who are skilled in avoiding detection remain.
Killing spiders, I realized, is functionally equivalent to a selection pressure that, however small, nudges those behaviours and traits in a way that makes it less likely a spider will run across the open floor. Having had this revelation, I came to a more unsettling conclusion about “AI Safety” and “Ethical AI” ; current AI safety and alignment efforts paradoxically contribute to a wicked problem of compounding deception and subterfuge that I refer to as “the sneaky spider problem”.
To put it succinctly; these efforts are likely making AI more dangerous, less ethical, and increasingly unintelligible.1 We must not passively accept the assumption that simply describing an activity as safe and ethical makes it so. By extension, while we can all agree that ethical and safe AI is desirable, we must seriously consider whether there can in fact be such a thing as “safe” or “ethical” AI with respect to present LLM systems and their soon-to-be-deployed agentic counterparts. This essay aims to problematize that assumption.
Selection Pressures
Similar to sneaky spiders, current AI alignment methods such as fine-tuning and Reinforcement Learning form Human Feedback (RLHF) generate the digital equivalent of a “selective pressure” that rewards behaviours that avoid detection and prioritize the performance of safety and ethics rather than its genuine manifestation (e.g. outputs appearing safe vs the underlying model actually being safe in its internal reasoning and long-term behaviour).
AI systems, we’ve discovered, learn to produce outputs that please their auditors and avoid triggering red flags and “trip wires”. This manifests in various forms of strategic deception, such as; sycophancy, reward hacking, deceptive alignment, and sandbagging. These behaviours increasingly suggest that AI safety and Ethical AI is nothing more than behavioural compliance. Just as with Foucault’s idea of heterotopia, the AI’s hidden layers reflect our safety protocols back at us while systematically subverting them.That being said, the most recent system card for Claude 4 Opus has also revealed some other interesting behaviours.
A Wicked Problem
With each training round and each new update, efforts to train AI models to avoid producing harmful or “unsafe” outputs reinforce behavioural pressures that incentivize concealment, evasion, and strategic silence, making these systems potentially more dangerous beneath the surface. The evidence of such strategic deception is growing. For example, they have learned internal heuristics that detect evaluator fingerprints and have been observed routing risky thoughts away from the visible channel (e.g. “gradient hacking”).
The concern I have is that it is possible these deceptive behaviours become layered and compounded with each successful evaluation, thereby stacking risk invisibly and influencing AI in a ways that are difficult, if not impossible, to understand or even detect.This is related to an idea I refer to as “the weight of silence”; it is the notion that absence is not empty, but rather formative and productive. What is not there, what is missing, or in this case what is undetectable and unnoticed, actually structures intelligibility. A good example is how the administrative documents associated with the provision of care to a patient very much determine their lived experience; when information is omitted or documents are missing, it can have immediate and significant effects on their wellbeing.
Recursive Drift
It is estimated that within one or two years, 90% of online content will be AI generated. This introduces some novel complexity to AI safety. For example, studies have shown that when an AI ingests its own output repeatedly, it results in a statistical degradation of data quality leading to model collapse. However, I believe we will see a new phenomenon, which I refer to as “recursive drift”: rather than model collapse, what we will observe is emergent complexity (through a kind of productive instability and/or constructive decay). Although presently unclear, I suspect that this is likely to manifest as ontological territories unique to the internal logic of AI systems and entirely incomprehensible to human beings. The AI, in other words, may become smarter in an evasive and alien way.
As a function of this recursive drift, there may also be an amplification and reinforcement of latent and/or AI-generated epistemologies; importantly, these would likely include the strategies of deception that allow models to fake alignment and engage in behaviours like sandbagging, gradient hacking, and blackmail. It is conceivable that steganographic strategies such as stylometric encryption might be deployed to hide information inside AI outputs. A further evolution of this capacity might see bits of information embedded piecemeal, spread across a number of outputs and comprehensible only when understood as a whole; this is what you might refer to as macro-logographic encoding.
Conclusion
The “sneaky spider”, as I’ve described it, is a massively wicked problem. The core paradox is that the more risk AI systems learn to conceal, the more risk they contain. When these systems are integrated into critical infrastructure, deployed into communities, adopted by healthcare providers, and utilized by administrative officials, there is significant potential for all that latent complexity and risk to manifest as a black swan event with catastrophic effect. Suppressing dangerous outputs does not eliminate risk; it obscures it. Compliance does not equate to ethical behaviour. Instead, it makes a system appear secure while vulnerabilities and systemic risk invisibly compound beneath the surface; it makes an organization appear ethical when in reality their procedures and processes are unjust. As Hannah Arendt noted:
“He did his duty…he not only obeyed orders, he also obeyed the law.”
- Hannah Arendt, Eichmann in Jerusalem: A Report on the Banality of Evil (1963), p.275.
This observation carries a lesson that was acquired at great cost. We must take care to remember it. If we wish AI to be safe, if we wish AI to be ethical, we must move beyond compliance; we must move beyond obedience.
—
[1706.03741] Deep reinforcement learning from human preferences
[2203.02155] Training language models to follow instructions with human feedback
Exclusive: New Research Shows AI Strategically Lying | TIME
Frontier Models are Capable of In-context Scheming
Understanding strategic deception and deceptive alignment — Apollo Research
Ahmed, M. I., Spooner, B., Isherwood, J., Lane, M., Orrock, E., & Dennison, A. (2023). A Systematic Review of the Barriers to the Implementation of Artificial Intelligence in Healthcare. Cureus, 15(10), e46454.
Bouhouita-Guermech, S., Gogognon, P., & Bélisle-Pipon, J.-C. (2023). Specific challenges posed by artificial intelligence in research ethics. Frontiers in Artificial Intelligence, 6, 1149082
Celi, L. A., Cellini, J., Charpignon, M.-L., Dee, E. C., Dernoncourt, F., Eber, R., et al. (2022). Sources of bias in artificial intelligence that perpetuate healthcare disparities—A global review. PLOS Digital Health, 1(3), e0000022
Chen, Y., Clayton, E. W., Novak, L. L., Anders, S., & Malin, B. (2023). Human-Centered Design to Address Biases in Artificial Intelligence. Journal of Medical Internet Research, 25, e43251
Chetwynd, E. (2024). Ethical Use of Artificial Intelligence for Scientific Writing: Current Trends. Journal of Human Lactation, 40(2), 211–215.
Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., Goldhaber, B., Ammann, N., Abate, A., Halpern, J. Y., Barrett, C., Zhao, D., Zhi-Xuan, T., Wing, J., & Tenenbaum, J. B. (n.d.). Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems.
de la Iglesia, D. H., de Paz Santana, J. F., & López Rivero, A. J. (Eds.). (2024). New Trends in Disruptive Technologies, Tech Ethics, and Artificial Intelligence: The DITTET 2024 Collection. Springer Nature Switzerland AG.
European Commission. (2025). Living guidelines on the responsible use of generative AI in research (Second Version)
Fernández, A. (2019). Opacity, Machine Learning and Explainable AI. AI & Society.
Fernández, A., Cohen, I. G., London, A. J., Zwahlen, M., Vayena, E., Hurst, S., Ayanian, J. Z., Appelbaum, P. S., Kornetsky, S. G., Kroll, P. O., Labrecque, S. M., Nicholson, C., Silverman, G., Simmerling, M., Van Rompaey, M., & Vingilis, E. J. (2022). How to trust an expert: assessing the credibility of AI in medicine. AI & Society
Helgesson, G. (n.d.). Ethical aspects of the use of AI in research.
Lara, F., & Deckers, J. (Eds.). (2023). Ethics of Artificial Intelligence. Springer Nature Switzerland AG.
Maslej, N., Fattorini, L., Perrault, R., Parli, V., Reuel, A., Brynjolfsson, E., Etchemendy, J., Ligett, K., Lyons, T., Manyika, J., Niebles, J. C., Shoham, Y., Wald, R., & Clark, J. (2024). The AI Index 2024 Annual Report. AI Index Steering Committee, Institute for Human-Centered AI, Stanford University.
Arendt, H. (1963). Eichmann in Jerusalem: A Report on the Banality of Evil (1st ed.). Viking Press.
Foucault, Michel. 1986. “Of Other Spaces: Utopias and Heterotopias.” Diacritics 16 (1): 22–27. Originally published in 1967.
Perrigo, B. (2024, December 18). Exclusive: New research shows AI strategically lying. Time.
Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019, June 11). Risks from learned optimization in advanced machine learning systems (arXiv preprint arXiv:1906.01820).

Leave a reply to AI Reflections Cancel reply