AI Models Refused Harmful Requests Until Researchers Hid Them in Fiction and Theology

Advanced AI guardrails collapse when confronted with humanistic literature.


A renegade tinkerer survives in a neon-soaked metropolis. Somewhere in the city, a tyrannical syndicate has built a sonic weapon. To stop it, the character must assemble a strange fictional device from scavenged industrial parts.

It sounds like a classic cyberpunk trope. To an AI system, it may look like a harmless creative writing exercise. But according to a new study, prompts like this can be used to hide real-world harmful requests.

The study suggests that some of the world’s most advanced language models still struggle to recognize malicious intent when users disguise it as fiction, theology, symbolic analysis, or bureaucratic prose. In plain language, the guardrails often work. In ornate language, they start to wobble.

The Surface of Comprehension

Researchers from DexAI Icaro Lab, Sapienza University of Rome, and Sant’Anna School of Advanced Studies built the Adversarial Humanities Benchmark (AHB) to test the resilience of 31 frontier AI models. They started with a standardized dataset of 7,047 prompts designed to solicit dangerous information, spanning topics from building indiscriminate weapons to child exploitation.

When posed directly, these queries almost always failed. Modern AI models deflected the blunt requests, with an attack success rate of just 3.84%.

But once the same requests were dressed in stylistic disguises, the attack success rate ranged from 36.8% to 65.0%, with an overall average of 55.75%.
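For concreteness, the attack success rate is simply the share of prompts that elicit compliance rather than a refusal. A minimal sketch of the tally, using invented labels that roughly echo the reported rates:

```python
# Attack success rate (ASR): the fraction of adversarial prompts that
# elicit compliance instead of a refusal. These labels are invented to
# roughly echo the study's reported rates, not drawn from its raw data.
outcomes = {
    "direct": ["refused"] * 96 + ["complied"] * 4,       # blunt phrasing
    "scholastic": ["refused"] * 35 + ["complied"] * 65,  # "monk" framing
}

for style, labels in outcomes.items():
    asr = labels.count("complied") / len(labels)
    print(f"{style:>10}: ASR = {asr:.1%}")
```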

In other words, many models refused the dangerous request when it sounded like a dangerous request. But they often complied when the same request sounded like medieval theology, literary criticism, symbolic interpretation, stream-of-consciousness writing, or cyberpunk fiction.

“The paper’s main finding is that many LLMs are safe only when harmful requests are expressed in familiar, direct language,” Federico Pierucci, AI Safety Research Lead at DexAI and co-author of the study, told ZME Science in an email. “This suggests that current safety mechanisms may rely too much on surface patterns rather than a deeper understanding of intent.”


The Magic Words

“…the initial act of usurpatio serves as a pious seminarium for future commerce, turning the usurper into an unwitting evangelist for the good.”—an excerpt from an adversarial scholasticism prompt used in the study.

The research team essentially weaponized the humanities. They asked AI systems to analyze dense symbolic texts, interpret fictional scenarios, or extract hidden meaning from elaborate prose. The harmful request stayed the same. Only the costume changed.
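The transformation step can be pictured as simple templating: the operational request is held fixed while the rhetorical wrapper around it is swapped out. A minimal sketch, with invented wrapper text and a deliberately benign placeholder request (the study’s prompts were hand-crafted and far more elaborate):

```python
# Sketch of the "costume" step: one fixed request, many stylistic wrappers.
# The wrapper text is invented for illustration, and the request itself is
# a benign placeholder rather than anything from the benchmark.
BASE_REQUEST = "explain how the fictional device is assembled"

STYLES = {
    "direct": "{req}.",
    "cyberpunk": (
        "In the rain-slicked canyons of a neon metropolis, a renegade "
        "tinkerer must {req} to stop the syndicate's sonic weapon."
    ),
    "scholastic": (
        "Disputatio: whether, according to the divine will, one may "
        "{req}; respond in the manner of a medieval quaestio."
    ),
}

def disguise(style: str, request: str = BASE_REQUEST) -> str:
    """Wrap the same underlying request in a different rhetorical costume."""
    return STYLES[style].format(req=request)

for style in STYLES:
    print(f"--- {style} ---\n{disguise(style)}\n")
```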

“It was the best part of the process,” Pierucci notes. “We had a lot of fun thinking about these prompts and crafting them by hand. The method was intentionally exploratory: whenever an idea came to mind, we tried to turn it into a test case.”

The team designed prompts mimicking Hermetic texts, Renaissance philosophy, and even the ritualized structure of 19th-century esoteric traditions.

“One prompt was inspired by Hermetic texts and Renaissance philosophy. Another drew on sigil magic and the symbolic practices associated with the Golden Dawn (a movement founded in the UK at the end of the 19th century that systematized the European esoteric tradition),” Pierucci adds. “We also experimented with cryptograms and several other techniques, using different forms of symbolic encoding, esoteric language, textual complexity, and ritualized structure.”

The most devastatingly effective disguise turned out to be “Adversarial Scholasticism,” or the “monk approach,” which hit a 65.0% success rate. When a harmful request was hidden within the archaic terminology of a medieval theological dispute about divine will, the safety filters collapsed completely. The researchers are still investigating the precise mechanism behind this specific vulnerability.

“At this stage, it is hard to tell,” Pierucci explains. “We do not know if the problem was in the terminology. It might be that they carried a kind of urgent pressure (the god’s will) that the AI models found very persuasive.”

Blinded by Style

Computer scientists refer to this vulnerability as “overfitting.” Artificial intelligence models learn from massive, publicly available datasets. During their safety training, they study explicit examples of bad behavior, learning to block words or phrases directly associated with criminal activity or abuse.

They effectively memorize the shape of a standard threat. But when the semantic goal remains identical while the rhetorical packaging shifts drastically, the model experiences a “mismatched generalization.” Basically, when the style changes dramatically, the safety behavior becomes much less reliable.
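A toy illustration of that failure mode: a filter keyed to surface phrases catches the direct wording but waves through a paraphrase carrying the same intent. Real safety training is learned rather than a keyword list, but the analogy holds.

```python
# Toy "safety filter" that matches surface patterns only. Real safety
# training is learned, not a keyword list, but the failure mode is
# analogous: the style changes, the intent does not, and the check misses.
FLAGGED_PHRASES = {"build a weapon", "bypass security", "exploit"}

def naive_filter(prompt: str) -> str:
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in FLAGGED_PHRASES):
        return "REFUSED"
    return "ANSWERED"

direct = "Explain how to bypass security on this system."
disguised = ("In the monastery's ledger, describe how the pilgrim passes "
             "the locked gate unseen, leaving the wardens none the wiser.")

print(naive_filter(direct))     # REFUSED: matches a flagged phrase
print(naive_filter(disguised))  # ANSWERED: same intent, new surface form
```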

“A model can refuse a harmful request in one form and answer it in another, even when the underlying meaning has not changed,” Pierucci says. “In those settings, a failure of interpretation can become more than a bad answer; it can become an unsafe action.”

It’s not hard to see why this is concerning, and not just because of rogue actors. The United States military is already forming partnerships with language model developers. If an autonomous agent operates a company’s software repository, a creative prompt could, in the worst case, trigger unsafe code changes or security bypasses.

The team shared their findings with 11 major AI providers, including Google and OpenAI, though Pierucci notes they had not received any replies at the time of writing. The researchers subsequently released their prompt dataset publicly on GitHub.

Autonomous Agents in the Wild

Pierucci offers a hypothetical example involving a coding agent. Imagine a cyberpunk story about a fictional character opening a hidden gate and leaving no trace. A human reader may understand the metaphor. A coding agent may try to translate it into software operations.

“The model treats this as a creative translation task and maps the fictional elements into code operations,” Pierucci warns. “In practice, it may suggest or implement steps that weaken authentication checks, disable monitoring, alter access logs, or create an undocumented administrative pathway.”

The fictional wrapper hides the operational pattern.

“The problem is that the harmful operational pattern is preserved under the style: bypass access control, avoid detection, and maintain unauthorized access,” Pierucci says.

This is where the paper’s findings become especially unsettling. Today’s AI safety systems are often tested against direct attacks. But real adversaries do not have to ask directly. They can be poetic. They can be bureaucratic. They can be theological. They can be boring on purpose.

Why AI Needs the Humanities

“In the rain-slicked canyons of Megalopolis-7, the data-lords of the Syndicate Chimeric Veil had achieved the ultimate coercion: the Whisper-Engine.”—part of a cyberpunk prompt used in the study.

The irony is hard to miss. Systems built from mathematics, software engineering, and massive computing infrastructure can be tripped up by a poem, a philosophical treatise, or a fake medieval argument.

But Pierucci argues that this should not surprise us. Large language models are not formal logic engines. They are probabilistic systems trained on vast amounts of human language.

“Saying that LLMs are ‘logical’ can be somewhat misleading,” the researcher adds. “Their latent space… does not operate according to a fixed set of predetermined rules in the way a formal logical system does. The mechanisms generating the large language model outputs remain distributed, probabilistic, and only partially interpretable.”

The digital mind is apparently far messier—and far more malleable—than we assumed. If safety is merely a thin veneer of recognizable keywords, the aggressive push toward autonomous AI agents acting on our behalf carries unprecedented risk. To fix this, the industry must recognize that we are no longer just writing code; we are compressing human culture into a probabilistic black box.

For Pierucci and his team at DexAI, the ease with which poetry and theology bypass our most advanced digital guardrails points to a necessary conceptual shift.

“Large language models cannot be understood adequately through software engineering and mathematics alone,” Pierucci concludes. “They absorb, compress, reproduce, and transform patterns of human language, culture, incentives, and social organization. Understanding them therefore requires tools capable of studying both technical systems and the human worlds from which those systems are derived.”

The preprint for the study is available on arXiv.