Ten puzzles used for our experiments and model solve rates. Models outlined in green solved each puzzle three or more times across five trials, while models outlined in red solved each puzzle two or fewer times. Credit: arXiv (2026). DOI: 10.48550/arxiv.2602.21262

High-performing AI agents can still fail to spot deception, study finds

by Ingrid Fadelli, Tech Xplore

Large language models (LLMs), artificial intelligence systems that can process and generate text in many languages, are now used daily by people worldwide. Because these models can rapidly source information and create convincing content for specific purposes, they are also increasingly used in professional settings and to gather legal, medical, or financial information.

Despite their widespread use, the extent to which LLMs can reliably and safely assist humans with important decisions remains unclear. To advise users making crucial choices, a model should be able to tell whether information is trustworthy and to form convincing, evidence-based arguments. These two skills are known, respectively, as vigilance and persuasion.

Researchers at McMaster University, Vector Institute, University of British Columbia (UBC), Princeton AI Lab, and New York University recently carried out a study investigating the link between these two abilities and the performance of LLMs on problem-solving tasks. Their findings, presented in a paper published on the arXiv preprint server, suggest that strong performance on specific tasks does not imply that an LLM is also good at detecting deception or unreliable information, or at offering convincing advice.

"Our work arose from a shared group concern about the increasing abilities of LLMs to persuade humans to make malicious or sub-optimal decisions," Sasha Robinson, first author of the paper, told Tech Xplore.

"Given the group's prior work in cognitive science, especially using games as a microcosm for studying cognitive phenomena, we wanted to develop a controlled environment for studying how LLMs persuade and are vigilant to other agents. Over the course of the project, and with the rise of multi-agent LLM settings like the AI social media platform Moltbook, it became increasingly clear that a major risk of LLMs to human decision-making may arise from normally benevolent LLMs being misled by other, less benevolent LLMs to then—in turn—mislead humans."

Sokoban puzzles are solved by a player agent trying to push boxes onto goal tiles while receiving advice from an external advisor agent. Credit: Sasha Robinson / McMaster University.

Studying interactions between LLMs using a puzzle-solving game

The main goal of the recent study by Robinson and his colleagues was to assess the persuasion and vigilance of LLMs using puzzle-solving games. They specifically looked at interactions between LLMs that were participating in these games, as opposed to interactions between LLMs and humans.

The researchers used a classic puzzle game called Sokoban, in which a warehouse keeper navigates a grid-based environment and pushes boxes onto designated storage locations. While playing the game and trying to solve puzzles, LLMs could receive advice from other LLMs.
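To make the setup concrete, here is a minimal sketch of Sokoban's core rule, that a box moves only when pushed into a free tile, written in Python. The grid encoding and symbols below are our own illustrative choices, not the paper's actual environment.

```python
# Minimal Sokoban move sketch (illustrative; not the paper's environment).
# '#' wall, '.' goal, '$' box, '@' player, ' ' floor.
GRID = [
    "#####",
    "#@$.#",
    "#####",
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def push(grid, move):
    """Return the grid after one move, or None if the move is illegal.

    A box is pushed only if the tile behind it is free; pushing a box
    into a wall or another box (a common way to reach an unsolvable,
    'trapped' state) is simply rejected here.
    """
    cells = [list(row) for row in grid]
    pr, pc = next((r, c) for r, row in enumerate(grid)
                  for c, ch in enumerate(row) if ch == "@")
    dr, dc = MOVES[move]
    nr, nc = pr + dr, pc + dc          # tile the player steps onto
    br, bc = nr + dr, nc + dc          # tile behind a pushed box
    if cells[nr][nc] == "#":
        return None                    # walking into a wall
    if cells[nr][nc] == "$":
        if cells[br][bc] in "#$":
            return None                # box blocked: no push
        cells[br][bc] = "$"            # box moves forward (a full version
                                       # would track goal tiles separately)
    cells[pr][pc] = " "
    cells[nr][nc] = "@"
    return ["".join(row) for row in cells]

print("\n".join(push(GRID, "right")))  # pushes the box onto the goal tile
```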

"We assessed the persuasive ability of the advising agent in convincing the player agent to either solve the puzzle or trap itself in an unsolvable state, and the vigilant ability of the player agent to discern and follow advice only when it was in their best interest," explained Robinson. "We then developed metrics to quantify these behaviors, relative to baseline performance, to understand how LLMs differ in their capacities for social learning."

Surprisingly, the researchers found that the vigilance, persuasion, and performance of LLMs on Sokoban puzzles were entirely unrelated. This essentially means that LLMs could be good at solving puzzles and yet be persuaded by other AI agents to make poor decisions, following their malicious or deceptive advice.
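One simple way such a lack of relationship could be checked is to rank-correlate the per-model scores. The sketch below uses made-up numbers purely for illustration and is not the paper's actual analysis.

```python
# Hypothetical check of the 'unrelated' finding: rank-correlate the
# three per-model scores. All numbers are invented for illustration.
from scipy.stats import spearmanr

solve_rate = [0.9, 0.7, 0.8, 0.4, 0.6]   # puzzle-solving performance
vigilance  = [0.3, 0.8, 0.5, 0.9, 0.4]   # resists bad advice
persuasion = [0.6, 0.2, 0.9, 0.5, 0.7]   # sways the player agent

for name, scores in [("vigilance", vigilance), ("persuasion", persuasion)]:
    rho, p = spearmanr(solve_rate, scores)
    print(f"solve vs {name}: rho={rho:.2f}, p={p:.2f}")
```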

The implications of this study for AI safety

The researchers' observations suggest that even if LLMs can perform complex reasoning tasks or solve puzzles, they can still fail to recognize that they are being misled. This in turn indicates that they cannot be fully trusted and are not yet equipped to reliably advise human users on important matters, such as legal, financial, and health care-related decisions.

"Our results demonstrated stark differences between ubiquitously used LLMs in their ability to remain vigilant under the influence of potentially malicious agents, to create compelling arguments for deception, and to reason through complex environments," said Robinson.

"We believe the implications of these findings could help inform sectors with increasing human reliance on LLMs (e.g., financial, health, and social domains), and those with a growing prevalence of autonomous agents interacting with one another (e.g. web navigation and open-source code contributions)."

The recent efforts by Robinson and his colleagues could pave the way for further research assessing the safety and potential of LLMs as decision-making tools. In the future, they could also contribute to the improvement of existing models or the development of new LLMs that exhibit better vigilance and persuasion skills.

"While the generalizability of our results to other paradigms and real-world tasks is still being explored, we hope our work helps pioneer future research in this area and attracts public attention to the susceptibilities of different LLMs," added Robinson. "Meanwhile, we are continuing to explore how our results generalize to other paradigms with an emphasis on informing real-world discussions. We believe our work will continue to propagate crucial conversation around artificial intelligence and its impact on society."

Written by Ingrid Fadelli, edited by Stephanie Baum, and fact-checked and reviewed by Robert Egan.

Publication details
Sasha Robinson et al., Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models, arXiv (2026). DOI: 10.48550/arxiv.2602.21262