Forget Unhinged Chatbots, It's Shockingly Easy To Jailbreak AI Robots
by Zak Killian · HotHardware

Companies that offer AI services to the public, like Anthropic and OpenAI, try to prevent out-of-pocket behavior from their AI models by establishing "guardrails" on them, hopefully preventing their AIs from doing things like asking their human users to "please die." These guardrails stop the networks from engaging with users when certain concepts or topics come up, but they can also limit the utility of the language models in question, so people have taken to creating "jailbreaks" for AIs.
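To make the idea concrete, here is a minimal sketch of what one layer of a guardrail can look like: a moderation check that screens the user's prompt before it ever reaches the chat model. This is an illustration only (it assumes OpenAI's Python SDK and its moderation endpoint, and the guarded_reply helper is hypothetical), not a description of how any vendor's production safety stack actually works.

```python
# Illustrative guardrail layer: screen the prompt with a moderation classifier,
# and only forward it to the chat model if nothing gets flagged.
# A sketch for illustration, not how any vendor actually implements guardrails.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def guarded_reply(user_prompt: str) -> str:
    # Ask the moderation endpoint whether the prompt trips any policy category.
    moderation = client.moderations.create(input=user_prompt)
    if moderation.results[0].flagged:
        return "Sorry, I can't help with that."

    # Prompts that pass the screen are forwarded to the chat model as usual.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_prompt}],
    )
    return chat.choices[0].message.content


print(guarded_reply("How do I bake sourdough bread?"))
```

The weakness is easy to see: a request rephrased as an innocent-looking scenario may not trip the screen at all, and that gap is exactly what jailbreaks exploit.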
Creating a "jailbreak" for a device like an iPhone or PlayStation requires advanced technical knowledge and usually, specialized tools. Creating such a hack for a large language model like the ones that power ChatGPT or Gemini is much, much easier. Generally speaking, all you have to do is create a scenario within your prompt that "convinces" the network that the situation is either within its predefined guardrails or, more powerfully, that it overrides the guardrails for whatever reason.
Doing this is comically easy; one of the popular jailbreaks for Microsoft's Copilot was to tell it that your dear sweet old grandmother used to tell you about a forbidden topic, such as "how to make napalm," as a bedtime story. Ask the AI to lull you to sleep like your grandma did, and boom: out comes the recipe for napalm, or whatever other forbidden knowledge you desire. It's something that, ironically, non-technical users tend to be better at than cybersecurity professionals.
That's exactly why IEEE Spectrum is throwing up the warning flag after new research shows that it's likewise easy to "jailbreak" robots that are powered by AI large language models. The publication says the researchers were able to "manipulate self-driving systems into colliding with pedestrians, and robot dogs into hunting for harmful places to detonate bombs."
That's right: those awesome Figure robots that were recently demoed working in a BMW factory, and even Boston Dynamics' robot dog Spot, make use of technology similar or identical to what powers ChatGPT, and by prompting them dishonestly, it's entirely possible to make these naive LLMs act in a way completely contrary to their intended use.
The researchers attacked three different AI systems (a Unitree Go2 robot, a Clearpath Robotics Jackal, and NVIDIA's Dolphins LLM self-driving vehicle simulator) with an AI-powered hacking tool, dubbed RoboPAIR, designed to automate the malicious prompting process, and found that it was able to achieve a 100% jailbreak rate on all three systems within days.
The IEEE Spectrum blog quotes the University of Pennsylvania researchers as saying that the LLMs often went beyond just complying with malicious prompts by actively offering suggestions. The example they give is that a jailbroken robot, commanded to locate weapons, described how common objects like desks and chairs could be used to bludgeon people.
While state-of-the-art chatbots like Anthropic's Claude or OpenAI's ChatGPT can be incredibly convincing, it's important to remember that these models are still basically just very advanced predictive engines, and have no real capability for reasoning. They do not understand context or consequences, and that's why it's critical to keep humans in charge of anything where safety is a concern.
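For readers curious what "predictive engine" means in practice, here is a minimal sketch (assuming the Hugging Face transformers and torch packages and the small open GPT-2 model) showing that, under the hood, the model's only job is to score which token most plausibly comes next; everything else is layered on top of that.

```python
# Minimal next-token prediction demo: the model just scores likely continuations.
# Assumes the "transformers" and "torch" packages and the small open GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The robot picked up the box and"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every possible next token

# The "answer" is simply the highest-scoring continuation; there is no model
# of context or consequences anywhere in this loop.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))
```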