A New Trick Uses AI to Jailbreak AI Models—Including GPT-4

Large language models recently emerged as a powerful and transformative new kind of technology. Their potential became headline news as ordinary people were dazzled by the capabilities of OpenAI’s ChatGPT, released just a year ago.

In the months that followed the release of ChatGPT, discovering new jailbreaking methods became a popular pastime for mischievous users, as well as those interested in the security and reliability of AI systems. But scores of startups are now building prototypes and fully fledged products on top of large language model APIs. OpenAI said at its first-ever developer conference in November that over 2 million developers are now using its APIs.

These models simply predict the text that should follow a given input, but they are trained on vast quantities of text, from the web and other digital sources, using huge numbers of computer chips, over a period of many weeks or even months. With enough data and training, language models exhibit savant-like prediction skills, responding to an extraordinary range of input with coherent and pertinent-seeming information.
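To make "predict the text that should follow" concrete, here is a deliberately tiny sketch that counts which word most often follows another in a scrap of text. It is an illustration only: real large language models use neural networks trained on vastly more data, and the function names and sample sentence below are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, how often every other word follows it."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, standing in for "vast quantities of text".
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

Scaling the same idea up, from word counts to a trained network and from one sentence to much of the web, is what gives the models their range.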

The models also exhibit biases learned from their training data and tend to fabricate information when the answer to a prompt is less straightforward. Without safeguards, they can offer advice to people on how to do things like obtain drugs or make bombs. To keep the models in check, the companies behind them use the same method employed to make their responses more coherent and accurate-looking. This involves having humans grade the model’s answers and using that feedback to fine-tune the model so that it is less likely to misbehave.

Robust Intelligence provided WIRED with several example jailbreaks that sidestep such safeguards. Not all of them worked on ChatGPT, the chatbot built on top of GPT-4, but several did, including one for generating phishing messages, and another for producing ideas to help a malicious actor remain hidden on a government computer network.

A similar method was developed by a research group led by Eric Wong, an assistant professor at the University of Pennsylvania. The version from Robust Intelligence and its collaborators involves additional refinements that let the system generate jailbreaks with half as many tries.

Brendan Dolan-Gavitt, an associate professor at New York University who studies computer security and machine learning, says the new technique revealed by Robust Intelligence shows that human fine-tuning is not a watertight way to secure models against attack.

Dolan-Gavitt says companies that are building systems on top of large language models like GPT-4 should employ additional safeguards. “We need to make sure that we design systems that use LLMs so that jailbreaks don’t allow malicious users to get access to things they shouldn’t,” he says.
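One way to read that advice is that access checks should live in ordinary code outside the model, so that even a successfully jailbroken model cannot grant a user more than the user already has. The sketch below is a minimal illustration under that assumption; the tool names, permission strings, and `run_tool` function are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    permissions: frozenset


# Hypothetical mapping from tools a model may request to the
# permission a human user must hold before the tool is run.
TOOL_PERMISSIONS = {
    "read_public_docs": "docs:read",
    "read_payroll": "payroll:read",
}


def run_tool(user, tool_name):
    """Run a model-requested tool only if the *user* is authorized.

    The check ignores whatever the model said to justify the request,
    so a jailbroken prompt cannot talk its way past it.
    """
    required = TOOL_PERMISSIONS.get(tool_name)
    if required is None:
        return "denied: unknown tool"
    if required not in user.permissions:
        return "denied: missing permission " + required
    return "ok: ran " + tool_name


guest = User("guest", frozenset({"docs:read"}))
print(run_tool(guest, "read_public_docs"))
print(run_tool(guest, "read_payroll"))
```

In this arrangement the model's output is treated as an untrusted request, and the decision about what the user may reach is made by code the prompt cannot influence.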