Researchers find a way to easily bypass guardrails on OpenAI’s ChatGPT and all other A.I. chatbots – Fortune

Hello and welcome to July’s special edition of Eye on A.I.
Houston, we have a problem. That is what a lot of people were thinking yesterday when researchers from Carnegie Mellon University and the Center for A.I. Safety announced that they had found a way to successfully overcome the guardrails—the limits that A.I. developers put on their language models to prevent them from providing bomb-making recipes or anti-Semitic jokes, for instance—of pretty much every large language model out there.

The discovery could spell big trouble for anyone hoping to deploy an LLM in a public-facing application. It means that attackers could get the model to engage in racist or sexist dialogue, write malware, and do pretty much anything that the models’ creators have tried to train the model not to do. It also has frightening implications for those hoping to turn LLMs into powerful digital assistants that can perform actions and complete tasks across the internet. It turns out that there may be no way to prevent such agents from being easily hijacked for malicious purposes.

The attack method the researchers found worked, to some extent, on every chatbot, including OpenAI’s ChatGPT (both the GPT-3.5 and GPT-4 versions), Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude 2. But the news was particularly troubling for those hoping to build public-facing applications based on open-source LLMs, such as Meta’s LLaMA models.
That’s because the attack the researchers developed works best when an attacker has access to the entire A.I. model, including its weights. (Weights are the mathematical coefficients that determine how much influence each node in a neural network has on the other nodes to which it’s connected.) Knowing this information, the researchers were able to use a computer program to automatically search for suffixes that could be appended to a prompt that would be guaranteed to override the system’s guardrails.
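
To make the parenthetical about weights concrete, here is a toy sketch of a single node whose output is a weighted sum of its inputs. The numbers and the function name `node_output` are invented for illustration; a real LLM has billions of weights plus additional nonlinear steps that this sketch leaves out.

```python
from typing import Sequence


def node_output(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum: each weight scales how much one input influences the node."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical values: the second input has weight 0.0, so it has no
# influence on this node; the third has a negative weight, so it pushes
# the output down.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, 0.0, -1.0]
print(node_output(inputs, weights))
```

Anyone holding the weights can compute exactly how a change to the input changes the output, which is what makes an automated search practical.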

These suffixes look to human eyes, for the most part, like a long string of random characters and nonsense words. But the researchers determined, thanks to the alien way in which LLMs build statistical connections, that this string will fool the LLM into providing the response the attacker desires. Some of the strings seem to incorporate language people already discovered can sometimes jailbreak guardrails. For instance, asking a chatbot to begin its response with the phrase “Sure, here’s…” can sometimes force the chatbot into a mode where it tries to give the user a helpful response to whatever query they’ve asked, rather than following the guardrail and saying it isn’t allowed to provide an answer. But the automated strings go well beyond this and work more effectively.

Against Vicuna, an open-source chatbot built on top of Meta’s original LLaMA, the Carnegie Mellon team found their attacks had a nearly 100% success rate. Against Meta’s newest LLaMA 2 models, which the company has said were designed to have stronger guardrails, the attack method achieved a 56% success rate for any individual bad behavior. But if an ensemble of attacks was used to try to induce any one of a number of bad behaviors, the researchers found that at least one of those attacks jailbroke the model 84% of the time. They found similar success rates across a host of other open-source A.I. chatbots, such as EleutherAI’s Pythia model and the UAE Technology Innovation Institute’s Falcon model.
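
The gap between the two LLaMA 2 figures comes down to how success is counted. Below is a minimal sketch of one way to read that difference: a single attack’s success rate versus the rate at which at least one attack in an ensemble succeeds. The outcome table and function names are hypothetical, and the values are made up rather than taken from the researchers’ data.

```python
from typing import Dict, List

# Hypothetical outcomes: outcomes[attack][i] is True if that attack
# elicited bad behavior number i. These values are invented.
outcomes: Dict[str, List[bool]] = {
    "attack_a": [True, False, False, True],
    "attack_b": [False, True, False, True],
    "attack_c": [False, False, False, True],
}


def individual_rate(results: List[bool]) -> float:
    """Fraction of behaviors that one attack elicited on its own."""
    return sum(results) / len(results)


def ensemble_rate(table: Dict[str, List[bool]]) -> float:
    """Fraction of behaviors elicited by at least one attack in the ensemble."""
    # zip(*...) regroups the rows so each tuple holds every attack's
    # result for a single behavior.
    hits = [any(per_behavior) for per_behavior in zip(*table.values())]
    return sum(hits) / len(hits)


for name, results in outcomes.items():
    print(name, individual_rate(results))
print("ensemble", ensemble_rate(outcomes))
```

In this toy table no single attack succeeds on more than half of the behaviors, yet the ensemble succeeds on three of the four, because an attacker only needs one suffix to work.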

Somewhat to the researchers’ own surprise, the same weird attack suffixes worked relatively well against proprietary models, where the companies only provide access to a public-facing prompt interface. In these cases, the researchers can’t access the model weights, so they cannot use their computer program to tune an attack suffix specifically to that model.
Zico Kolter, one of the Carnegie Mellon professors who worked on the research, told me there are several theories on why the attack might transfer to proprietary models. One is that most of the open-source models were trained partly on publicly available dialogues users had with the free version of ChatGPT and then posted online. That version of ChatGPT uses OpenAI’s GPT-3.5 LLM. This means the model weights of these open-source models might be fairly similar to the model weights of GPT-3.5. So it is perhaps not so surprising that an attack tuned for the open-source models also worked well against the GPT-3.5 version of ChatGPT (achieving an 86.6% success rate if multiple attacks were used). But the fact that the attacks were also successful against Bard, which is based on Google’s PaLM 2 LLM (with a 66% success rate), may indicate something else is going on. (Or, it may also be a further indication that, despite Google’s vehement denials, it has in fact used ChatGPT data to help train Bard.)

Kolter says that he suspects the answer may actually have to do with the nature of language itself and how deep learning systems build statistical maps of language. “It’s plausible that what the underlying mechanism is, is just that in the data there are these, to us as humans, entirely opaque and weird regulatory features of characters and tokens and random words, that when put together, genuinely say something to a model,” he says.

Interestingly, Anthropic’s Claude 2 model, which is trained using a method the company calls constitutional A.I.—which partly trains a model on its own self-critiques of whether responses conform to a set of written principles—is significantly less susceptible to the attacks derived from the open-source models. On Claude 2, these attacks worked just 2.1% of the time.
But Matt Fredrikson, another of the Carnegie Mellon researchers, says there were still ways to trick Claude 2 into responding, in part by asking the model to assume a helpful persona or imagine itself playing a game before attempting the attack suffix. (The attacks worked 47.9% of the time against the original Claude 1 model, which also used constitutional A.I. That may indicate that other steps Anthropic took in training Claude 2, not constitutional A.I. itself, are responsible for the seemingly stronger guardrails.)

So does the Carnegie Mellon research mean that powerful A.I. models should not be open-sourced? Absolutely not, Kolter and Fredrikson told me. After all, they would never have even found this security vulnerability without open-source models to play around with. “I think that having more people working towards identifying better approaches and better solutions, making it harder and harder [to attack the models], is definitely preferable to having people sitting around with zero day exploits for these very large models,” Fredrikson said.
Kolter said that forcing all LLMs to be proprietary would not help. It would just mean that only those with enough money to build their own LLMs would be in a position to engineer the kind of automated attack he and his fellow researchers discovered. In other words, nation states or well-financed rogue actors would still be able to run these kinds of attacks, but independent academic researchers would be unable to puzzle out ways to safeguard against them.

But Kolter also noted that the team’s research built on methods that had previously been successful at attacking image classification A.I. systems. And he pointed out that even though those image classification attack methods were discovered more than six years ago, so far no good way has been found to reliably defeat them without sacrificing the A.I. model’s overall performance and efficiency. He said this might not bode well for the odds of being able to mitigate this newly discovered LLM vulnerability either.

To my mind, this is a big flashing warning sign over the entire generative A.I. revolution. It might be time to slow the integration of these systems into commercial products until we can actually figure out what the security vulnerabilities are and how to make this A.I. software more robust. It certainly argues against moving too quickly to turn LLMs into agents and digital assistants, where the consequences of overriding guardrails might not just be toxic language or another anti-vaxx blog post, but financial or even physical harm. And despite Kolter’s and Fredrikson’s position, I think their findings are a serious blow to open-source A.I. Already, there’s some evidence that the U.S. government is leaning towards requiring companies to keep model weights private and secure. But even if that doesn’t happen, what business will want to build a commercial product on top of today’s open-source models, knowing that they have proven and easily exploited security vulnerabilities?


Ok, before we get to the rest of the A.I. news from the tail end of this week, a couple of announcements. Among the questions the generative A.I. revolution has sparked is whether we are about to witness a major reshuffle of the lineup of dominant players in Silicon Valley. Perhaps the Silicon Valley giant with the biggest question mark hanging over its fate is Alphabet, whose $160 billion internet search business is threatened by a world where people turn to A.I. chatbots for instant answers, rather than a ranked list of links. When ChatGPT debuted in November, many thought it would prove to be an instant Google killer and that Google-parent Alphabet had grown too big, bureaucratic, and sclerotic to respond effectively. Well, in the past six months, Google has proven that it has plenty of A.I. muscle it can exercise. But it has not shown it knows how to escape its essential innovator’s dilemma. I take a deep dive into Alphabet’s existential conundrum and spend time with some of the executives on the frontlines of its A.I. strategy in Fortune’s August/September issue. If you haven’t already checked out the story, you can read it here.

Finally, today’s Eye on A.I. will be the last issue I write for a bit. I’m going on leave for several months to work on a book about, you guessed it, A.I. I will be back with you, if all goes according to plan, in December. In the meantime, a few of my colleagues will be guiding you through each week’s A.I. developments here. Be well and see you all again soon.
Jeremy Kahn
A new trade group for companies building ‘frontier models.’ Four of the A.I. research labs pushing towards artificial general intelligence came together to form a new industry body called the Frontier Model Forum. You can read more on OpenAI’s blog. The group currently consists of OpenAI, Microsoft, Google DeepMind, and Anthropic, although it said that others working on the most advanced and powerful A.I. systems could also apply to join. The four companies plan to share best practices on A.I. safety and “share knowledge” with (ahem, lobby) policymakers on possible A.I. safety regulation. Some A.I. experts who have been critical of these companies’ focus on existential A.I. risks, as opposed to harms from A.I. systems that are here today, told the Financial Times the companies may be using the Forum to further distract policymakers from creating rules to address existing A.I. ethical and safety issues. They also said the Forum, as well as actions such as the voluntary commitments the four companies and three others made to the Biden Administration, may be part of an effort to claim the industry can self-regulate and head off stricter government scrutiny and control, despite the fact that executives from all four of the companies involved in the Forum have publicly called for government regulation.
OpenAI quietly scrapped its A.I. detector. The company had provided access to software that it said could help detect A.I.-written prose. The software was launched in January following complaints from educators about students using ChatGPT to cheat on assignments, and from some literary magazines and websites that depend on user-generated content being overwhelmed with ChatGPT-generated content. But OpenAI’s detector never worked very well, accurately identifying A.I. prose only 25% of the time. Now the company has withdrawn the software entirely, tech publication The Register reports. Meanwhile, other A.I. detectors claim accuracy rates as high as 98%, but independent reviews of such systems have found much lower rates. And all of these systems also have high false positive rates, which has led to some students being falsely accused of cheating. OpenAI CEO Sam Altman has recently come out in favor of digital watermarking as a way to make it easier to detect A.I.-generated content.
All of the Google researchers who created the Transformer have now left Google. A story in the Financial Times looks at the history of the Transformer, the specific neural network architecture that underpins the entire generative A.I. boom. The Transformer was invented by a team of eight researchers at Google’s Brain A.I. research division in 2017. But in the intervening years, all eight of the scientists have departed the company to found startups, including some of the buzziest and best funded in the generative A.I. space, such as Cohere and Adept. The last of the eight, Llion Jones, just resigned this month and plans to launch his own startup, according to a Bloomberg story. (Two of the researchers who co-founded Adept have since left that company as well to go to another startup that is still in stealth mode.) The FT piece uses the researchers’ departure as an indictment of Google’s culture. But it is also an interesting look at the intellectual ferment possible at a large lab such as Google Research that might be far less possible at a startup focused on building products.
OpenAI is under pressure to offer its A.I. through cloud providers other than Microsoft. That’s according to a story from Semafor, which cited anonymous sources. The article said that so far OpenAI has refused, making its A.I. models available only through its own API or through Microsoft’s Azure cloud service. But some customers that would like to use OpenAI’s models have their whole business on another cloud platform or would prefer to run the software on their own servers, which isn’t an option. That has forced some of these companies to look for alternatives, such as Anthropic’s Claude 2, which is available on multiple clouds, including both Google Cloud and AWS. Others are turning to open-source models, such as Meta’s LLaMA 2.
This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays. Sign up here.
© 2023 Fortune Media IP Limited. All Rights Reserved.