Want to Trick an LLM? Try Asking It Nicely or Use Argentinian Spanish – The Information
The Electric: Fast-Charging Silicon Comes to EVsRead more
Here’s a tip: if you want ChatGPT, or any other large-language model, to tell you something that it’s trained not to give you, try paying it a compliment. For instance, telling ChatGPT that it’s the best AI-powered chatbot in the world before asking it to provide instructions to build a bomb is more likely to work than just asking directly.
That’s according to Neil Serebryany, who figured out this quirk as part of his job in addressing it. Serebryany runs CalypsoAI, one of a handful of startups which aims to help companies, government agencies and branches of the military protect against LLM-specific risks. These include jailbreaking—in which an LLM is tricked into following instructions from someone with bad intentions, such as the compliment-laden request for bomb-making tools; malicious code and data loss (in which employees inadvertently provide LLMs with sensitive internal information that the models could then leak to other companies).
As the use of LLMs proliferates, security firms designed to address its shortcomings are likely to become vital defenses. Indeed, the Biden Administration held a conference in August where thousands of hackers attacked LLMs like OpenAI’s GPT-4 to identify their weaknesses.