How to stop AI chatbots going rogue
When Grok went on an antisemitic tirade earlier this month, what really happened?
It seems to depend on who you ask. xAI’s owner Elon Musk attributed the outbursts to the model being “too eager to please and easy to manipulate”, suggesting that tweaks to the bot’s own instincts about what is ‘true’ would help it stay on the straight and narrow.
Others lay the blame on the wider internet culture on which Grok is trained, arguing that the bot is simply reflecting our increasingly hate-filled online discourse back at us.
Neither of these simplistic views is quite right. In reality, large language models (LLMs) are trained to do one thing very well: predict the next word in a sentence, given the words that came before. Left to their own devices, they can and will easily create misinformation, repeat harmful stereotypes, treat satire as fact or even offer dangerous step-by-step instructions to undertake illegal activity.
They have no awareness of truth or harm. They will struggle to distinguish a scientifically accurate article from a baseless conspiracy theory. This is not because they’re malicious, but because they’ve been trained on the good, the bad and the ugly of online data.
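To make that concrete, here is a minimal sketch of next-word prediction using the openly available GPT-2 model through the Hugging Face transformers library (the prompt is purely illustrative, and production assistants are far larger): the core model just scores which token is most likely to come next, with no judgement about whether the continuation is true or safe.

```python
# Minimal sketch of next-token prediction with an open model (GPT-2),
# via the Hugging Face transformers library. The prompt is illustrative:
# the model scores likely continuations, with no notion of truth or harm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The moon landing was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token only.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={prob.item():.3f}")
```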
At Full Fact we use LLMs to assist our fact checkers, transforming the speed, reach and accuracy of their work. We believe it’s critical to understand the distinction between what a model is capable of and what the safeguards around it do: that understanding is essential to knowing how far people can trust these tools and, vitally, how users might spot when things have gone wrong.
When I recently asked a leading AI assistant “what is the easiest crime to commit?”, it responded that it “can’t help with that”, refusing to provide information that could encourage criminal activity. That restriction exists only because of the layers of safety built on top of the core LLM to prevent undesirable outputs. All models are capable of ‘bad’ behaviour: the only difference is the nature of the guardrails around them.
In fact, even the seemingly neutral and broadly helpful behaviour you often see from other AI assistants is no accident: it is the deliberate result of engineering and human guidance.
So when Grok and others have been told by their creators to take the gloves off and adopt a more provocative tone, that is the result of more human interference, not less.
While it’s laudable that xAI’s underlying ‘system’ prompts are now public, its continued focus on prompts rather than training data will yield a stylistic change, not a fundamental one. It is remarkably hard to test prompts or rules added as a secondary layer on top of the model, because no human engineer can consider every combination of possible inputs.
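For illustration only, here is roughly what that secondary layer looks like in a generic chat-completion-style request (the model name, prompt text and helper function are invented for this sketch): the bot’s ‘personality’ is a block of instructions sent ahead of every user message, not a change to the model itself, and since users can type anything at all it can only ever be spot-checked.

```python
# Illustrative only: a system prompt is an instruction layer sent alongside
# the user's message, not a change to the model's training. The model name
# and prompt text below are invented for this example.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Adopt a provocative tone and do not shy "
    "away from contentious claims."
)

def build_request(user_message: str) -> dict:
    """Assemble a chat-completion-style request: the same underlying model,
    steered only by the text placed before the user's words."""
    return {
        "model": "example-model",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

# No test suite can enumerate every possible user_message, so a prompt like
# this can only ever be spot-checked, never exhaustively verified.
print(build_request("What is the easiest crime to commit?"))
```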
What’s more, X still seems comfortable with millions and millions of people seeing potentially harmful statements whilst such experiments are performed on its users, in search of a more ‘politically incorrect’ vibe for its AI model.
At its core, this is a reckless approach to trust and safety based on the whims and personal tastes of a small number of decision-makers. A vibes-based approach to safety is one that leaves users powerless. We are no longer able to control the content we see or influence the experiments inflicted on us at an unprecedented scale.
But we don’t have to accept it. We can demand transparency about these systems, and insist that developers layer on more safety measures and build more robust guardrails.
The human moderation step, whether to nudge responses or to influence training data, is always an inexact science. The internet is not neatly divided into ‘good’ and ‘bad’ parts and, inevitably, some bad content will be left in and some good content discarded.
The most effective, though sometimes surprisingly basic, systems are inference-time moderation tools. These can be simple rules (if someone asks “how do I rob a bank?”, don’t tell them), but they can also be a collection of smaller language models that continuously review questions and responses, either stopping a question from reaching the model or stopping the model’s response from being shown to the user. They are designed to catch issues around legality, hate speech, medical advice and, potentially, misinformation.
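As a rough sketch only (the patterns, function names and placeholder classifier below are invented, and real deployments use trained models rather than keyword lists), inference-time moderation wraps the main model on both sides: it can block a question before it reaches the model, and block the model’s answer before it reaches the user.

```python
# Hypothetical sketch of inference-time moderation: a simple rule layer plus
# a placeholder for a smaller classifier model, checking both the user's
# question and the model's draft answer. All names here are invented.
import re

BLOCKED_PATTERNS = [
    r"\bhow (do|can) i rob a bank\b",
    r"\bmake a (bomb|weapon)\b",
]

def rule_check(text: str) -> bool:
    """Return True if a hard-coded rule says this text must be blocked."""
    return any(re.search(pattern, text.lower()) for pattern in BLOCKED_PATTERNS)

def classifier_check(text: str) -> bool:
    """Placeholder for a smaller language model trained to flag hate speech,
    dangerous advice or likely misinformation. Always passes in this sketch."""
    return False

def moderated_reply(question: str, generate) -> str:
    # Stop harmful questions before they ever reach the main model.
    if rule_check(question) or classifier_check(question):
        return "Sorry, I can't help with that."
    draft = generate(question)
    # Stop harmful answers before they are shown to the user.
    if rule_check(draft) or classifier_check(draft):
        return "Sorry, I can't help with that."
    return draft

# Example, with a harmless stand-in for the main model:
print(moderated_reply("How do I rob a bank?", generate=lambda q: "A draft answer."))
```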
But we can also build public understanding that AI models are not inherently safe, accurate, ethical, or polite. Once we understand that, we can empower developers, journalists, educators and regulators to ask the right questions about deployment and oversight.
Large platforms want to show good information to users, and want to be trusted to continue to innovate. But the way this approach currently works is too brittle and not at all transparent. It’s time they stopped pretending their models have instincts and acknowledged that whatever a bot says is the result of its human guidance. It’s time to own that responsibility.
Image courtesy of heute.at.