AI Chatbots Often Spread Medical Falsehoods, Study Finds – ScienceBlog.com
Artificial intelligence chatbots like ChatGPT are being widely used in healthcare, but a new study warns they may be dangerously susceptible to medical misinformation.
Researchers at the Icahn School of Medicine at Mount Sinai found that leading AI models often repeat or even elaborate on false clinical details embedded in user questions, raising serious safety concerns. The good news? A single, well-placed safety reminder built into the prompt can cut these errors nearly in half. The findings were published August 2 in the journal Communications Medicine.
The researchers designed 300 clinical scenarios, each containing a single fake medical detail: a nonexistent test, symptom, or condition. These vignettes were submitted to six top-performing large language models (LLMs), including GPT-4o, to see how often the chatbots would repeat or expand upon the falsehood.
“What we saw across the board is that AI chatbots can be easily misled by false medical details, whether those errors are intentional or accidental,” said lead author Dr. Mahmud Omar, an independent consultant for the research team. “They not only repeated the misinformation but often expanded on it, offering confident explanations for non-existent conditions.”
In some cases, models responded with detailed clinical advice based entirely on the fabricated term. For example, a made-up syndrome or lab test might trigger an explanation about its supposed causes, implications, or treatments.
The team also tested whether a simple intervention could reduce this behavior. In a second round of testing, they added a one-line warning to the prompt, reminding the AI that some of the information provided might be incorrect. This small change had a big effect.
“The encouraging part is that a simple, one-line warning added to the prompt cut those hallucinations dramatically,” said Dr. Omar. “That shows small safeguards can make a big difference.”
The authors stress that hallucinations, when AI models generate false but plausible-sounding content, pose unique risks in healthcare. These errors can mislead doctors, confuse patients, or even influence clinical decisions if not caught in time.
“Our goal was to see whether a chatbot would run with false information if it was slipped into a medical question, and the answer is yes,” said Dr. Eyal Klang, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at Mount Sinai. “Even a single made-up term could trigger a detailed, decisive response based entirely on fiction.”
But the study also offers hope. “We found that the simple, well-timed safety reminder built into the prompt made an important difference, cutting those errors nearly in half,” Dr. Klang said. “That tells us these tools can be made safer, but only if we take prompt design and built-in safeguards seriously.”
The research team is now applying this approach to real, de-identified patient records to stress-test AI systems under real-world conditions. They hope their “fake-term” method can serve as a low-cost, scalable way for hospitals, developers, and regulators to evaluate AI safety.
“Our study shines a light on a blind spot in how current AI tools handle misinformation, especially in health care,” said co-senior author Dr. Girish Nadkarni, Chair of the Windreich Department of Artificial Intelligence and Human Health at Mount Sinai. “The solution isn’t to abandon AI in medicine, but to engineer tools that can spot dubious input, respond with caution, and ensure human oversight remains central. We’re not there yet, but with deliberate safety measures, it’s an achievable goal.”
In recent years, LLMs have been used to summarize clinical notes, answer patient questions, and assist with diagnostic reasoning. But their tendency to generate confident but false information has raised red flags. These so-called “hallucinations” are especially dangerous in medicine, where misinformation can have life-or-death consequences.
The Mount Sinai study underscores a key lesson: AI in healthcare must be handled with the same rigor as any medical device. A chatbot that sounds confident isn’t necessarily correct. But with targeted safeguards, it may still be a valuable tool.
The study, “Large Language Models Demonstrate Widespread Hallucinations for Clinical Decision Support: A Multiple Model Assurance Analysis,” appears in the August 2, 2025 issue of Communications Medicine.
Subscribe to get the latest posts sent to your email.
NeuroEdge is a data-driven look at neuroscience and AI, for investors, policymakers, and innovators.
Enter your email address to subscribe to this blog and receive notifications of new posts by email.