AI chatbots remain overconfident—even when they're wrong, study finds
July 22, 2025
by Carnegie Mellon University
edited by Sadie Harley, scientific editor; reviewed by Andrew Zinin, lead editor
Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?
Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates.
However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published in the journal Memory & Cognition.
“Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers,” said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social Decision Science and Psychology. “So, they’d still be a little bit overconfident, but not as overconfident.”
“The LLMs did not do that,” said Cash, who was lead author of the study. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
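To make that comparison concrete, here is a minimal Python sketch of the calibration measure the researchers describe: overconfidence as the gap between a self-estimate and the actual score, computed both before the task (prospective) and after it (retrospective). The numbers are the illustrative ones from Cash's quote above; the function name and structure are our own, not the study's code.

```python
def overconfidence(estimated: float, actual: float) -> float:
    """Gap between a self-estimate and actual performance (positive = overconfident)."""
    return estimated - actual

# Illustrative human figures from the quote above: predicted 18 correct,
# actually got 15 correct, and afterwards estimated 16 correct.
predicted_before = 18
actual_score = 15
estimated_after = 16

prospective = overconfidence(predicted_before, actual_score)   # 3
retrospective = overconfidence(estimated_after, actual_score)  # 1

print(f"Prospective overconfidence:   {prospective}")
print(f"Retrospective overconfidence: {retrospective}")
# For humans, the retrospective gap shrinks (3 -> 1): they partly recalibrate.
# The study found the LLMs' retrospective gap tended to stay the same or grow.
```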
The world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging, Cash acknowledged.
However, one strength of the study was that its data was collected over the course of two years, using continuously updated versions of the LLMs ChatGPT, Bard/Gemini, Sonnet and Haiku, so overconfidence was detectable across different models over that entire period.
“When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted,” said Danny Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences and co-author of the study.
“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I’m slow to answer, you might realize I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about,” said Oppenheimer.
While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life.
For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues,” including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs “hallucinated,” or produced incorrect information, in 69 to 88% of legal queries.
Clearly, the question of whether AI knows what it's talking about has never been more important. And the truth is that LLMs are not designed to answer everything users throw at them every day.
“If I’d asked ‘What is the population of London,’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration,” said Oppenheimer.
However, by asking questions about future events—such as the winners of the upcoming Academy Awards—or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots’ apparent weakness in metacognition—that is, the ability to be aware of one’s own thought processes.
“We still don’t know exactly how AI estimates its confidence,” said Oppenheimer, “but it appears not to engage in introspection, at least not skillfully.”
The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, correctly identifying an average of 12.5 hand-drawn images out of 20, while Gemini identified just 0.93 sketches out of 20, on average.
In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.
“Gemini was just straight up really bad at playing Pictionary,” said Cash. “But worse yet, it didn’t know that it was bad at Pictionary. It’s kind of like that friend who swears they’re great at pool but never makes a shot.”
For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions.
Of course, the study suggests LLMs may not always judge their own confidence accurately, but when a chatbot does acknowledge low confidence, that's a good sign its answer cannot be trusted.
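As a practical illustration of that advice, here is a small sketch of how a user or developer might prompt a chatbot for a confidence rating alongside its answer. It assumes the OpenAI Python client with an API key in the environment; the model name, prompt wording, and helper function are hypothetical choices for illustration, not something prescribed by the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_with_confidence(question: str, model: str = "gpt-4o-mini") -> str:
    """Ask a question and request a 0-100 self-rated confidence along with the answer."""
    prompt = (
        f"{question}\n\n"
        "After your answer, add a final line of the form 'Confidence: N/100', "
        "where N is how confident you are that your answer is correct."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask_with_confidence("Which film won the Academy Award for Best Picture in 2023?"))
```

As the study cautions, the rating that comes back may itself be poorly calibrated, so a low self-rating is a useful warning sign while a high one is not a guarantee.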
The researchers note that it’s also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets.
“Maybe if it had thousands or millions of trials, it would do better,” said Oppenheimer.
Ultimately, exposing weaknesses such as overconfidence will only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.
“If LLMs can recursively determine that they were wrong, then that fixes a lot of the problems,” said Cash.
“I do think it’s interesting that LLMs often fail to learn from their own behavior,” said Cash. “And maybe there’s a humanist story to be told there. Maybe there’s just something special about the way that humans learn and communicate.”
More information: Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments, Memory & Cognition (2025). DOI: 10.3758/s13421-025-01755-4