AI chatbots terrible at knowing when they don’t know something
By Stephen Beech
AI chatbots are “overconfident” – even when they’re wrong, warns new research.
They appear to be unaware of their own mistakes, say scientists, prompting concerns about their increasing use.
Artificial intelligence chatbots are now commonplace, from smartphone apps and customer service portals to online search engines.
American researchers asked both human participants and four large language models (LLMs) – including ChatGPT – how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Oscar ceremonies, or play a Pictionary-like image identification game.
Both the people and the LLMs tended to be overconfident about how they would hypothetically perform.
They also answered questions or identified images with relatively similar success rates, according to the study published in the journal Memory & Cognition.
But when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations.
Study lead author Dr Trent Cash, of Carnegie Mellon University (CMU), Pittsburgh, said: “Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right.
“Typically, their estimate afterwards would be something like 16 correct answers.
“So, they’d still be a little bit overconfident, but not as overconfident.
“The LLMs did not do that.
“They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
He acknowledged that the world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging.
But a strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku.
Dr. Cash said that means AI overconfidence was detectable across different models over time.
Co-author Professor Danny Oppenheimer, from CMU’s Department of Social and Decision Sciences, said: “When an AI says something that seems a bit fishy, users may not be as sceptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted.
“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans.
“If my brow furrows or I’m slow to answer, you might realise I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about.”
While the accuracy of LLMs at answering trivia questions and predicting American football results is relatively low stakes, the researchers say their findings hint at the “pitfalls” associated with integrating chatbot technology into daily life.
A recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues” – including factual errors, misattribution of sources and missing or misleading context.
Another study from 2023 found LLMs “hallucinated” – or produced incorrect information – in 69% to 88% of legal queries.
The researchers say LLMs are not designed to answer everything users are throwing at them on a daily basis.
Oppenheimer said: “If I’d asked ‘What is the population of London?’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration.”
But, by asking questions about future events – such as the winners of the upcoming Academy Awards – or more subjective topics, such as the intended identity of a hand-drawn image, the research team were able to expose the chatbots’ apparent weakness in “metacognition” – the ability to be aware of one’s own thought processes.
Oppenheimer said: “We still don’t know exactly how AI estimates its confidence, but it appears not to engage in introspection, at least not skillfully.”
The study also showed that each LLM has strengths and weaknesses.
Overall, the LLM known as Sonnet tended to be less overconfident than its peers.
Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, correctly identifying an average of 12.5 hand-drawn images out of 20, while Gemini identified just 0.93 sketches on average.
Gemini also predicted it would get an average of 10.03 sketches correct, and even after getting fewer than one of its 20 sketches right, it retrospectively estimated that it had answered 14.4 correctly, demonstrating its lack of self-awareness.
Dr. Cash said, “Gemini was just straight up really bad at playing Pictionary.
“But worse yet, it didn’t know that it was bad at Pictionary.
“It’s kind of like that friend who swears they’re great at pool but never makes a shot.”
For everyday chatbot users, Dr. Cash said the biggest takeaway is to remember that LLMs are not “inherently correct” and that it might be a good idea to ask them how confident they are when answering important questions.
The study suggests LLMs may not always judge their confidence accurately, but when a chatbot does acknowledge low confidence, it’s a good sign that its answer cannot be trusted.
The researchers say that it’s also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets.
Oppenheimer said, “Maybe if it had thousands or millions of trials, it would do better.”
The research team says that exposing the weaknesses, such as overconfidence, will only help those in the industry who are developing and improving LLMs.
And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.
Dr. Cash said: “If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem.”
He added: “I do think it’s interesting that LLMs often fail to learn from their own behavior.
“And maybe there’s a humanist story to be told there.
“Maybe there’s just something special about the way that humans learn and communicate.”
Originally published on talker.news, part of the BLOX Digital Content Exchange.