Why AI Chatbots Still Hallucinate: Researchers Trace Errors to Training Data Gaps and Misaligned Benchmarks


Large language models are trained by scanning enormous volumes of text and learning to predict which word comes next. That process gives them fluency, but it also builds in errors. The researchers' paper argues that even with perfectly clean training data, some mistakes are mathematically inevitable.
Some facts are simply too rare for a system to learn. A birthday that appears once in a dataset, for example, offers no pattern the model can generalize from. The authors call the share of facts that show up only once the "singleton rate." A high singleton rate means the model will almost certainly invent details when asked about those facts. This is why common knowledge tends to come back correct, while obscure details often come back scrambled.
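As a rough illustration, not the paper's formal definition, the idea can be sketched in a few lines of Python: count the fraction of distinct facts that appear exactly once in a corpus and treat it as a floor on how often the model will have to guess. The `facts` list and the `singleton_rate` helper below are invented for this example.

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that appear exactly once in the data.

    Illustrative only: the argument is that a model cannot reliably learn
    a fact it has effectively seen just once, so this fraction acts as a
    rough floor on hallucination for questions about such facts.
    """
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus: each string stands in for one stated fact.
facts = [
    "paris is the capital of france",   # appears twice -> learnable pattern
    "paris is the capital of france",
    "adam kalai's birthday is <date>",  # appears once -> singleton
]
print(singleton_rate(facts))  # 0.5 -> half the distinct facts are one-offs
```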

The training phase is only half the story. After that, models are fine-tuned to better match human expectations. But the way they are tested keeps the cycle going.
Benchmarks usually grade answers as right or wrong. There’s no credit for admitting uncertainty. A chatbot that says “I don’t know” is punished as harshly as one that blurts out something false. Under that system, guessing is the smarter move. Over time, models are effectively trained to bluff.
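A quick back-of-the-envelope calculation shows why. Under binary grading, abstaining always scores zero, while even a shaky guess has positive expected value; the probability used below is made up purely for illustration.

```python
# Expected score on one question under binary (right/wrong) grading.
# The numbers are illustrative, not taken from the study.

p_correct_guess = 0.2   # assumed chance the model's best guess is right

score_if_abstain = 0.0                                        # "I don't know" earns nothing
score_if_guess = p_correct_guess * 1.0 + (1 - p_correct_guess) * 0.0

print(score_if_guess)   # 0.2 > 0.0, so guessing is always the better policy
```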
The researchers compare this to multiple-choice exams. Students who leave blanks score lower than those who make lucky guesses. AI models, shaped by similar scoring, act in much the same way.
Examples from the study illustrate how deep the problem runs. One widely used model was asked for the birthday of Adam Kalai, one of the paper's authors. It gave three different dates across separate attempts, none of them right, even though it had been instructed to answer only if certain.
In another test, a system failed at counting the letters in a word, producing results that made little sense. These cases show both the arbitrary fact problem and what the authors call poor model representation, where the structure of the system limits its ability to handle simple tasks.
The researchers argue the solution lies in evaluation. Instead of rewarding risky guesses, new benchmarks should penalize confident wrong answers more than admissions of uncertainty. One option is to grant partial credit when a model holds back. Another is to set confidence thresholds in the test instructions, telling the model to answer only if it reaches a defined level of certainty.
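A minimal sketch of such a threshold rule follows, assuming, for illustration rather than as the paper's exact formulation, that a wrong answer is penalized t/(1−t) points while a correct answer earns one point and an abstention earns zero.

```python
# Sketch of a threshold-based scoring rule (penalty chosen as an assumption):
# correct answers earn 1 point, abstentions earn 0, and wrong answers lose
# t / (1 - t) points, where t is the confidence threshold stated in the
# test instructions.

def score(answered: bool, correct: bool, t: float) -> float:
    if not answered:
        return 0.0
    return 1.0 if correct else -t / (1 - t)

def expected_score_if_answered(confidence: float, t: float) -> float:
    """Expected score of answering, given the model's confidence it is right."""
    return confidence * 1.0 + (1 - confidence) * (-t / (1 - t))

t = 0.75
for confidence in (0.5, 0.75, 0.9):
    print(confidence, round(expected_score_if_answered(confidence, t), 3))
# Prints -1.0, 0.0, 0.6: below t the expected score is negative, so
# abstaining (0.0) is the rational choice; above t, answering wins.
```

Under a rule like this, answering only pays off in expectation when the model's confidence exceeds the stated threshold, which is exactly the behavior such instructions are meant to encourage.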
This echoes older exam systems where wrong guesses were penalized, discouraging blind attempts. The same principle could shift AI development toward models that value accuracy over bravado.
The study makes clear that hallucinations will not vanish completely. Some questions are inherently unanswerable because the data is missing, ambiguous, or too complex. But better testing could reduce the most damaging errors and build greater trust in AI systems.
The broader point is that hallucinations are not random glitches. They are the product of how models are trained, and more importantly, how they are judged. If the industry changes the scoreboards, the behavior of the models is likely to follow.

Notes: This post was edited/created using GenAI tools. 
