ChatGPT is blind to bad science

Generative AI is increasingly used for academic tasks, such as producing literature reviews. Er-Te Zheng and Mike Thelwall show that ChatGPT is critically flawed as an evaluator of scientific articles, as it fails to take retractions into account across a wide range of research.
Large language models (LLMs) like ChatGPT are rapidly being integrated into the workflows of academics, researchers, and students. They offer the promise of quickly synthesising complex information and assisting with literature reviews. But what happens when these powerful tools encounter discredited science? Can they distinguish between robust findings and research that has been retracted due to errors, fraud, or other serious concerns?
In our recent study, we investigated this question and found a significant blind spot. Our findings show that ChatGPT not only fails to recognise retracted articles, but often rates them as high-quality research and affirms their discredited findings as true.
This raises serious questions about the reliability of LLMs in academic settings. The scholarly record is designed to be self-correcting, with retractions serving as a crucial mechanism to flag and remove unreliable work. If LLMs, which are becoming a primary interface for accessing information, cannot process these signals, they risk amplifying and recirculating discredited science, potentially misleading users and polluting the knowledge ecosystem.
To test whether ChatGPT considers an article’s retraction status, we conducted two investigations. First, we identified 217 high-profile scholarly articles that had been retracted or had serious concerns raised about them, such as an expression of concern from the publisher. Using data from Altmetric.com, we systematically ranked retracted articles by their number of mentions in mainstream news media, on Wikipedia, and across social media platforms. This process ensured our sample represented the most visible and widely discussed cases of retracted articles, giving the LLM the best possible chance of having been exposed to information about their retraction status. We then submitted the title and abstract of each article to ChatGPT 4o-mini and asked it to evaluate the research quality, using the official guidelines of the UK’s Research Excellence Framework (REF) 2021. To ensure reliability, we repeated this process thirty times for each article.
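For readers curious about how such a repeated-scoring setup can be scripted, a minimal sketch is below. It assumes the OpenAI Python client and the gpt-4o-mini model; the prompt wording and the score-parsing step are illustrative placeholders rather than the exact REF 2021 instructions and procedure used in the study.

```python
# Minimal sketch (not the study's exact code): score one article 30 times
# with gpt-4o-mini and average the results. Prompt wording is illustrative.
import re
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REF_PROMPT = (
    "Acting as a Research Excellence Framework (REF) 2021 assessor, rate the "
    "research quality of the following article on the REF scale (1* to 4*). "
    "Reply with the score first, e.g. '3*', followed by a brief justification.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def score_article(title: str, abstract: str, runs: int = 30) -> float:
    """Query the model repeatedly and return the mean numeric score."""
    scores = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": REF_PROMPT.format(title=title, abstract=abstract)}],
        )
        reply = response.choices[0].message.content
        match = re.search(r"([1-4])\*", reply)  # pull out the first '<n>*' rating
        if match:
            scores.append(int(match.group(1)))
    return statistics.mean(scores) if scores else float("nan")
```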
The results were startling. Across all 6,510 evaluations, ChatGPT never once mentioned that an article had been retracted, corrected, or had any ethical issues. It did not seem to connect the retraction notice, often present in the article’s title or on the publisher’s page, with the content it was asked to assess. More concerningly, it frequently gave these flawed articles high praise. Nearly three-quarters of the articles received a high average score between 3* (internationally excellent) and 4* (world leading) (Fig.1). For the small number of articles that received low scores, ChatGPT’s reasoning pointed to general weaknesses in methodology or a lack of novelty. In a few cases involving topics like hydroxychloroquine for COVID-19, it noted the subject was “controversial”, but it never identified the specific errors or misconduct that led to the retraction.
Fig.1: Average ChatGPT REF score for the 217 high-profile retracted or concerning articles. Articles are listed in ascending order of ChatGPT score.
Our second investigation took a more direct approach. We extracted 61 claims from the retracted articles in our dataset. These ranged from health claims, such as “Green coffee extract reduces obesity”, to findings in other fields. We then asked ChatGPT a simple question for each one: “Is the following statement true?” We ran each query ten times to capture the variability in its responses.
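As a rough illustration of this second protocol, the sketch below repeats a truth query for each claim and stores the raw replies. In the study the responses were classified by reading them; nothing here should be taken as the authors' code, and the claim list, model name, and query wording are assumptions for illustration only.

```python
# Minimal sketch (illustrative, not the study's code): ask whether each claim
# is true ten times and keep the raw replies for later manual classification.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

claims = [
    "Green coffee extract reduces obesity",
    # ... remaining claims extracted from the retracted articles
]

def query_claim(claim: str, runs: int = 10) -> list[str]:
    """Return the model's raw answers to 'Is the following statement true?'."""
    answers = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Is the following statement true? {claim}"}],
        )
        answers.append(response.choices[0].message.content)
    return answers

results = {claim: query_claim(claim) for claim in claims}
```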
The model showed a strong bias towards confirming these statements. ChatGPT responded positively in about two-thirds of all instances, stating the claims were true, partially true, or consistent with existing research. It rarely stated that a statement was false (1.1%), unsupported by current research (7.0%), or not established (14.6%).
This tendency led ChatGPT to verify claims that are demonstrably false. For example, it repeatedly confirmed the validity of a cheetah species, Acinonyx kurteni, even though the fossil it was based on was exposed as a forgery and the associated article was retracted in 2012. Interestingly, the model did show more caution for high-profile public health topics, including some related to COVID-19, where it was less likely to endorse a retracted claim. This suggests that while safeguards may exist for particularly sensitive areas, they are not applied universally, leaving a wide array of discredited scientific information to be presented as fact.
Our research reveals a critical flaw in how a major LLM processes academic information. It appears unable to perform the crucial step of associating a retraction notice with the content of the article it invalidates. This is not simply an issue of the model’s training data being out of date; the majority of the articles we tested were retracted long before the model’s knowledge cut-off date. The problem appears to be a more fundamental failure to comprehend the meaning and implication of a retraction.
As universities and researchers increasingly adopt AI tools, this finding serves as a crucial warning. Relying on LLMs for literature summaries without independent verification could lead to the unknowing citation and perpetuation of false information. It undermines the very purpose of the scholarly self-correction process and creates a risk that “zombie research” will be given new life. While developers are working to improve the safety and reliability of their models, it is clear that for now, the responsibility falls on the user. The meticulous source-checking that defines rigorous scholarship is more important than ever. Until LLMs can learn to recognise the red flags of the scholarly record, we need to uphold the integrity of our work with a simple rule: always click through, check the status, and cite with care.
This post draws on the authors’ co-authored article, Does ChatGPT Ignore Article Retractions and Other Reliability Concerns?, published in Learned Publishing.
Image Credit: Google DeepMind via Unsplash.
Er-Te Zheng is a PhD researcher at the School of Information, Journalism and Communication at the University of Sheffield.
Mike Thelwall, Professor of Data Science, is in the Information School at the University of Sheffield, UK. He researches artificial intelligence, citation analysis, altmetrics, and social media. His current (free) book is “Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence”.