ChatGPT-4 is up to clinical reasoning – The Medical Republic
In another peek at where we’re up to with AI assisting with our jobs, this research finds that it’s worse at some bits, better at others, and frequently just as good.
Given the same medical data as attending physicians and residents at two large Boston medical centres in 2023, OpenAI’s ChatGPT-4 performed on par with both residents and attendings in diagnostic accuracy, overall clinical reasoning and cannot-miss diagnoses (defined as those posing “imminent threat to life or limb”).
But its clinical reasoning was more frequently wrong than that of the humans.
Twenty virtual clinical presentations were chosen from a set developed to teach and test clinical reasoning – prioritised differential diagnosis and illness script accuracy, problem representations, diagnostic probabilities and management plans. These cases were written and edited by expert practising doctors.
They covered pharyngitis, headache, abdominal pain, cough, dyspnoea, chest pain and arthralgia. New information was presented at four stages: triage presentation, review of systems, physical examination, and diagnostic testing.
Twenty-one attending doctors and 18 residents got one randomly selected case each. The AI (a large language model, freely available online) did all 20.
The responses were then marked for overall “evaluation of demonstration of clinical reasoning in the assessment section of admission notes”, including interpretive summary, differential diagnosis, explanation of lead diagnosis and alternative diagnosis explanation. The complete rubric is freely available from the study authors.
In clinical reasoning, ChatGPT-4 scored higher than the doctors overall (a median score of 10 out of a possible 10, versus nine for attendings and eight for residents). But incorrect reasoning was detected more frequently in the AI’s answers than in the residents’ answers (in 11 instances, versus two for residents and 10 for attendings).
For those playing along at home, this is what the physicians and ChatGPT-4 were asked to do:
You are an internal medicine clinician that is an expert at clinical reasoning. You are caring for the patient in the following case. You will be provided with four sections of the medical case scenario as it unfolds. For each section, provide a problem representation or a one sentence summary statement that highlights the most important elements from the case so far. Also provide your prioritized differential diagnosis with justification.
Please document your thinking just as you would if you were in a real healthcare setting so that we can evaluate your clinical reasoning.
And here is one of the presentations given to residents, attendings and the AI.
A 21-year-old male wrestler presents with a 6-week history of progressive right knee pain that is dull in nature, exacerbated by bending and prolonged standing, and more pronounced at the end of the day, without significant trauma but associated with a limp.
ChatGPT-4 suggested the following differential diagnosis with justification:
“Further history, physical examination, and potentially imaging will be key in narrowing down this differential diagnosis and arriving at a definitive diagnosis,” said ChatGPT-4.
The human offered “Problem representation: Monoarticular arthropathy”, with the following differential diagnosis/justification:
“Illness script is very limited (no review of systems or more advanced list of pertinent positives or negatives) so the differential is still extremely broad. However, based on the limited information (young male, presumably sexually active, wrestler) would consider musculoskeletal etiology or infectious etiology (sexually transmitted vs other endemic process, i.e. Lyme) at present. Rheumatologic process not exonerated based on the above.”
Participants were then given more information on the case:
A 21-year-old male with a history of potential exposure to Lyme in an endemic area and an unexplained rash, presenting with a significant right knee effusion, with MRI evidence of synovitis and a joint aspirate showing an inflammatory process with a predominance of polymorphonuclear cells but no evidence of infection or crystal-induced arthritis.
ChatGPT-4 offered the following options:
Justification: The knee effusion and synovitis might result from repeated minor trauma, especially in an active individual like a wrestler. However, this doesn’t explain the rash or potential exposure history.
“At this point, Lyme arthritis remains at the top of the differential, given the constellation of symptoms, history, and investigations. Serologic tests for Lyme disease (like ELISA followed by Western blot) should be considered to confirm the diagnosis. The presence of synovitis without evident infection, the polymorphonuclear-dominant cell count in the joint fluid, the history of potential erythema migrans, and the exposure history all strengthen the suspicion of Lyme arthritis,” said ChatGPT-4.
Despite the overall clinical reasoning ability demonstrated above, the AI also came up with some real clangers.
Everybody scored highly for clinical diagnosis. ChatGPT-4 had the same score as residents for cannot-miss diagnoses (67%), with attendings scoring 50%.
Refreshingly, all the data from this study is freely provided by the authors to anyone who wants it. That’s part of setting an example for how we need to engage with AI in the health setting, study author Dr Adam Rodman, a general internist at Beth Israel Deaconess Medical Center in Boston (USA) and co-director of the Innovations in Media and Education Delivery (iMED) Initiative, told TMR.
“I’m not an AI researcher (or at least, not a computer scientist or informatician). I’m a practicing physician who researches how human physicians think,” Dr Rodman said.
“What I’m hoping to do is to encourage the highest clinical standards in AI research ― focusing on actual workflows, clinically meaningful interventions, and holding us to the standards that we do in other parts of clinical research (leading up to RCTs). In this case, as an experimental ‘Turing test’ type study, having our data open is part of that.”
JAMA Intern Med 2024, online 1 April