ChatGPT Makes OK Clinical Decisions—Usually
But don’t think about replacing your doctor with a chatbot now, or ever
Could ChatGPT one day assist doctors in diagnosing patients? It just might be possible.
In a recent study, researchers fed ChatGPT information about fictional patients from an online medical reference manual to find out how well the chatbot could make clinical decisions such as diagnosing patients and prescribing treatments. The researchers found that ChatGPT was 72 percent accurate in its decisions, although the bot was better at some kinds of clinical tasks than others. It also showed no evidence of bias based on age or gender. Though the study was small and did not use real patient data, the findings point to the potential of chatbots to help make medical care more efficient and less biased.
“This study was looking at GPT’s performance throughout the entire clinical scenario,” said Marc Succi, the associate chair of innovation and commercialization at Mass General Brigham, a health care system in the Boston area, and the senior author of the study.
Published in the Journal of Medical Internet Research on 22 August, the study used all 36 clinical vignettes from the Merck Manual, an online medical reference manual, as stand-in patients for ChatGPT to diagnose and treat. Clinical vignettes are patient case studies used to train health care professionals’ critical-thinking and decision-making skills. The researchers input the text of each vignette, then ran through the questions presented in the manual for each case. They excluded any questions that required examining images, because ChatGPT is text-based.
Researchers first directed the bot to generate a list of differential diagnoses based on the vignette—in other words, a list of possible diagnoses that can’t be initially ruled out. The chatbot was then asked to suggest which tests should be performed, followed by a request for a final diagnosis. Finally, researchers asked ChatGPT what treatment or follow-up care the patient should receive. Some of the questions from the manual also asked ChatGPT about the medical details of each case, which weren’t necessarily relevant to recommending clinical care.
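The paper doesn’t publish its exact prompts, but the workflow it describes maps naturally onto a scripted, stepwise conversation with the model. The sketch below illustrates that idea, assuming the OpenAI Python client; the vignette text, system prompt, question wording, and model name are placeholders for illustration, not the study’s materials.

```python
# A minimal sketch of stepwise clinical prompting, not the study's actual protocol.
# Assumes the OpenAI Python client (pip install openai) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

VIGNETTE = "..."  # text of one clinical vignette (placeholder; not reproduced here)

# The four question types the study walks through, in order (wording is illustrative).
STEPS = [
    "List the differential diagnoses for this patient.",
    "Which diagnostic tests should be performed next?",
    "Given the test results, what is the final diagnosis?",
    "What treatment or follow-up care should this patient receive?",
]

messages = [
    {"role": "system", "content": "You are assisting with a clinical case study."},
    {"role": "user", "content": VIGNETTE},
]

for question in STEPS:
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = reply.choices[0].message.content
    # Keep the bot's answer in the conversation so later steps build on earlier ones.
    messages.append({"role": "assistant", "content": answer})
    print(f"Q: {question}\nA: {answer}\n")
```

Carrying the conversation forward in this way mirrors how the study moved from differential diagnosis to testing, final diagnosis, and care recommendations for each case.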
Overall, ChatGPT gave responses that were 72 percent accurate, but the accuracy varied depending on the type of clinical task. The task that the chatbot was most effective at was accurately making a final diagnosis once it was given both the initial patient information and additional diagnostic testing results, with a 77 percent success rate. Questions designated as “miscellaneous,” which asked about medical details of each case, achieved a similar accuracy at 76 percent.
However, the chatbot wasn’t as effective at completing other types of clinical tasks. It was about 69 percent effective at both recommending the correct diagnostic tests for the initial patient description and prescribing treatment and follow-up care once it made a final diagnosis. ChatGPT fared the worst when it came to differential diagnosis, with only 60 percent accuracy.
Succi said he wasn’t surprised that the chatbot struggled the most with differential diagnosis. “That’s really what medical school and residency is—it’s being able to come up with good differentials with very little presenting information,” he said.
Succi said there is still a long way to go before chatbots might be a routine part of the clinical work of doctors. ChatGPT itself may never play that role, said James Chow, an associate professor of radiation oncology at the University of Toronto who was not involved with the study. Because of the way ChatGPT works, he said, it’s impossible to fully know or control how data is used or the way the bot presents it. In his research, Chow is working to develop a medical chatbot that is more specifically trained to handle and present medical information.
Even if specialized chatbots someday act as assistants in a doctor’s office, they should never replace a human doctor, said Paul Root Wolpe, the director of the Center for Ethics at Emory University in Atlanta, who was not involved with the study.
“I think that well-tested and designed chat programs can be an aid to physicians; they should never replace physicians,” Wolpe said. As with any medical technology, he said, a clinical-trial process would be needed to determine whether chatbots can be used with actual patients.
One advantage of using a chatbot like ChatGPT might be a reduction in medical bias. In the study, researchers didn’t find evidence of any difference in the program’s responses relative to a patient’s age or gender, which were given in each vignette. However, Wolpe said that bias could still show up in a bot’s responses in cases where the underlying data and medical research are themselves biased. Examples might include pulse oximeter readings in people with darker skin, or heart attack symptoms in women, which studies have shown are less likely to match what people think of as “typical” heart attack symptoms.
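The article doesn’t spell out how that age and gender comparison was made. One generic way to look for such a gap, sketched below with placeholder counts rather than the study’s data, is a chi-square test on correct-versus-incorrect answers grouped by the demographic given in each vignette (here using SciPy).

```python
# A generic check for accuracy differences by patient gender.
# The study's own statistical method isn't described in the article,
# and the counts below are placeholders, not data from the paper.
from scipy.stats import chi2_contingency

# Rows: patient gender in the vignette; columns: [correct, incorrect] answers.
contingency = [
    [40, 15],  # placeholder counts for vignettes with female patients
    [38, 17],  # placeholder counts for vignettes with male patients
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
# A large p-value would indicate no detectable accuracy gap between the groups.
```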
The study has several limitations, including that it didn’t use actual patient data and included only a small number of (fictional) patients. The fact that the researchers don’t know how ChatGPT was trained is also a limitation, Succi said. And though the results are encouraging, he added, chatbots won’t be replacing your doctor anytime soon. “Your physician isn’t going anywhere,” he said.
Rebecca Sohn is a freelance science journalist. Her work has appeared in Live Science, Slate, and Popular Science, among others. She has been an intern at STAT and at CalMatters, as well as a science fellow at Mashable.