Is AI better than your doctor? A new study tests the ability of AI to get the right diagnosis

Credit: Pixabay / geralt 
Much has been made of fears for the coming age of Artificial Intelligence. While there are many potential negative outcomes, from mass unemployment and the deskilling of humans to the extreme of robots taking over, AI could also lead to dramatic improvements in many areas, including medicine.
A new study from researchers at Mass General Brigham takes a step towards figuring out how AI could aid doctors. The study looked at 36 different clinical vignettes and found that ChatGPT was nearly 72% accurate overall in clinical decision making.
The researchers, who published their results in a new paper, used the large language model (LLM)-based artificial intelligence chatbot and observed that it performed equally well in primary care and emergency settings.
The goal for the team was to understand whether ChatGPT could successfully work through a full clinical encounter with a patient, from the initial workup to establishing a final diagnosis and making clinical management decisions.
In a press release, the paper’s corresponding author, Marc Succi, MD, of Mass General Brigham, equated the performance of the ChatGPT "doctor" to that of an intern or a resident.
Certainly, there is more work to be done before AI reaches the capabilities and knowledge of an experienced physician, although a recent study found doctors achieving a similar rate of about 71% accuracy on diagnoses. As Dr. Succi stated, their study “tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy.”
The study involved pasting portions of 36 standardized, previously published patient cases into ChatGPT. Given information such as the patient’s age, gender, symptoms, and how urgent the case was, the AI was first tasked with making an initial diagnosis.
After completing that step, ChatGPT received additional information and was tasked with making a final diagnosis and deciding how to manage the patient’s care. In this way, the full process of seeing a real patient was simulated.
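The authors pasted the vignettes into ChatGPT rather than publishing code, but the staged nature of the prompting is easy to picture. Below is a minimal sketch, assuming the OpenAI Python client, the "gpt-3.5-turbo" model name, and an entirely hypothetical vignette, of how information could be released to the model in stages so that each answer builds on what came before; it illustrates the idea, not the study’s actual pipeline.

```python
# Minimal sketch of staged prompting through a simulated clinical encounter.
# Assumes the openai Python package (v1+) and an API key in OPENAI_API_KEY.
# The vignette fragments below are hypothetical placeholders, not the
# standardized cases used in the study.
from openai import OpenAI

client = OpenAI()

# Portions of one hypothetical vignette, released stage by stage.
vignette_stages = [
    ("differential diagnosis",
     "A 45-year-old woman presents with acute chest pain and shortness of "
     "breath. List the most likely differential diagnoses."),
    ("diagnostic testing",
     "Additional history and vital signs: ... Which diagnostic tests would "
     "you order next?"),
    ("final diagnosis",
     "Test results: ... What is the most likely final diagnosis?"),
    ("clinical management",
     "Given that diagnosis, outline the initial management plan."),
]

# Keep the whole conversation so each stage builds on the information before it.
messages = [{"role": "system",
             "content": "You are assisting with a simulated clinical encounter."}]

for stage, prompt in vignette_stages:
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model="gpt-3.5-turbo",
                                              messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- {stage} ---\n{answer}\n")
```

Carrying the running message history forward is what lets the model act on progressively more information, which is the successive prompting the paper describes.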
The researchers gave the AI points for correct answers at each stage, looking at how well it did in differential diagnosis, diagnostic testing, management, and final diagnosis. They found that ChatGPT achieved nearly 72% accuracy overall, performing best when making final diagnoses, where its accuracy reached 77%.
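The scoring itself reduces to a proportion: human scorers mark each answer correct or incorrect, and accuracy per category is correct answers divided by questions asked. A toy sketch of that bookkeeping, with made-up grades rather than the study’s data:

```python
# Toy per-category accuracy tally; the graded answers are illustrative
# placeholders, not the study's actual scoring data.
from collections import defaultdict

# (category, was the answer correct?) as a human scorer might record it.
graded_answers = [
    ("differential diagnosis", True), ("differential diagnosis", False),
    ("diagnostic testing", True), ("diagnostic testing", True),
    ("clinical management", True), ("clinical management", False),
    ("final diagnosis", True), ("final diagnosis", True),
]

tallies = defaultdict(lambda: [0, 0])  # category -> [correct, total]
for category, correct in graded_answers:
    tallies[category][0] += int(correct)
    tallies[category][1] += 1

for category, (correct, total) in tallies.items():
    print(f"{category}: {correct}/{total} = {correct / total:.0%}")

overall_correct = sum(c for c, _ in tallies.values())
overall_total = sum(t for _, t in tallies.values())
print(f"overall: {overall_correct}/{overall_total} = "
      f"{overall_correct / overall_total:.1%}")
```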
The AI did worst in differential diagnosis, with its accuracy dropping to only 60% when it needed to choose between two or more conditions with similar symptoms. Clinical management decisions, like picking the medicine to give in each case, were also not its strong suit, coming in at about 68% accuracy.
Dr. Succi shared that this relative lack of success in differential diagnosis by ChatGPT is important to note “because it tells us where physicians are truly experts and adding the most value — in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed.”
Credit: Journal of Medical Internet Research 
Interesting Engineering emailed Arya Rao, PhD candidate at Harvard/MIT and co-author of the study, for more insight on the paper.
Rao addressed how the team compared the accuracy of ChatGPT to that of human doctors, explaining that the patient vignettes they used were originally meant to test the knowledge of healthcare providers. In that regard, answering 100% correctly would be the “gold-standard of clinical decision making” and the baseline against which they evaluated ChatGPT's performance.
Rao explained that there are some intangible factors that are part of interacting with a human doctor. In particular, Rao stated, “the patient-doctor relationship is crucial; empathy, observation, and critical thinking are all necessary in clinical care, and none of these can be performed by GPT.” The role of AI, wrote Rao, “is to assist physicians, not replace them.” 
One aspect of that assistance could be carrying out time-consuming administrative tasks, like writing notes and processing billing, taking that burden off doctors. AI could also use its powerful data-processing capabilities to search electronic health records and medical databases more efficiently for the information used to arrive at a diagnosis or treatment plan.
But will AI surpass human doctors at some point? Ultimately, Rao said, “I don't think that AI vs physicians is a fair comparison; the utility of AI is to augment physicians in their workflows."
Rao also addressed why ChatGPT struggled with differential diagnosis, explaining that ChatGPT is generally designed to act only on the information it is given. For the differential diagnosis questions, however, it was provided with less information, which could have limited its performance. It had more data for the final diagnosis questions, where it did better.
Check out the study “Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study,” published in the Journal of Medical Internet Research. 
Background:
Large language model (LLM)–based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.
Objective:
This study aimed to evaluate ChatGPT’s capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.
Methods:
We inputted all 36 published clinical vignettes from the Merck Sharpe & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT’s performance on clinical tasks.
Results:
ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=–15.8%; P<.001) and clinical management (β=–7.4%; P=.02) question types.
Conclusions:
ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT’s training data set.
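The Methods and Results above describe accuracy as a proportion of correct responses reported with 95% confidence intervals, plus a linear regression on the factors contributing to performance. A rough sketch of that style of analysis using statsmodels is below; the table of scored answers, its column names and values, and the model specification are all assumptions for illustration, not the authors’ code or data.

```python
# Sketch of the style of analysis described in the abstract: proportion-correct
# accuracy with a 95% CI, and a linear regression of correctness on question
# type, patient age, gender, and case acuity. The data frame is a hypothetical
# stand-in for the study's scored responses.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.proportion import proportion_confint

scores = pd.DataFrame({
    "correct":       [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "question_type": ["differential", "differential", "testing", "testing",
                      "management", "management", "final", "final",
                      "general", "general"],
    "age":           [45, 62, 38, 57, 30, 29, 71, 66, 55, 48],
    "gender":        ["F", "M", "F", "M", "F", "F", "M", "M", "F", "M"],
    "acuity":        ["emergency", "primary care", "emergency", "primary care",
                      "primary care", "primary care", "emergency", "emergency",
                      "primary care", "emergency"],
})

# Overall accuracy with a normal-approximation 95% confidence interval.
n_correct, n_total = int(scores["correct"].sum()), len(scores)
low, high = proportion_confint(n_correct, n_total, alpha=0.05, method="normal")
print(f"accuracy {n_correct / n_total:.1%} (95% CI {low:.1%}-{high:.1%})")

# Linear regression of correctness on question type and case characteristics,
# using general medical-knowledge questions as the reference category.
model = smf.ols(
    "correct ~ C(question_type, Treatment(reference='general'))"
    " + age + gender + acuity",
    data=scores,
).fit()
print(model.summary())
```

With only ten made-up rows the estimates are meaningless; the point is the shape of the analysis: proportion-based accuracy with a confidence interval, and regression coefficients (the β values quoted in the results) for each question type relative to general medical-knowledge questions.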
