General-purpose LLMs can be used to track true critical findings – AuntMinnie


General-purpose large language models (LLMs), such as GPT-4, can be adapted to detect and categorize multiple critical findings within individual radiology reports, using minimal data annotation, researchers have reported.
A team led by Ish Talati, MD, of Stanford University, with colleagues from the Arizona Advanced AI and Innovation (A3I) Hub and Mayo Clinic Arizona, retrospectively evaluated two “out-of-the-box” LLMs — GPT-4 and Mistral-7B — to see how well they might perform at classifying findings indicating medical emergency or requiring immediate action, among others. Their results were published on September 10 in the American Journal of Roentgenology.
Timely critical findings communication can be challenging due to the increasing complexity and volume of radiology reports, the authors noted. “Workflow pressures highlight the need for automated tools to assist in critical findings’ systematic identification and categorization,” they said.
The study demonstrated that few-shot prompting, in which a small number of worked examples is included to guide the model, can help general-purpose LLMs adapt to the complex medical task of sorting findings into distinct actionable categories.
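To make the technique concrete, here is a minimal sketch of how a few-shot prompt for this kind of task might be assembled. The report texts, category names, and message structure below are illustrative assumptions, not details taken from the study; the actual prompts used by Talati and colleagues are not described in this article.

```python
# Hypothetical few-shot prompt assembly for critical-findings
# categorization. Example reports, answers, and the category names
# ("emergent", "urgent", "actionable") are made up for illustration.

EXAMPLES = [
    ("CT head: acute subdural hematoma with midline shift.",
     '{"emergent": ["acute subdural hematoma"], "urgent": [], "actionable": []}'),
    ("Chest radiograph: no acute cardiopulmonary abnormality.",
     '{"emergent": [], "urgent": [], "actionable": []}'),
]

def build_messages(report_text, examples=EXAMPLES):
    """Assemble a chat-style few-shot prompt: a task instruction,
    then worked example pairs, then the report to classify."""
    messages = [{
        "role": "system",
        "content": ("You are a radiology assistant. List critical findings "
                    "from the report under the categories emergent, urgent, "
                    "and actionable, returning JSON."),
    }]
    for report, answer in examples:
        messages.append({"role": "user", "content": report})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": report_text})
    return messages
```

The resulting message list can be passed to any chat-completion-style API; the few-shot examples condition the model to emit the same structured format for the new report.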
To that end, Talati and colleagues evaluated GPT-4 and Mistral-7B on more than 400 radiology reports selected from the MIMIC-III database of deidentified health data from patients in the intensive care unit (ICU) at Beth Israel Deaconess Medical Center from 2001 to 2012. 
Analysis included 252 radiology reports of varying modalities (56% CT, approximately 30% radiography, and 9% MRI) and anatomic regions (mostly chest, pelvis, and head).
The reports were divided into a prompt engineering tuning set of 50, a holdout test set of 125, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set consisted of 180 chest x-ray reports extracted from the CheXpert Plus database.
Manual reviews, performed separately by a board-certified radiologist and by software, classified the reports by consensus into one of three actionable categories.
The models analyzed the submitted report and provided structured output containing multiple fields, listing model-identified critical findings within each of the three categories, according to the group. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall) in the three categories.
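The manual precision and recall metrics mentioned above can be illustrated with a short sketch: comparing the set of model-identified findings in a category against the reference-standard set for the same category. The finding names below are invented for the example, and this is a simplified set-overlap calculation, not the study's actual scoring code (which the article does not describe).

```python
# Illustrative per-category precision/recall for model-identified
# critical findings versus a reference standard. Both inputs are
# sets of finding strings; finding names here are hypothetical.

def precision_recall(predicted, reference):
    """Return (precision, recall) for one report and one category.

    precision = true positives / all model-identified findings
    recall    = true positives / all reference findings
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

predicted = {"pneumothorax", "rib fracture"}
reference = {"pneumothorax", "pulmonary embolism"}
p, r = precision_recall(predicted, reference)
# one of two predictions is correct, and one of two reference
# findings is retrieved, so p == 0.5 and r == 0.5
```

In practice, matching free-text findings requires normalization (e.g., synonym handling), which is why the study paired these manual metrics with automated text-similarity measures such as BLEU-1 and ROUGE-F1.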
Precision and recall comparison for LLMs tracking true critical findings

| Type of test set and classification | GPT-4 | Mistral-7B |
| --- | --- | --- |
| **Precision** | | |
| | 90.1% | 75.6% |
| | 80.9% | 34.1% |
| | 80.5% | 41.3% |
| | 82.6% | 75% |
| | 76.9% | 33.3% |
| | 70.8% | 34% |
| **Recall** | | |
| | 86.9% | 77.4% |
| | 85% | 70% |
| | 94.3% | 74.3% |
| | 98.3% | 93.1% |
| | 71.4% | 92.9% |
| | 85% | 80% |
“GPT-4, when optimized with just a small number of in-context examples, may offer new capabilities compared to prior approaches in terms of nuanced context-dependent classifications,” Talati and colleagues wrote. “This capability is crucial in radiology, where identification of findings warranting referring clinician alerts requires differentiation of whether the finding is new or already known.”
Though promising, the approach needs further refinement and technical development before clinical implementation and real-world application, the group noted. The authors also highlighted a role for electronic health record (EHR) integration to inform more nuanced categorization in future implementations.