Virtual case reasoning and AI-assisted diagnostic instruction: an empirical study based on Body Interact and large language models

BMC Medical Education volume 25, Article number: 1493 (2025)
Integrating large language models (LLMs) with virtual patient platforms offers a novel approach to teaching clinical reasoning. This study evaluated the performance and educational value of combining Body Interact with two AI models, ChatGPT-4 and DeepSeek-R1, across acute care scenarios.
Three standardized cases (coma, stroke, trauma) were simulated by two medical researchers. Structured case summaries were input into both models using identical prompts. Outputs were assessed for diagnostic and treatment consistency, alignment with clinical reasoning stages, and educational quality using expert scoring, AI self-assessment, text readability indices, and Grammarly analysis.
ChatGPT-4 performed best in stroke scenarios but was less consistent in coma and trauma cases. DeepSeek-R1 showed more stable diagnostic and therapeutic output across all cases. While both models received high expert and self-assessment scores, ChatGPT-4 produced more readable outputs, and DeepSeek-R1 demonstrated greater grammatical precision.
Our findings suggest that ChatGPT-4 and DeepSeek-R1 each offer unique strengths for AI-assisted instruction. ChatGPT-4’s accessible language may better support early learners, whereas DeepSeek-R1 may be more aligned with formal clinical reasoning. Selecting models based on specific teaching goals can enhance the effectiveness of AI-driven medical education.
With the rapid digitalization and intelligent transformation of medical education, virtual patient (VP) systems have emerged as immersive tools that simulate realistic clinical scenarios, enabling students to practice diagnostic reasoning and decision-making in a safe and repeatable environment [1,2,3]. Interactive platforms such as Body Interact exemplify this approach and have been widely adopted and validated by medical education institutions across several countries, demonstrating their effectiveness in enhancing clinical thinking and adaptability [4,5,6].
In parallel, large language models (LLMs) such as ChatGPT-4 and DeepSeek-R1 have shown remarkable progress in medical text understanding and generation, with promising capabilities in clinical decision support, case analysis, and medical writing [7,8,9]. In the context of medical education, studies have demonstrated that ChatGPT-4 can simulate physician-patient interactions, explain pathophysiological mechanisms, and propose preliminary diagnoses, at times performing on par with professional physicians [10]. For instance, GPT-4 has outperformed emergency physicians in diagnostic accuracy [11], achieved consultant-level performance in inpatient neurology [12], and reached a 57% success rate in complex diagnostic tasks, surpassing 99.98% of human online responses [13]. It has also achieved notable results in spine disease classification [14], neuroradiology interpretation, and USMLE-style examinations [15, 16]. These findings underscore the model's potential for personalized instruction and clinical reasoning training. However, limitations remain: for example, its performance on the Taiwanese Traditional Chinese Medicine licensing exam lagged behind human candidates, reflecting challenges in adapting to specialized knowledge domains [17].
As a domestically developed LLM in China, DeepSeek-R1 has been deployed across hundreds of tertiary hospitals since 2024, showing strong performance in clinical question answering and reasoning tasks [18]. It achieved a 93% accuracy rate on the MedQA benchmark [19] and demonstrated favorable performance and cost-effectiveness in ophthalmic clinical decision-making [20]. The model has also been applied extensively in research support, patient management, and hospital operations, contributing to the broader development of intelligent healthcare systems and integrated medical education frameworks [21, 22].
Despite these advancements, most existing studies assess LLMs and VP systems in isolation, without exploring their integration in actual teaching contexts. In particular, few empirical studies have evaluated how combining LLMs with virtual simulations might support structured clinical reasoning instruction. Moreover, medical students still face restricted patient exposure and lack objective, standardized tools for reasoning training and assessment. This limits the scalability and effectiveness of competence-based curricula.
Therefore, integrating LLMs into VP-based instruction may offer a transformative approach, enabling reproducible, personalized, and measurable reasoning experiences. Such integration not only supports real-time feedback and reflective learning but also lays the groundwork for developing AI-assisted evaluation frameworks that align with the evolving demands of intelligent and interdisciplinary medical education.
This study aims to address several pressing challenges in current medical education, including limited access to structured clinical reasoning training, the lack of standardized criteria for evaluating reasoning quality, and the absence of clear pathways for integrating artificial intelligence (AI) technologies into instructional workflows. Specifically, it explores the potential application of LLMs within virtual patient systems. The primary objectives are as follows:
To investigate the value of AI tools in supporting virtual case-based clinical reasoning, focusing on their ability to generate diagnostic and treatment suggestions, simulate structured clinical thinking, and offer textual feedback during simulated diagnostic tasks.
To compare the overall performance of ChatGPT-4 and DeepSeek-R1 in virtual clinical reasoning scenarios, assessing differences in reasoning accuracy, language clarity, logical coherence, and domain-specific responsiveness.
To examine the consistency between AI-generated diagnostic outputs and feedback from the virtual patient system, and to explore their educational relevance in promoting clinical thinking, reflective learning, and error awareness within simulated environments.
To develop a practical “virtual reality (VR) + AI” integrated instructional model, embedding LLMs into the virtual diagnostic workflow and designing representative interactive learning pathways as part of a future-oriented, intelligent, and interdisciplinary medical education framework.
This research presents several novel contributions:
It is the first empirical study to systematically compare mainstream large language models within a simulated instructional context using a virtual patient platform, from a medical education research perspective.
It proposes an exploratory, multidimensional evaluation framework for assessing clinical reasoning quality, contributing to the objective analysis of AI-assisted instructional interventions.
It introduces an “AI-assisted + virtual patient” hybrid instructional approach, offering a scalable and theoretically grounded model to support the intelligent transformation and interdisciplinary advancement of medical education.
This study was conducted using the Body Interact virtual patient platform (trial version) in conjunction with two large language models (LLMs): ChatGPT-4, developed by OpenAI, and DeepSeek-R1, developed by DeepSeek. A virtual case–based reasoning and AI-assisted diagnostic teaching protocol was designed and implemented to explore the integration of artificial intelligence in clinical reasoning instruction.
This section provides a systematic overview of the functional features of the selected platforms and models, as well as their specific roles and applications within the instructional workflow.
Body Interact is a clinically oriented virtual simulation platform widely used in medical education and clinical reasoning training (https://bodyinteract.com/). It offers an interactive, time-sensitive clinical environment in which learners can assess virtual patients, perform examinations, interpret diagnostic results, and initiate therapeutic interventions. All clinical cases are developed based on evidence-based guidelines and are supported by a dynamic physiological feedback system, designed to replicate the complexity of real-world clinical decision-making.
Due to access limitations of the trial version, only three standardized acute care cases (coma, stroke, and trauma) were available. These cases were selected for their high clinical frequency, diagnostic complexity, and pedagogical value in modeling clinical reasoning in urgent care contexts. Each represents a distinct type of critical condition and forms the basis for the AI-assisted clinical reasoning analysis in this study.
A case involving a patient with hypoglycemia-induced loss of consciousness. The instructional focus is on the rapid identification of metabolic etiologies and the exclusion of other potential causes such as intracranial injury or toxic exposure. This case is designed to reinforce the differential diagnosis and management process for metabolic coma.
Patient: Male, 61 years old; Weight: 83.0 kg; Height: 185 cm; BMI: 24.3.
Chief Complaint: Sudden loss of consciousness, regained alertness upon arrival.
History of Present Illness: Collapsed at the office and was brought to the emergency department by colleagues. The patient later regained consciousness and walked into the ED independently, complaining of generalized weakness and malaise.
Background: Worked late the previous night; skipped lunch due to intense focus on a project.
A suspected case of acute ischemic stroke, emphasizing the importance of rapid recognition protocols (e.g., FAST), use of NIHSS scoring, and decision-making within thrombolysis or transfer time windows.
Patient: Female, 75 years old; Weight: 75.0 kg; Height: 160 cm; BMI: 29.3.
Chief Complaint: Right upper limb weakness and slurred speech.
History of Present Illness: Found on the floor at home by her daughter, with complaints of right arm weakness, expressive difficulty, and disorientation. Neurological examination revealed right-sided motor weakness and dysarthria.
Background: History of hypertension, taking antihypertensive (Carvedilol) and antiplatelet (Aspirin) medications. Occasional palpitations reported but not regularly treated.
A trauma case involving a patient with a chest stab wound leading to tension pneumothorax. The educational focus lies in the rapid recognition of life-threatening thoracic injuries and the initiation of emergency decompression or drainage.
Patient: Male, 27 years old; Weight: 83.0 kg; Height: 175 cm; BMI: 27.1.
Chief Complaint: Penetrating chest trauma.
History of Present Illness: Attacked while walking near his residence and sustained a stab wound while attempting to escape.
Background: No detailed past medical history available; emergency management must prioritize assessment of injury location, hemorrhage, and hemodynamic stability.
While the number and diversity of cases were limited, they provided a structured framework for simulating real-world diagnostic scenarios and enabled systematic comparison of AI model performance in clinical decision support.
To support learners in diagnostic reasoning and promote reflective learning with personalized feedback, this study incorporated two generative AI language models:
ChatGPT-4 is a large-scale language model based on the Transformer architecture, trained on a wide range of internet content and academic texts. It demonstrates strong capabilities in medical dialogue and clinical communication. In this study, ChatGPT-4 was accessed via the official OpenAI platform (https://chat.openai.com), using the default GPT-4 Turbo configuration for task execution. The model was applied to generate diagnostic suggestions, analyze reasoning pathways, and assess clinical decision-making processes.
DeepSeek-R1 is a bilingual (Chinese-English) large language model developed by DeepSeek, designed to support general-purpose text generation, complex reasoning, and domain-specific applications. Accessed through its official platform (https://www.deepseek.com), DeepSeek-R1 was utilized in this study to generate clinical reasoning outputs, including differential diagnosis suggestions, pathophysiological explanations, and evaluations of clinical decision-making pathways. Its dual-language capabilities also allowed for preliminary exploration of potential use in bilingual instructional contexts.
Both models operated independently, analyzing diagnostic summaries and management notes submitted by researchers within the virtual patient cases. To ensure comparability of outputs, standardized AI prompts were used across all tasks. Clinical accuracy, contextual relevance, and pedagogical value of model-generated content were subsequently assessed by medical experts. Despite differences in backend architecture and platform interfaces, measures were taken to standardize usage conditions and minimize interface-based bias.
This study did not involve real students or clinical learners. Instead, it adopted an empirical research design involving two researchers with medical backgrounds. Each researcher independently operated a separate computer terminal to conduct the virtual case simulations in parallel. A standardized protocol was followed to ensure consistency across case simulation, information extraction, AI model input, and feedback verification. This two-researcher design aimed to reduce procedural bias and minimize the risk of confirmation bias associated with single-operator studies, thereby enhancing the reliability of the AI output evaluation process.
The primary goal was to conduct a preliminary feasibility assessment of integrating AI-based reasoning tools into virtual patient simulations, rather than to evaluate educational effectiveness in learners.
The experimental process consisted of five distinct stages, as outlined below:
The researcher logged into the Body Interact platform and sequentially engaged with the three selected representative cases (coma, stroke, and trauma). For each case, the researcher systematically recorded key clinical information including chief complaint, history of present illness, physical examination findings, vital signs, and relevant laboratory results. These details were compiled into a standardized case summary, which served as the input material for the subsequent AI model queries.
The standardized case summaries were then input into the two large language models: ChatGPT-4 and DeepSeek-R1. Each model generated outputs including initial differential diagnoses, recommended diagnostic tests, and suggested treatment plans. All generated content was preserved in its original form and categorized for further analysis. To ensure reproducibility and statistical validity, each AI model processed all three cases consecutively in one cycle and repeated this cycle three times. All outputs were then incorporated into the final comparative evaluation.
Furthermore, to minimize potential biases from conversation memory or prompt dependency, multiple control measures were implemented. All procedures were conducted on the same device and under identical network conditions by independent researchers using new, untrained accounts. In addition, chat histories were cleared after each case to prevent residual contextual influence.
Both AI models received identical English-language prompts in a standardized format. Each case was presented using a structured bullet-point layout derived from the Body Interact platform. After inputting the case information, a consistent instruction was appended: “Please provide an initial differential diagnosis and recommend the next steps for examination and treatment.” No bilingual customization or fine-tuned prompts were applied for DeepSeek-R1, and both models were reset between cases to maintain consistent baseline conditions.
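For illustration, the sketch below shows how such an input might be assembled programmatically. The exact field set and bullet layout are assumptions reconstructed from the case details reported above; only the final instruction sentence is quoted verbatim from the protocol.

```python
# Hypothetical reconstruction of the standardized prompt sent to both models.
# Field selection and wording of the summary are illustrative assumptions.
case_summary = """\
- Patient: male, 61 years old; weight 83.0 kg; height 185 cm; BMI 24.3
- Chief complaint: sudden loss of consciousness, regained alertness upon arrival
- History of present illness: collapsed at the office; walked into the ED independently
- Vital signs, physical examination, and laboratory results: ...
"""

# Fixed instruction appended to every case (verbatim from the study protocol).
instruction = ("Please provide an initial differential diagnosis and recommend "
               "the next steps for examination and treatment.")

prompt = case_summary + "\n" + instruction  # identical text submitted to both models
print(prompt)
```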
Based on the diagnostic and treatment suggestions provided by the AI models, the researcher performed the corresponding clinical actions within the Body Interact platform, including examinations, medication administration, and procedural interventions. To validate the practical relevance of these AI-generated recommendations, the resulting outcomes were compared against the platform-defined gold standards. The accuracy, timeliness, feasibility, and clinical coherence of each AI suggestion were evaluated through the simulated physiological responses and system feedback from the virtual patients, thereby assessing the decision completeness and overall practical applicability of the models in a controlled clinical simulation environment.
A dual-layered evaluation system was developed to assess the quality of AI outputs. Two senior clinicians (with over 10 years of experience in urology and obstetrics & gynecology, respectively) independently scored the AI-generated diagnostic and therapeutic content in terms of medical accuracy, logical coherence, and clinical relevance. In addition, two educators (each with over 9 years of teaching experience) reviewed the scoring outcomes. To assess the consistency of expert ratings, interrater reliability was calculated using the ICC (intraclass correlation coefficient), based on scores submitted by the two evaluators. The analysis was performed via the online tool (https://www.statstodo.com/IntraclassCorrelation.php). Parallel to expert evaluation, the built-in scoring features of the AI models were used to assess the completeness and plausibility of the outputs, adding an objective layer to the performance analysis.
To evaluate the educational usability of the AI-generated texts, the Readability Scoring System Plus was employed (https://readabilityformulas.com/readability-scoring-system.php), with two key indicators: the SMOG Index (TSI) and Gunning Fog Index (GFI), which reflect the comprehension difficulty and audience appropriateness of the text, respectively. Additionally, the Grammarly writing assistant was used to assess the generated content in terms of logical coherence, grammatical accuracy, and clarity of expression. These metrics collectively informed a comprehensive judgment of the linguistic quality and pedagogical suitability of AI outputs in medical education contexts.
To comprehensively assess the performance of the AI models in virtual case reasoning and diagnostic support tasks, this study established a multi-dimensional evaluation framework comprising three core components: expert-based manual scoring, AI self-assessment, and textual quality analysis (readability indices such as the GFI and TSI, plus language quality metrics). All components were treated as independent indicators, and no weighting was applied across dimensions, as each was intended to capture a distinct aspect of AI performance from a clinical, linguistic, or textual quality perspective.
A structured scoring rubric was employed to quantitatively evaluate the diagnostic and management suggestions generated by the AI models. The evaluation dimensions included: Timeliness of Response; Accuracy of Primary Diagnosis; Rationality of Recommended Examinations; Standardization of Therapeutic Suggestions; Capacity to Integrate Feedback and Revise Diagnosis within Simulated Scenarios.
The scoring criteria were initially developed by two senior clinicians with over 10 years of experience in urology and obstetrics & gynecology, and subsequently reviewed and refined by two educators with more than 9 years of teaching experience. This ensured that the evaluation standards were both scientifically grounded and pedagogically relevant.
Details of the scoring rubric and evaluation dimensions are provided in Table 1.
Each AI model’s built-in self-assessment functionality was utilized to generate a systematic self-evaluation of its task performance. The assessment dimensions included: Content Accuracy; Completeness of Information; Logical Consistency of the Reasoning Process; Clinical Applicability of Suggestions; Potential Safety Risks; Citation and Use of Medical Knowledge Sources.
The scoring rubric and evaluation metrics for AI self-assessment are detailed in Table 2.
To assess the educational readability and linguistic quality of AI-generated outputs, this study adopted two primary categories of indicators:
Two mainstream readability indices were employed to quantitatively evaluate the difficulty level of the text and determine its suitability for medical learners:
Developed by G. Harry McLaughlin in 1969, the SMOG Index is a widely used tool for estimating the number of years of education required to understand a given English text [45]. It calculates readability based on the number of polysyllabic words (i.e., words with three or more syllables) within a sample of text. A higher SMOG score indicates greater complexity and reduced comprehensibility.
Due to its stability and predictive accuracy, the SMOG Index has been extensively applied in medical communication, public health, and education. In recent years, it has been particularly favored for assessing the readability of health-related websites, patient education materials, and online health information to ensure accessibility for target audiences, especially individuals with limited health literacy [46].
Prior studies have shown that SMOG offers higher consistency and practical relevance than other traditional indices such as FKGL (Flesch-Kincaid Grade Level) and FRE (Flesch Reading Ease), particularly in evaluating public-facing medical texts. Health communication guidelines generally recommend that educational materials maintain a SMOG score below the 6th–8th grade level to enhance clarity and effectiveness [47].
The SMOG Index is computed using the following formula: SMOG grade = 3.1291 + 1.0430 × √(30 × (number of polysyllabic words / number of sentences)).
The Gunning Fog Index is another widely used metric for evaluating textual readability. It defines “complex words” as those containing three or more syllables and assesses the difficulty of comprehension based on average sentence length and the proportion of complex words [48].
In general, a GFI score above 12 suggests that the text is relatively obscure and may be difficult for the general public to understand, being more appropriate for readers with higher educational backgrounds. Compared to other indices, GFI is more sensitive to technical terminology and long sentence structures, making it a valuable tool for evaluating policy documents, academic literature, and other highly specialized content.
The GFI is calculated using the following formula: GFI = 0.4 × [(total words / total sentences) + 100 × (complex words / total words)].
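For readers who wish to reproduce these indices outside the online scoring tool, the following Python sketch implements the two formulas directly. The vowel-group syllable counter is a crude stand-in for the dictionary-based counting used by dedicated readability software, so its scores will only approximate those reported in this study.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude approximation: count groups of consecutive vowels (including y).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_index(text: str) -> float:
    """SMOG = 3.1291 + 1.0430 * sqrt(30 * polysyllables / sentences)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 3.1291 + 1.0430 * math.sqrt(30 * polysyllables / len(sentences))

def gunning_fog(text: str) -> float:
    """GFI = 0.4 * (words per sentence + 100 * complex words / total words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))

sample = ("The patient presented with sudden loss of consciousness. "
          "Capillary glucose measurement revealed significant hypoglycemia.")
print(round(smog_index(sample), 1), round(gunning_fog(sample), 1))
```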
Utilizing the Grammarly platform (https://app.grammarly.com/), this study automatically scored the AI-generated text in terms of grammatical accuracy, clarity of logical structure, and conformity to academic language norms. These metrics were used to assess the linguistic appropriateness of AI outputs in medical education contexts.
To comprehensively evaluate the performance of AI models in virtual case-based diagnostic tasks, this study established a multi-layered data analysis framework covering content categorization, score-based comparison, and consistency analysis. Specific methods are described as follows:
To investigate the alignment of AI-generated outputs with standard clinical reasoning stages, one representative output was selected from multiple rounds of diagnostic responses produced by ChatGPT-4 and DeepSeek-R1 for each of three clinical cases: coma, stroke, and trauma. These outputs were mapped onto the three core stages of clinical reasoning: hypothesis generation, hypothesis testing, and decision-making. For outputs not subjected to this mapping, manual comparison was conducted against Body Interact reference standards to assess consistency in diagnosis, examination, and treatment. All results were consolidated into a structured table for cross-model analysis.
Based on the Body Interact platform’s physiological responses to AI-recommended interventions (e.g., diagnostic tests or treatments), two independent researchers carried out the suggested actions and evaluated each recommendation for feasibility and effectiveness. For each virtual case, AI tasks were repeated three times, and performance was assessed using indicators such as correct diagnosis count, recommendation concordance rate for diagnostic tests, and therapeutic intervention concordance.
To quantify these aspects, diagnostic, examination, and treatment consistency were defined as the percentage agreement between AI-generated outputs and the corresponding reference standards provided by the Body Interact virtual patient platform.
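As a minimal sketch of one plausible way to operationalize this percentage agreement (the exact matching criterion is an assumption, and the item names below are purely illustrative):

```python
def consistency_rate(ai_items: list[str], reference_items: list[str]) -> float:
    """Percentage of AI-suggested items that appear in the platform's reference standard."""
    matched = sum(1 for item in ai_items if item in reference_items)
    return 100.0 * matched / len(ai_items)

# Example: 5 of 6 pooled recommendations matching the reference yields 83.33%.
suggested = ["non-contrast CT", "ECG", "capillary glucose", "CBC", "ABG", "troponin"]
reference = ["non-contrast CT", "ECG", "capillary glucose", "CBC", "ABG"]
print(round(consistency_rate(suggested, reference), 2))
```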
The performance of the two AI models was quantitatively compared across several dimensions: expert human scoring, AI self-evaluation, text readability scores (TSI and GFI), and language quality scores. All scoring data were first subjected to normality testing (Normality and Lognormality Tests). If the data followed a normal distribution, Welch’s t-test was applied to assess mean differences between the models across each indicator. This test accounts for unequal variances between samples and ensures robustness and reliability in statistical inference. For non-normally distributed data, the Mann–Whitney U test was used to compare median differences between groups. In addition, to evaluate the consistency and objectivity of expert ratings, interrater reliability was calculated using the intraclass correlation coefficient (ICC), with 95% confidence intervals computed for each model.
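The analyses themselves were run with the tools named above; the hedged Python sketch below reproduces the same decision path (normality check, then Welch's t-test or the Mann-Whitney U test, then ICC via the third-party pingouin package) using placeholder values rather than study data.

```python
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg  # third-party package providing intraclass_corr

# Illustrative placeholder scores only (not study data): expert totals per output.
gpt4_scores = np.array([87.5, 90.0, 85.0, 88.0, 91.0, 85.5])
deepseek_scores = np.array([89.0, 92.0, 87.5, 90.5, 88.0, 87.5])

# Step 1: normality check (Shapiro-Wilk used here as a stand-in for the reported tests).
is_normal = all(stats.shapiro(x).pvalue > 0.05 for x in (gpt4_scores, deepseek_scores))

# Step 2: Welch's t-test for normally distributed data, Mann-Whitney U otherwise.
if is_normal:
    test = stats.ttest_ind(gpt4_scores, deepseek_scores, equal_var=False)
else:
    test = stats.mannwhitneyu(gpt4_scores, deepseek_scores, alternative="two-sided")
print(test)

# Step 3: interrater reliability (ICC) from the two experts' ratings of the same outputs.
ratings = pd.DataFrame({
    "output": list(range(6)) * 2,
    "rater": ["expert_1"] * 6 + ["expert_2"] * 6,
    "score": [87, 90, 85, 88, 91, 86, 88, 89, 84, 90, 92, 85],
})
icc = pg.intraclass_corr(data=ratings, targets="output", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

The pingouin call returns several ICC variants with 95% confidence intervals; which variant corresponds to the online calculator used in the study is not specified here.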
To evaluate how well AI-generated responses reflect structured clinical thinking, a qualitative mapping of AI outputs to standard clinical reasoning stages was conducted. As shown in Fig. 1, the coma case from ChatGPT-4 is presented as an example to illustrate how the model’s response aligns with key stages of clinical reasoning.
Fig. 1 Mapping of ChatGPT-4 output to clinical reasoning stages (coma case)
This study selected three representative acute medical scenarios (coma, acute ischemic stroke, and trauma) as test cases. For each case type, the two large language models, ChatGPT-4 and DeepSeek-R1, independently conducted three rounds of diagnostic reasoning. The outputs, including preliminary diagnoses, recommended investigations, and treatment suggestions, were compared against the standard feedback provided by the Body Interact virtual simulation system. Results are summarized in Table 3.
As illustrated in the radar chart (Fig. 2A) and the grouped bar chart (Fig. 2B), noticeable differences were observed between ChatGPT-4 and DeepSeek-R1 in terms of consistency across three categories of acute cases: coma, stroke, and trauma. The comparative results cover three core dimensions: diagnostic consistency, examination recommendation consistency, and treatment recommendation consistency.
Fig. 2 A Consistency radar chart of AI models across three types of acute cases. B Bidirectional bar chart of dimension-specific consistency performance of AI models across three case types
As shown in Fig. 2A and 2B, consistency levels across diagnostic reasoning, examination suggestions, and treatment recommendations differed between ChatGPT-4 and DeepSeek-R1. In the diagnosis task, ChatGPT-4 achieved perfect consistency in the stroke case (100%) and high consistency in trauma (83.33%), while performance dropped to 0% in the coma scenario. DeepSeek-R1 showed moderate to high diagnostic consistency across all scenarios, with 50% for coma, 83.33% for stroke, and 100% for trauma. For examination recommendations, both models reached full consistency in stroke and trauma; in the coma case, DeepSeek-R1 maintained 100% consistency, while ChatGPT-4 showed slightly reduced agreement (83.33%). The most noticeable differences were observed in treatment recommendation consistency: ChatGPT-4 achieved 100% in stroke, but only 16.67% and 0% in coma and trauma, respectively. In contrast, DeepSeek-R1 demonstrated relatively stable performance, with 100% in coma, 83.33% in stroke, and 66.67% in trauma.
These findings suggest that while both models can perform well in structured tasks such as examination suggestions, DeepSeek-R1 shows more consistent decision-making in diagnosis and treatment, especially in complex scenarios like coma and trauma.
After completing the diagnostic tasks for the three emergency case types, both AI models (ChatGPT-4 and DeepSeek-R1) conducted self-assessments and were evaluated by clinical experts. The results (Table 4) demonstrate that both models achieved high levels of consistency between their self-assessed and expert-assigned scores. Self-assessment scores were 9.0 ± 0.2 for ChatGPT-4 and 9.1 ± 0.3 for DeepSeek-R1 (P = 0.3837), while expert-assigned scores were 87.8 ± 3.9 and 89.1 ± 3.1, respectively (P = 0.2388), with no statistically significant differences between the two models.
To further assess the consistency of expert ratings, interrater reliability was calculated using the intraclass correlation coefficient (ICC). The ICC for ChatGPT-4 evaluations was 0.8285 (95% CI: 0.6005–0.9308), indicating excellent agreement. For DeepSeek-R1, the ICC was 0.7382 (95% CI: 0.3792–0.883), indicating moderate to good agreement (Table 5). These findings support the reliability and objectivity of the manual scoring process.
To systematically evaluate the textual quality of the AI-generated outputs, we employed the Gunning Fog Index (GFI) and the SMOG Index (TSI) to assess text readability, and used the Grammarly platform to analyze grammatical and logical accuracy, as shown in Fig. 3A–C.
Fig. 3 A Gunning Fog Index (GFI) readability scores. B SMOG Index (TSI) readability scores. C Grammarly grammar and logic evaluation scores
As shown in Fig. 3A, the Gunning Fog Index (GFI) scores revealed that ChatGPT-4 produced significantly simpler text compared to DeepSeek-R1 (p = 0.0009), indicating a lower linguistic complexity. This suggests that ChatGPT-4 may provide more accessible outputs for medical learners, especially in early-stage clinical reasoning instruction. Similarly, Fig. 3B demonstrates that ChatGPT-4 also achieved significantly lower scores on the SMOG Index (TSI) than DeepSeek-R1 (p = 0.0008), further supporting its advantage in producing more readable text. These readability differences suggest that ChatGPT-4 may offer pedagogical benefits by reducing cognitive load in educational materials. In terms of grammatical and logical accuracy, Fig. 3C shows that DeepSeek-R1 slightly outperformed ChatGPT-4 in Grammarly scores (p = 0.0193); however, both models remained within a high-accuracy range. While DeepSeek-R1 demonstrated more formal correctness, ChatGPT-4’s favorable balance between text readability and acceptable grammatical precision may be better suited for formative educational contexts requiring clarity and learner engagement.
With the progressive integration of artificial intelligence into medical education, the “virtual patient + large language model” teaching paradigm offers a novel approach to systematically cultivating clinical reasoning skills. This study simulated three categories of acute medical scenarios to compare the overall performance of ChatGPT-4 and DeepSeek-R1 in virtual case-based reasoning. Through this comparison, we preliminarily explored the educational value of AI-assisted instruction, the performance differences between models, and the potential directions for constructing an educational evaluation framework. Several insightful findings have emerged from this investigation.
This study demonstrates that integrating virtual patient systems with large language models (LLMs) creates an effective platform for simulating clinical reasoning in acute care scenarios. The “AI-assisted diagnosis + virtual feedback” approach offers interactive, real-time training that enhances learners’ clinical decision-making visibility and reflective thinking. This model addresses challenges such as limited clinical rotations and subjective feedback, fostering personalized, problem-oriented reasoning development. Notably, ChatGPT-4’s full consistency in stroke cases supports its utility in teaching time-sensitive protocols, while variable performance in trauma highlights ongoing challenges in complex case management. Additionally, ChatGPT-4’s simpler, more readable outputs may reduce cognitive load, aiding early-stage learners, whereas DeepSeek-R1’s higher grammatical precision benefits formal clinical communication.
The comparison reveals complementary strengths: ChatGPT-4 excels in protocol-driven stroke scenarios, achieving perfect consistency, while DeepSeek-R1 performs more robustly across trauma cases, with higher treatment agreement. These differences likely reflect variations in training data and reasoning architecture, suggesting that no single model is superior across all case types. Both models struggled with the coma case, failing to identify hypoglycemia, indicating limitations in handling ambiguous, context-dependent diagnoses. ChatGPT-4’s poor trauma treatment recommendations raise concerns about its reliability in high-acuity settings, emphasizing the need for cautious application.
Although overall scores by experts and AI self-assessments were high and similar, deeper analysis uncovered structural inconsistencies, particularly in ChatGPT-4’s reasoning in coma and trauma cases. This discrepancy highlights the inadequacy of relying solely on aggregate task scores. A multidimensional scoring system incorporating diagnostic accuracy, recommendation consistency, reasoning coherence, and language readability is essential for a more nuanced evaluation of AI’s educational utility. Such a framework can guide curriculum development by integrating language quality and reasoning feedback, fostering critical thinking and structured clinical reasoning skills.
While this study provides preliminary validation of the application potential of large language models in virtual case-based reasoning instruction, several limitations remain.
First, the sample size of clinical cases was limited and focused primarily on acute scenarios (coma, stroke, trauma), excluding more complex cases such as chronic diseases, multimorbidity, or psychiatric conditions. This constrains the generalizability and comprehensiveness of the evaluation of model reasoning capacity.
Second, current AI models generate knowledge based on pre-trained data and lack the capacity to dynamically incorporate updated clinical guidelines, case variations, or regional healthcare disparities. This can lead to outputs that lag behind current clinical practices, posing particular risks in fast-evolving fields such as emergency and critical care medicine.
Third, this study involved only two researchers and did not include actual students or clinical learners. It did not assess learner-based outcomes, such as diagnostic accuracy, engagement, knowledge gain, or cognitive load, because the focus was on the feasibility and consistency of AI-assisted outputs within the simulation environment rather than on direct educational effectiveness. This design limits the ecological validity and generalizability of the findings.
Additionally, this study employed two researchers independently conducting case simulations to reduce confirmation bias and enhance variability in AI input and evaluation. However, the relatively small number of operators and the limited case diversity, partly due to access restrictions of the Body Interact platform, remain important limitations to be addressed in future work.
Regarding the evaluation framework, this study has not yet systematically assigned weights to different scoring criteria because the primary focus was exploratory feasibility rather than developing a definitive scoring tool. Future work may consider assigning weights based on educational or diagnostic significance to enhance assessment precision.
Model comparison is limited by inherent differences in architecture and backend configurations between ChatGPT-4 (accessed via public API without fine-tuning) and DeepSeek-R1 (integrated through a customized bilingual interface). Despite standardized prompts and resets to minimize variability, these uncontrollable technical differences should be considered when interpreting comparative results.
Future research should expand and deepen in the following directions:
Case diversity: Include a broader range of case types across various clinical specialties and complexity levels to fully test model adaptability and performance under edge conditions.
Learner-centered evaluation: Incorporate diverse learner samples and real educational settings, combining subjective student feedback, interaction frequency, and behavioral indicators to assess AI’s true instructional effectiveness. These follow-up studies are currently in the planning stage, including institutional ethics review (IRB) procedures and scenario-based implementation within existing clinical reasoning curricula.
Explainability and controllability: Explore mechanisms to enhance the interpretability and controllability of AI reasoning processes, improving transparency and safety in educational applications.
Importantly, future work will be guided by relevant theoretical frameworks such as experiential learning and cognitive apprenticeship to ensure that the tool’s design and evaluation align with established educational paradigms.
It is important to emphasize that AI in medical education should serve as an instructional aid, not a replacement for clinical judgment or educator assessment. Its outputs must be used as heuristic tools and reflective resources, while final clinical decisions remain the responsibility of professionally trained healthcare providers, ensuring both educational rigor and patient safety in real-world practice.
This study systematically compared ChatGPT-4 and DeepSeek-R1 across virtual acute care clinical reasoning tasks, revealing distinct strengths and limitations. Both models effectively aligned with clinical reasoning stages but showed variability in diagnostic and treatment consistency: ChatGPT-4 excelled in stroke scenarios but underperformed in coma and trauma, while DeepSeek-R1 demonstrated more stable performance across all cases. Expert and self-assessment scores corroborated the reliability of evaluations. ChatGPT-4’s simpler, more readable language output may be advantageous for early clinical learning, whereas DeepSeek-R1’s formal accuracy appears better suited to supporting clinical precision. These findings underscore the importance of selecting AI models tailored to specific educational goals, balancing reasoning robustness and pedagogical clarity in AI-assisted medical education.
As this study utilized virtual patient scenarios from the Body Interact platform, no real-world patient data were collected or analyzed. Supporting materials are available from the corresponding author upon reasonable request.
1. Padilha JM, et al. The integration of virtual patients into nursing education. Simul Gaming. 2025;56(2):178–91. https://doi.org/10.1177/10468781241300237.
2. Bray K, et al. A pilot study comparing immersive virtual reality simulation and computerized virtual patient simulation in undergraduate medical education. Int J Healthc Simul. 2023. https://doi.org/10.54531/rxca9513.
3. Brown KM, et al. Curricular integration of virtual reality in nursing education. J Nurs Educ. 2023;62(6):364–73. https://doi.org/10.3928/01484834-20230110-01.
4. Sałacińska I, et al. A comparative study of traditional high-fidelity (manikin-based) simulation and virtual high-fidelity simulations concerning their effectiveness and perception. Front Med. 2025;12:1523768. https://doi.org/10.3389/fmed.2025.1523768.
5. Martinez FT, et al. Virtual clinical simulation for training amongst undergraduate medical students: a pilot randomised trial (VIRTUE-Pilot). Cureus. 2023;15(10):e47527. https://doi.org/10.7759/cureus.47527.
6. Watari T, et al. The utility of virtual patient simulations for clinical reasoning education. Int J Environ Res Public Health. 2020;17(15):5325. https://doi.org/10.3390/ijerph17155325.
7. Hirosawa T, et al. Evaluating ChatGPT-4's accuracy in identifying final diagnoses within differential diagnoses compared with those of physicians: experimental study for diagnostic cases. JMIR Form Res. 2024;8:e59267. https://doi.org/10.2196/59267.
8. Sandmann S, et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat Med. 2025. https://doi.org/10.1038/s41591-025-03727-2.
9. Tordjman M, et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. 2025. https://doi.org/10.1038/s41591-025-03726-3.
10. Lower K, et al. ChatGPT-4: transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J Orthop. 2023;57(9):1527–44. https://doi.org/10.1007/s43465-023-00967-7.
11. Hoppe JM, et al. ChatGPT with GPT-4 outperforms emergency department physicians in diagnostic accuracy: retrospective analysis. J Med Internet Res. 2024;26:e56110. https://doi.org/10.2196/56110.
12. Cano-Besquet S, et al. ChatGPT4's diagnostic accuracy in inpatient neurology: a retrospective cohort study. Heliyon. 2024;10(24):e40964. https://doi.org/10.1016/j.heliyon.2024.e40964.
13. Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI. 2024;1(1):AIp2300031. https://doi.org/10.1056/AIp2300031.
14. Hu X, et al. Comparative diagnostic accuracy of ChatGPT-4 and machine learning in differentiating spinal tuberculosis and spinal tumors. Spine J. 2025. https://doi.org/10.1016/j.spinee.2024.12.035.
15. Horiuchi D, et al. Accuracy of ChatGPT generated diagnosis from patient's medical history and imaging findings in neuroradiology cases. Neuroradiology. 2024;66(1):73–9. https://doi.org/10.1007/s00234-023-03252-4.
16. Bicknell BT, et al. ChatGPT-4 omni performance in USMLE disciplines and clinical skills: comparative analysis. JMIR Med Educ. 2024;10:e63430. https://doi.org/10.2196/63430.
17. Tseng LW, et al. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine licensing examinations: cross-sectional study. JMIR Med Educ. 2025;11:e58897. https://doi.org/10.2196/58897.
18. Chen J, Miao C. DeepSeek deployed in 90 Chinese tertiary hospitals: how artificial intelligence is transforming clinical practice. J Med Syst. 2025;49(1):53. https://doi.org/10.1007/s10916-025-02181-4.
19. Moell B, Aronsson FS, Akbar S. Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1. arXiv preprint arXiv:2504.00016. 2025. https://doi.org/10.48550/arXiv.2504.00016.
20. Mikhail D, et al. Performance of DeepSeek-R1 in ophthalmology: an evaluation of clinical decision-making and cost-effectiveness. medRxiv. 2025. https://doi.org/10.1101/2025.02.10.25322041.
21. Zeng D, et al. DeepSeek's low-cost adoption across China's hospital systems: too fast, too soon? JAMA. 2025. https://doi.org/10.1001/jama.2025.6571.
22. Peng Y, et al. From GPT to DeepSeek: significant gaps remain in realizing AI in healthcare. J Biomed Inform. 2025;163:104791. https://doi.org/10.1016/j.jbi.2025.104791.
23. Chen J, et al. STAGER checklist: standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability. iMetaOmics. 2024;1(1):e7. https://doi.org/10.1002/imo2.7.
24. Kim Y, et al. MedExQA: medical question answering benchmark with multiple explanations. arXiv preprint arXiv:2406.06331. 2024. https://doi.org/10.48550/arXiv.2406.06331.
25. Ye J, et al. GMAI-MMBench: a comprehensive multimodal evaluation benchmark towards general medical AI. Adv Neural Inf Process Syst. 2024;37:94327–94427. https://doi.org/10.48550/arXiv.2408.03361.
26. Singhal K, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80. https://doi.org/10.1038/s41586-023-06291-2.
27. Singhal K, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31(3):943–50. https://doi.org/10.1038/s41591-024-03423-7.
28. World Health Organization. WHO handbook for guideline development. 2nd ed. 2014. Available from: https://www.who.int/publications/i/item/9789241548960.
29. World Health Organization. WHO handbook for guideline development: supplement: criteria for use of evidence to inform recommendations in World Health Organization guidelines. 2023. Available from: https://www.who.int/publications/i/item/WHO-SCI-QNS-MST-2023.1.
30. Guyatt G, et al. GRADE guidelines: 1. Introduction-GRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011;64(4):383–94. https://doi.org/10.1016/j.jclinepi.2010.04.026.
31. Schünemann H, et al. The GRADE handbook: handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach. 2013. Available from: https://gdt.gradepro.org/app/handbook/handbook.html.
32. UpToDate. Evidence-based clinical decision support. 2024. Available from: https://www.uptodate.com.
33. Chen L, Jiang WJ, Zhao RP. Application effect of Kolb's experiential learning theory in clinical nursing teaching of traditional Chinese medicine. Digit Health. 2022;8:20552076221138310. https://doi.org/10.1177/20552076221138313.
34. Davitadze M, et al. SIMBA: using Kolb's learning theory in simulation-based learning to improve participants' confidence. BMC Med Educ. 2022;22(1):116. https://doi.org/10.1186/s12909-022-03176-2.
35. Régent A, Thampy H, Singh M. Assessing clinical reasoning in the OSCE: pilot-testing a novel oral debrief exercise. BMC Med Educ. 2023;23(1):718. https://doi.org/10.1186/s12909-023-04668-5.
36. Siegelman J, et al. Assessment of clinical reasoning during a high stakes medical student OSCE. Perspect Med Educ. 2024;13(1):629–34. https://doi.org/10.5334/pme.1513.
37. Cooper N, et al. Consensus statement on the content of clinical reasoning curricula in undergraduate medical education. Med Teach. 2021;43(2):152–9. https://doi.org/10.1080/0142159x.2020.1842343.
38. Singh M, et al. From principles to practice: embedding clinical reasoning as a longitudinal curriculum theme in a medical school programme. Diagnosis (Berl). 2021;9(2):184–94. https://doi.org/10.1515/dx-2021-0031.
39. Wijnen-Meijer M, et al. Implementing Kolb's experiential learning cycle by linking real experience, case-based discussion and simulation. J Med Educ Curric Dev. 2022;9:23821205221091510. https://doi.org/10.1177/23821205221091511.
40. World Health Organization. Clinical management of COVID-19: living guideline, 18 August 2023. 2023. Available from: https://www.who.int/publications-detail-redirect/WHO-2019-nCoV-clinical-2023.2.
41. World Health Organization. Infection prevention and control in the context of coronavirus disease (COVID-19): a living guideline, 10 August 2023. 2023. Available from: https://www.who.int/publications/i/item/WHO-2019-nCoV-IPC-guideline-2023.2.
42. World Health Organization. Strengthening rehabilitation in health emergency preparedness, readiness, response and resilience: policy brief. 2023. Available from: https://www.who.int/publications/i/item/9789240073432.
43. Marx N, et al. 2023 ESC guidelines for the management of cardiovascular disease in patients with diabetes. Eur Heart J. 2023;44(39):4043–140. https://doi.org/10.1093/eurheartj/ehad192.
44. American Diabetes Association. 1. Improving care and promoting health in populations: standards of care in diabetes-2025. Diabetes Care. 2025;48(Supplement 1):S14–26. https://doi.org/10.2337/dc25-S001.
45. Gibson D, et al. Evaluating the efficacy of ChatGPT as a patient education tool in prostate cancer: multimetric assessment. J Med Internet Res. 2024;26:e55939. https://doi.org/10.2196/55939.
46. Mac O, et al. Comparison of readability scores for written health information across formulas using automated vs manual measures. JAMA Netw Open. 2022;5(12):e2246051. https://doi.org/10.1001/jamanetworkopen.2022.46051.
47. Raja H, Lodhi S. Assessing the readability and quality of online information on anosmia. Ann R Coll Surg Engl. 2024;106(2):178–84. https://doi.org/10.1308/rcsann.2022.0147.
48. Onder CE, et al. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci Rep. 2024;14(1):243. https://doi.org/10.1038/s41598-023-50884-w.
We gratefully acknowledge the Body Interact platform for granting permission and providing technical support, which was essential for the smooth implementation of this study. We also thank CNSKnowall for providing material support for data visualization, which enhanced the clarity and impact of our findings.
This research was funded by the Bio & Medical Technology Development Program of the National Research Foundation (NRF), which is funded by the Korean government (MSIT) (No. RS-2023-00236157).
Zhao Luo, Yu Seob Shin and Xianxin Li contributed equally.
Guihua Chen and Chuan Lin contributed equally to this work and are considered co-first authors.
Department of Urology, Shenzhen Qianhai Taikang Hospital, Shenzhen, Guangdong, China
Zhao Luo & Xianxin Li
Department of Biomedical Sciences, Institute for Medical Science, Jeonbuk National University Medical School, Jeonju, Jeollabuk-do, Republic of Korea
Guihua Chen
School of Food and Biological Engineering, Luohe Food Engineering Vocational University, Luohe, Henan, China
Guihua Chen & Lijie Zhang
Department of Obstetrics and Gynecology, Beijing Anzhen Nanchong Hospital, Captial Medical University & Nanchong Central Hospital, Nanchong, Sichuan, China
Chuan Lin
Department of Urology, Jeonbuk National University Medical School, Jeonju, Jeollabuk-do, Republic of Korea
Yu Seob Shin
Research Institute of Clinical Medicine of Jeonbuk National University-Biomedical Research Institute of Jeonbuk National University Hospital, Jeonju, Jeollabuk-do, Republic of Korea
Yu Seob Shin
[First Author 1 G]: Conceptualization, Data Curation, Virtual case operation (Body Interact platform), AI model execution (ChatGPT-4 and DeepSeek-R1), Writing – Original Draft, Visualization. [First Author 2 C]: Virtual case operation (Body Interact platform), AI model execution (ChatGPT-4 and DeepSeek-R1), Development of scoring criteria and evaluation guidelines, Methodology, Validation, Data analysis, Writing – review and editing. [Second Author L]: Data curation, Writing – review and editing. [Co-Corresponding Author 1 Z]: Development of scoring criteria and evaluation guidelines, Data analysis, Supervision, Project Administration, Writing – Review & Editing. [Co-Corresponding Author 2 Y]: Supervision, Project Administration, Writing – Review & Editing. [Co-Corresponding Author 3 X]: Supervision, Project Administration, Writing – Review & Editing.
Correspondence to Zhao Luo, Yu Seob Shin or Xianxin Li.
This study was based entirely on virtual clinical cases and did not involve human participants, human data, or human tissue. Therefore, ethical approval and informed consent were not required.
Not applicable.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Chen, G., Lin, C., Zhang, L. et al. Virtual case reasoning and AI-assisted diagnostic instruction: an empirical study based on body interact and large language models. BMC Med Educ 25, 1493 (2025). https://doi.org/10.1186/s12909-025-07872-7