AI beats primary care doctors in simulated diagnosis study using images and ECGs
by Hugo Francisco de Souza · News-Medical
A state-aware medical AI system interpreted images, ECGs, and clinical documents during live diagnostic chats, outperforming primary care physicians in simulated consultations while raising urgent questions about how such tools should be tested before real-world care.
Study: Advancing conversational diagnostic AI with multimodal reasoning. Image Credit: Explode / Shutterstock
The study found that the novel multi-modal model outperformed primary care physicians (PCPs) on 29 of 32 evaluation axes, including diagnostic accuracy and consultation-quality metrics such as empathy. These findings suggest that multi-modal AI could eventually support remote healthcare delivery, pending real-world validation.
Multi-modal Clinical AI Background
Health systems worldwide increasingly report morbidity risks associated with delayed access to care, a pattern experts attribute to mounting pressures from clinician burnout, care fragmentation, and an aging global population. While generative AI has shown potential to mitigate these challenges, early medical large language model (LLM) implementations were largely limited to text-only chatbots.
Reviews in the field highlight that this “text-only” constraint deviates from standard clinical practice, in which much diagnostic information can be derived from history-taking and physical examination, often supplemented by visual data.
These limitations are particularly apparent in remote care settings, where patients are reported to frequently exchange multi-modal information, such as smartphone-captured skin photographs, electrocardiogram (ECG) tracings, or scanned laboratory reports, with their clinicians.
AMIE Multi-modal Reasoning Study Design
The present study aimed to address this persistent medical AI limitation by developing a multi-modal system that could emulate the structured reasoning of experienced clinicians by strategically requesting and interpreting these visual artifacts during a live diagnostic consultation.
The system, named "AMIE," was built on the Gemini 2.0 Flash foundation model and enhanced by a novel "state-aware" inference-time reasoning framework. This custom architecture was designed to let the model maintain an internal "patient state" that tracks each patient’s Chief Complaint, History of Present Illness, and prioritized knowledge gaps.
During a consultation, the framework directs the conversation through three sequential phases:
History-taking, in which the system iteratively updates a patient's profile and identifies information gaps. Furthermore, the model determines if and when to request multi-modal artifacts to enhance its understanding of the patient’s clinical history.
Diagnosis and management, during which the system generates a Differential Diagnosis (DDx) report that provides patient-facing explanations and management guidance for the most relevant identified conditions.
Follow-up, in which the AI processes and clarifies any concerns the patient may have and communicates the final management plan, ensuring patient or caregiver clarity.
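The three phases and the tracked patient state can be pictured as a small state machine. The Python sketch below is purely illustrative and is not AMIE's implementation: the names `PatientState`, `Phase`, `step`, and `VISUAL_GAP_KEYWORDS`, and the keyword rule for deciding when to request an artifact, are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = 1
    DIAGNOSIS_AND_MANAGEMENT = 2
    FOLLOW_UP = 3


@dataclass
class PatientState:
    """Hypothetical container for the fields the article says are tracked."""
    chief_complaint: str = ""
    history_of_present_illness: list = field(default_factory=list)
    knowledge_gaps: list = field(default_factory=list)  # highest priority first
    artifacts: list = field(default_factory=list)       # images, ECGs, documents


# Assumption: a gap mentioning one of these words is best closed by an artifact.
VISUAL_GAP_KEYWORDS = ("rash", "lesion", "ecg", "lab report")


def step(state: PatientState, phase: Phase):
    """Return (phase after this step, action taken in this step)."""
    if phase is Phase.HISTORY_TAKING:
        if state.knowledge_gaps:
            gap = state.knowledge_gaps[0]
            wants_artifact = any(k in gap.lower() for k in VISUAL_GAP_KEYWORDS)
            if wants_artifact and not state.artifacts:
                return phase, f"request artifact: {gap}"
            return phase, f"ask about: {gap}"
        # No open gaps remain, so move on to the DDx report.
        return Phase.DIAGNOSIS_AND_MANAGEMENT, "present DDx report"
    if phase is Phase.DIAGNOSIS_AND_MANAGEMENT:
        return Phase.FOLLOW_UP, "address concerns and confirm plan"
    return Phase.FOLLOW_UP, "close consultation"


state = PatientState(
    chief_complaint="itchy rash on forearm",
    knowledge_gaps=["appearance of the rash", "duration of symptoms"],
)
phase, action = step(state, Phase.HISTORY_TAKING)
print(phase.name, "|", action)
```

In this toy version the system stays in history-taking until the prioritized gap list is empty, which mirrors the article's description of iteratively updating the patient profile before moving to diagnosis.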
Model performance was validated using an Objective Structured Clinical Examination (OSCE) format adapted for synchronous chat, in which AMIE was evaluated against 19 primary care physicians (PCPs). The patient cohort comprised 25 validated patient-actors who participated in 210 consultations, two per scenario.
The examination scenarios were grounded in real-world datasets: the Skin Condition Image Network (SCIN) for dermatology, PTB-XL for ECG tracings, and curated clinical documents.
Performance was assessed by 18 specialist physicians using the Multi-modal Understanding and Handling (MUH) rubric, the Practical Assessment of Clinical Examination Skills (PACES), and the General Medical Council Patient Questionnaire (GMCPQ).
Diagnostic Accuracy and Consultation Findings
The OSCE assessment data indicated that the multi-modal AMIE demonstrated significant performance advantages over PCPs in both objective accuracy and subjective quality measures, outperforming them on 29 of the 32 evaluated metrics.
On diagnostic accuracy, statistical modeling confirmed that the AI's DDx lists were more accurate and comprehensive than those of human physicians (P < 0.001). Accuracy was analyzed for lists containing 1 to 10 diagnoses, although neither AMIE nor the PCPs always submitted 10 differential diagnoses. Across all modalities, the AI's top-k accuracy exceeded PCP performance at every list length.
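Top-k accuracy here means the share of cases in which the correct diagnosis appears among the first k entries of a DDx list. A minimal Python sketch follows, assuming exact string matching between diagnoses (a simplification; how matches were judged in the study is not described here) and using made-up example data. The function name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(ddx_lists, ground_truths, k):
    """Fraction of cases whose true diagnosis is in the first k DDx entries."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ground_truths or len(ddx_lists) != len(ground_truths):
        raise ValueError("need one non-empty DDx list per ground-truth diagnosis")
    hits = sum(
        1 for ddx, truth in zip(ddx_lists, ground_truths) if truth in ddx[:k]
    )
    return hits / len(ground_truths)


# Invented example cases, not study data. Lists may be shorter than k.
ddx_lists = [
    ["eczema", "psoriasis", "tinea"],
    ["atrial fibrillation"],
    ["gout", "cellulitis"],
    ["anemia", "hypothyroidism"],
]
ground_truths = [
    "psoriasis",
    "atrial fibrillation",
    "septic arthritis",
    "hypothyroidism",
]

for k in (1, 2, 10):
    print(k, top_k_accuracy(ddx_lists, ground_truths, k))
```

Because slicing a short list simply returns the entries it has, a clinician who submits fewer than 10 diagnoses is scored on what was submitted, which is one straightforward way to handle the incomplete lists the article mentions.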
In a separate automated ablation analysis across Clinical Document scenarios, the AI's top-1 accuracy reached 0.98, compared to 0.89 for the "Vanilla" baseline Gemini 2.0 Flash model, indicating that state-aware reasoning improved performance beyond the foundation model alone.
On multi-modal reasoning and overall robustness, specialists using the MUH rubric favored the AI on 7 of 9 metrics. AMIE proved particularly robust to variations in image quality, with low-quality images causing a larger drop in diagnostic performance for PCPs than for AMIE. In this simulated evaluation, AMIE also showed fewer, less consequential artifact-related misreporting events than PCPs (P < 0.001).
Furthermore, patient-actors rated the AI significantly higher across 10 of 11 GMCPQ criteria, including showing empathy and listening. In multi-modal tasks, the AI was rated more favorably for its ability to explain findings (P < 0.01).
Conversational Diagnostic AI Implications
The present study uses data representative of real-world clinical scenarios to highlight how integrating perceptual grounding with state-aware reasoning enables cutting-edge AI models to achieve performance that matches or surpasses that of PCPs in these simulated diagnostic settings.
Despite these results, the researchers caution that the study is an exploratory investigation, not a randomized clinical trial. Future work must evaluate the system's performance, safety, reliability, impact on clinical workflows, and health equity in real-world environments before clinical deployment can be considered.
Journal reference:
- Saab, K., et al. (2026). Advancing conversational diagnostic AI with multimodal reasoning. Nature Medicine. DOI: 10.1038/s41591-026-04371-0. https://www.nature.com/articles/s41591-026-04371-0