AI diagnostic reasoning nears physician performance

News-Medical

A recent Perspective article published in Science explores whether advanced artificial intelligence (AI) systems are approaching physician-level reasoning, while considering the implications and safety of their integration into clinical practice.

Progress in AI and diagnostic reasoning

Large language models (LLMs) are AI algorithms trained on substantial amounts of data to learn patterns that are then used to generate human-like responses. Reasoning models add to these capabilities by evaluating possible approaches before generating a response, thereby mimicking structured cognitive processing.

Numerous studies have evaluated healthcare applications of LLMs, including their performance on medical licensing examinations and other relevant assessments. These evaluations often extend beyond standard tests to include simulated clinical scenarios such as diagnostic case vignettes, specialty-specific exams, and problem-solving tasks designed to approximate clinical decision-making processes.

Discussing findings from Brodeur et al., the authors note that OpenAI's GPT-4 achieved an exact or near-exact diagnosis in up to 73% of cases, while the company's first reasoning model, o1-preview, exceeded that performance with 88.6% accuracy on clinicopathological cases.
How AI is being integrated into clinical practice

It is important to emphasize that AI systems are not being proposed as replacements for physicians. Rather, research in this area considers LLMs and other advanced models as collaborative tools, with clinicians providing accountability, oversight, and contextual judgment.

However, the authors also note that some well-defined healthcare tasks may ultimately be performed more effectively by AI systems operating independently. AI applications in healthcare have the potential to significantly reduce the human and financial costs associated with diagnostic errors, delays, and limited access.

The Medical Holistic Evaluation of Language Models (Med-HELM) defines five healthcare domains for AI use, including administrative workflows, clinical note generation, clinical decision support, patient communication, and medical research assistance. Across these domains, AI has evolved to analyze patient records, monitor clinical encounters, and interact with predictive models, thereby minimizing delays, reducing diagnostic errors, and improving access to care.

Nevertheless, it remains unclear whether advanced AI models operate more effectively on narrowly defined tasks or when deployed independently across healthcare. As clinicians increasingly integrate AI tools into their practice, some already doing so without institutional oversight, randomized trials are urgently needed to establish whether these models actually improve real-world outcomes.

Requiring clinical certification of AI models has also been proposed as a way to expand the role of AI in medicine while ensuring transparency and accountability. The proposed pathway would gradually advance AI systems from medical knowledge assistants to supervised clinical practice and, potentially, to broader autonomous responsibilities. Robust monitoring frameworks can complement these initiatives by tracking the safety, efficiency, and cost of AI clinical decision support systems.

Despite these efforts, AI has seen limited real-world success, in part because strong benchmark performance translates poorly into practice and clinical benefits remain unproven. Although newer multimodal systems can now integrate images, audio, and video, many medical AI evaluations remain focused on text-only tasks, which limits their ability to fully capture the complexity of clinical decision-making.

The authors also highlight concerns surrounding the rapid deployment of consumer-facing health AI systems. In one example, an independent evaluation found that a publicly available health-focused AI tool under-triaged more than half of emergency cases presented to it.

Beyond diagnostic accuracy, the Perspective emphasizes that clinical AI systems must demonstrate real-world effectiveness, equity, safety, transparency, and accountability before they can be widely adopted. The authors also note that previous healthcare algorithms have exhibited racial bias and that biased AI systems can negatively affect clinician decision-making.