A recent study led by researchers from Mass General Brigham has shed light on the accuracy of ChatGPT in clinical decision making.
The research revealed that the large language model (LLM) AI chatbot achieved approximately 72% accuracy in overall clinical decisions, encompassing tasks from generating potential diagnoses to reaching final diagnoses and making care management choices.
The study included various medical specialties and was conducted in both primary care and emergency settings.
Comparable to a Newly Graduated Medical Professional
Lead author Marc Succi, MD, said that ChatGPT's performance was comparable to that of a newly graduated medical professional, highlighting the potential of LLMs to serve as effective tools in medicine.
“No real benchmarks exist, but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident. This tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy,” Succi said in a statement.
Despite rapid advances in artificial intelligence, the extent to which LLMs can contribute to comprehensive clinical care has remained largely unexplored.
This study sought to investigate ChatGPT’s capabilities in advising and making clinical decisions across a complete patient encounter, including diagnostic workups, clinical management, and final diagnoses.
The research involved presenting segments of standardized clinical scenarios to ChatGPT, simulating real-world patient interactions. ChatGPT was tasked with generating differential diagnoses based on initial patient information, followed by making management decisions and arriving at a final diagnosis through successive iterations of data input.
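To make this successive-prompting workflow concrete, here is a minimal Python sketch. It is an illustration only, not the study's protocol or code: the query_chatgpt helper, the prompt wording, and the three-stage structure are hypothetical stand-ins for whatever interface and standardized vignettes the researchers actually used.

```python
# Hypothetical sketch only -- not the study's actual code or prompts.
# query_chatgpt is a placeholder stub standing in for a real chat-completion API.

def query_chatgpt(conversation):
    """Placeholder: replace with a call to an actual chat-completion API."""
    return "(model response would appear here)"

def run_vignette(initial_presentation, workup_results, full_findings):
    """Walk one standardized clinical vignette through three decision stages."""
    conversation = []  # running transcript so later prompts build on earlier answers

    # Stage 1: differential diagnoses from the initial, limited presentation.
    conversation.append({"role": "user", "content":
        initial_presentation + "\n\nList the most likely differential diagnoses."})
    differential = query_chatgpt(conversation)
    conversation.append({"role": "assistant", "content": differential})

    # Stage 2: management decisions once workup results are revealed.
    conversation.append({"role": "user", "content":
        "New information: " + workup_results +
        "\n\nWhat further workup and management steps would you take?"})
    management = query_chatgpt(conversation)
    conversation.append({"role": "assistant", "content": management})

    # Stage 3: a single final diagnosis once the full picture is available.
    conversation.append({"role": "user", "content":
        "Additional findings: " + full_findings +
        "\n\nGive the single most likely final diagnosis."})
    final_diagnosis = query_chatgpt(conversation)

    return differential, management, final_diagnosis
```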
The researchers discovered that ChatGPT’s accuracy averaged around 72%, with its highest performance observed in making final diagnoses at 77%. However, its accuracy was lower in making differential diagnoses (60%) and clinical management decisions (68%).
Notably, the study revealed that ChatGPT’s responses did not demonstrate gender bias and that its performance was consistent across primary and emergency care scenarios.
Succi emphasized that ChatGPT struggled with differential diagnosis, an essential aspect of medicine that requires weighing possible conditions and courses of action when faced with limited patient information. This, he noted, points to the strengths of physicians in the early stages of patient care, where generating a list of possible diagnoses is pivotal.