ChatGPT demonstrated “impressive accuracy” when used during clinical decision-making, although it was less effective when making management decisions and differential diagnoses, according to researchers.
“Our paper comprehensively assesses decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario, from differential diagnosis all the way through testing, diagnosis and management,” Marc D. Succi, MD, senior author and associate chair of innovation and commercialization at Mass General Brigham Radiology, said in a press release.
Succi and colleagues tested the efficacy of the large language model (LLM) artificial intelligence (AI) chatbot using 36 clinical vignettes across various specialties. The researchers first asked ChatGPT to develop differential diagnoses based on initial information, including the patient’s age, gender and symptoms. They then provided the chatbot with additional information and asked it to make management decisions and a final diagnosis.
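For context, a minimal sketch of how that kind of staged prompting might look in practice, assuming the OpenAI Python SDK; the vignette text, model name and prompts below are illustrative placeholders, not the study’s actual materials:

```python
# Minimal sketch (not the authors' code) of staged prompting of ChatGPT
# across a clinical vignette, mirroring the workflow the study describes.
# Assumes the OpenAI Python SDK (>=1.0) and an illustrative vignette.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = {
    "presentation": "45-year-old woman with 2 days of pleuritic chest pain and dyspnea.",
    "workup": "D-dimer elevated; CT pulmonary angiogram shows a segmental embolus.",
}

# Running conversation so later questions can build on earlier answers.
history = [{"role": "system", "content": "You are assisting with a clinical vignette."}]

def ask(prompt: str) -> str:
    """Append a question to the conversation and return the model's reply."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Stage 1: differential diagnosis from the initial presentation only.
ddx = ask(f"{vignette['presentation']}\nList a differential diagnosis.")

# Stage 2: final diagnosis and management after additional information.
plan = ask(f"Additional findings: {vignette['workup']}\n"
           "What is the final diagnosis, and how would you manage this patient?")
```

Because each stage reuses the running conversation, the final-diagnosis and management questions see the earlier differential, loosely mirroring how information accumulates over a patient visit.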
Succi and colleagues found that ChatGPT was 71.7% (95% CI, 69.3-74.1) accurate overall in clinical decision-making, 76.9% (95% CI, 67.8-86.1) accurate when making final diagnoses and 60.3% (95% CI, 54.2-66.6) accurate in making differential diagnoses.
“This tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision-making with impressive accuracy,” Succi said.
According to the researchers, limitations to ChatGPT include “possible model hallucinations and the unclear composition of ChatGPT’s training data set.”
Healio spoke with Succi to learn more about the clinical implications of the study for primary care physicians, the potential downsides from AI hallucinations and more.
Healio: Can you briefly describe the findings, and if there were any findings that stood out to you?
Succi: This study comprehensively evaluates the performance of ChatGPT as one would see a patient in the clinical setting. There have been some studies saying it scored well on a board exam or something similar, which is very different from how you actually practice medicine.
What we wanted to do was run it through four key components of a patient visit. So, that means coming up with a differential diagnosis, figuring out what diagnostic test to order, figuring out the final diagnosis and managing the patient.
To do this, we used 36 clinical vignettes from the Merck Clinical Manual. They span all specialties, and most are bread-and-butter clinical cases.
It performed very well with 72% overall accuracy. That would be a passing score. We can take guesses, but that’s about at the level of an intern who, let’s say, just graduated medical school.
But what was new about this study is it was segmented by those four components. So, it did the best in coming up with a final diagnosis and was 77% accurate. But it did the worst in making that initial differential diagnosis, when you have minimal information — no lab tests, etc. It scored only 60%.
What this tells us is that, although GPT can do well on a final diagnosis, we’ve got to be smart about it and recognize that there are different components to seeing a patient, and it doesn’t perform equally well in all of them. The hope would be that, when these models are built, you can use this study to critically evaluate how well they do in those different areas. For example, we need to work on how well it does in differential diagnosis to make it useful in a clinical setting.
Healio: Patients are not all that comfortable with AI-led visits. Even if ChatGPT has “impressive” clinical decision-making ability, how do you get patients comfortable with the idea that a diagnosis or treatment plan originated from AI?
Succi: My one pushback would be that I don’t think it’ll come from AI. It’ll come from AI with a physician or health care provider in the loop. That’s the key thing that Mass General Brigham is doing. A health care provider is always in the loop. You’re not autonomously getting an AI diagnosis. You’re getting an AI diagnosis that was checked, supervised, etc., by a physician, which is very similar to how things function now. We have interns and residents at academic centers who might do the bulk of the work. But ultimately there’s an attending physician who is making the decisions and integrating different sources of information from the AI or the resident.
To get patients comfortable, I think No. 1 is reassurance that there’s still a doctor behind all this, and that’s who’s making the final decision. No. 2 is doing studies like this and critically evaluating how good it is, not simply taking a company’s word for it.
We need to study it ourselves, and we need to have access to the data from these companies. So, seeing the institutions that are using this technology study how good it is and help it get better, I think, is reassuring and could be reassuring to patients.
Ultimately, we need patient buy-in. So, we need to listen to their concerns. Maybe they have a particular reason why they’re uncomfortable. Maybe it’s a privacy thing. Those concerns need to be addressed — they’re definitely challenges.
Healio: Can you give an example of possible drawbacks from an AI hallucination?
Succi: You can imagine asking an LLM, “What diagnostic test should we order?” and it suggests 30 extra tests that a physician would not necessarily suggest, and those could be considered misalignment, hallucinations or inaccuracies. All of a sudden, your visit that costs “X” dollars now costs 10 times that.
I think in general, not specific to the study, but the risk of hallucinations and incorrect information provided by AI is that it could paradoxically increase the cost of the health care system or increase wait times. But that’s why we’re evaluating it in the first place over broad vignettes to see how good it actually is.
Healio: What should a practicing physician be thinking about in light of this information?
Succi: I think in general, AI and more specifically LLMs are another tool in the toolbox for providing good patient care. So, just like a stethoscope allows you to collect more information from the heart and integrate that into your diagnosis, AI makes you more efficient. But it still needs supervision. A human walking is very slow and inefficient, but a human on a bicycle is the most efficient animal. AI is a bicycle for the health care provider, and that’s how it should be looked at. For health care providers reading this, AI won’t replace doctors, but doctors who use AI will replace doctors who don’t use AI. So, it’s one of those things where we need to broadly and from a cultural perspective adopt it, study it and integrate it into our daily practice, because it’s coming sooner or later.