OpenAI’s ChatGPT recently demonstrated its potential in imaging referrals when prompted with specific clinical presentations, leading experts to suggest that the widely used chatbot could serve as a valuable complementary tool in busy medical settings.
Specifically, the most recent version of ChatGPT, ChatGPT-4, provided imaging recommendations and generated radiology referrals based on clinical notes from cases that presented to an emergency department. Measured against the American College of Radiology Appropriateness Criteria (ACR AC), the chatbot responded appropriately in approximately 95% of cases, and experts also rated its referrals on clarity, clinical relevance and differential diagnosis.
Experts involved in the study suggested that with proper training, large language models like ChatGPT could help address the issue of inappropriate imaging referrals, which are common in emergency departments.
“Radiologists rely on the information provided in referral notes to determine and interpret imaging examinations. High-quality referrals are crucial for selecting the appropriate examination and imaging protocol,” explained corresponding author Yiftach Barash, MD, from the Department of Diagnostic Imaging at Chaim Sheba Medical Center in Tel Hashomer, Israel, and colleagues. “The quality of referral notes directly affects the accuracy of interpretation, the clinical relevance of radiology reports, and the confidence of interpreting radiologists.”
To gain a better understanding of ChatGPT’s usefulness in imaging referrals, the group retrospectively extracted five consecutive emergency department clinical notes for each of eight conditions: pulmonary embolism, obstructing kidney stones, acute appendicitis, diverticulitis, small bowel obstruction, acute cholecystitis, acute hip fracture and testicular torsion. After inputting the case notes into ChatGPT, the group instructed the language model to recommend the most suitable imaging examination and protocol and to generate radiology referrals.
Two radiologists compared ChatGPT’s imaging recommendations to the ACR AC and graded its referrals on a scale of 1 to 5. They found that all of ChatGPT-4’s imaging recommendations aligned with the ACR AC, and only 5% of its protocol suggestions were discrepant. The referrals earned impressive mean scores of 4.6 and 4.8 for clarity, 4.5 and 4.4 for clinical relevance, and 4.9 from both readers for differential diagnosis.
While the chatbot performed consistently across multiple measures, the experts flagged one shortcoming: the omission of timeframes from its recommendations. Despite having access to this information in the clinical notes, ChatGPT-4 failed to reference symptom onset in 35% of its recommendations, even in cases where the timing of the recommended study was critical.
Overall, the authors were encouraged by the language model’s performance and expressed optimism about its potential to enhance clinical efficiency. However, they emphasized that radiologists should remain “mindful of the challenges and risks” associated with adopting such technology.
The study abstract is available in the Journal of the American College of Radiology.