AUTHOR=Ríos-Hoyo Alejandro, Shan Naing Lin, Li Anran, Pearson Alexander T., Pusztai Lajos, Howard Frederick M.
TITLE=Evaluation of large language models as a diagnostic aid for complex medical cases
JOURNAL=Frontiers in Medicine
VOLUME=11
YEAR=2024
URL=https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1380148
DOI=10.3389/fmed.2024.1380148
ISSN=2296-858X
ABSTRACT=Background: Large language models (LLMs) have recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals. Objective: To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case. Design: Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital (MGH) Case Records, and differential diagnoses were generated by OpenAI’s GPT3.5 and GPT4 models. Results: The mean number of diagnoses provided was 16.77 by the MGH case discussants, 30 by GPT3.5, and 15.45 by GPT4 (p < 0.0001). GPT4 more frequently included the correct diagnosis among its top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 provided a differential list more similar to that of the case discussants than GPT3.5 did (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential was correlated with PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25–1.56 for GPT3.5; OR 1.25, 95% CI 1.13–1.40 for GPT4), but not with disease incidence. Conclusions and relevance: GPT4 was frequently able to generate a differential diagnosis list containing the correct diagnosis, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.
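The abstract compares differential diagnosis lists using the Jaccard Similarity Index, i.e. the size of the intersection of two sets divided by the size of their union. A minimal sketch of that computation is below; the diagnosis names are illustrative placeholders, not cases or results from the study.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of diagnoses."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention: two empty lists are treated as zero overlap
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical lists for illustration only (not from the MGH cases)
discussant = ["sarcoidosis", "lymphoma", "tuberculosis"]
model = ["lymphoma", "tuberculosis", "histoplasmosis", "lupus"]

# 2 shared diagnoses out of 5 distinct ones overall
print(jaccard_similarity(discussant, model))  # → 0.4
```

On this scale, the reported values (0.22 for GPT4 versus 0.12 for GPT3.5) indicate that GPT4's lists shared roughly twice the proportion of diagnoses with the discussants' lists.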