Tekchandani, Niharika, Mukherjee, Anurup, Poonthottam, Nandakumar and Boussios, Stergios (2025) Comparative analysis of large language models in dermatological diagnosis: an evaluation of diagnostic accuracy. Cureus, 17 (9). Article Number e92089. E-ISSN 2168-8184. (doi:10.7759/cureus.92089) (KAR id:111418)
PDF (Publisher pdf, 610kB)
Language: English
This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.7759/cureus.92089
Abstract
The diagnostic process in dermatology often hinges on visual recognition and clinical pattern matching, making it an attractive field for the application of artificial intelligence (AI). Large language models (LLMs) like ChatGPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 Flash offer new possibilities for augmenting diagnostic reasoning, particularly in rare or diagnostically challenging cases. This study evaluates and compares the diagnostic capabilities of these LLMs based solely on clinical presentations extracted from rare dermatological case reports. Fifteen published case reports of rare dermatological conditions were retrospectively selected. Key clinical features, excluding laboratory or histopathological findings, were input into each of the three LLMs using standardized prompts. Each model produced a most probable diagnosis and a list of differential diagnoses. The outputs were evaluated for top-match accuracy and whether the correct diagnosis was included in the differential list. Performance was analyzed descriptively, with visual aids (heatmaps, bar charts) illustrating comparative outcomes. ChatGPT-4o and Claude 3.7 Sonnet each correctly identified the top diagnosis in 10 (66.7%) out of 15 cases, compared to 8 (53.3%) out of 15 for Gemini 2.0 Flash. When differential-only matches were included, both ChatGPT-4o and Claude 3.7 achieved a total coverage of 86.7%, while Gemini 2.0 reached 60.0%. Notably, all models failed to identify certain diagnoses, including blastic plasmacytoid dendritic cell neoplasm and amelanotic melanoma, underscoring the potential risks associated with plausible but incorrect outputs. This study demonstrates that ChatGPT-4o and Claude 3.7 Sonnet show promising diagnostic potential in rare dermatologic cases, outperforming Gemini 2.0 Flash in both accuracy and diagnostic breadth. While LLMs may assist in clinical reasoning, particularly in settings with limited dermatology expertise, they should be used as adjunctive tools, not substitutes, for clinician judgment. Further refinement, validation, and integration into clinical workflows are warranted. [Abstract copyright: Copyright © 2025, Tekchandani et al.]
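The two evaluation metrics described in the abstract (top-match accuracy and differential-list coverage) are simple per-case tallies. The sketch below is a minimal illustration of that scoring, not code from the paper: the per-case records are hypothetical placeholders, and only the aggregate counts are chosen to mirror the figures reported for ChatGPT-4o and Claude 3.7 Sonnet.

```python
# Illustrative sketch (not from the paper): scoring model outputs against
# reference diagnoses as the abstract describes. Case data are hypothetical;
# only the aggregate counts mirror the reported results.

def score_model(cases):
    """cases: list of (reference_dx, top_dx, differential_list) tuples."""
    n = len(cases)
    # Top-match accuracy: the model's single most probable diagnosis
    # equals the reference diagnosis.
    top = sum(ref == top_dx for ref, top_dx, _ in cases)
    # Coverage: the reference diagnosis appears either as the top match
    # or anywhere in the differential list.
    covered = sum(ref == top_dx or ref in diff for ref, top_dx, diff in cases)
    return top / n, covered / n

# 15 cases: 10 top-match hits, 3 differential-only hits, 2 misses
# -> 66.7% top-match accuracy and 86.7% coverage, as reported for
# ChatGPT-4o and Claude 3.7 Sonnet.
cases = (
    [("dx", "dx", [])] * 10          # top-match hits
    + [("dx", "other", ["dx"])] * 3  # differential-only hits
    + [("dx", "other", [])] * 2      # misses
)
top_acc, coverage = score_model(cases)
print(f"top-match accuracy: {top_acc:.1%}, coverage: {coverage:.1%}")
```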
| Item Type: | Article |
|---|---|
| DOI/Identification number: | 10.7759/cureus.92089 |
| Uncontrolled keywords: | rare skin diseases; diagnostic accuracy; large language models; artificial intelligence; dermatology diagnosis |
| Subjects: | R Medicine |
| Institutional Unit: | Schools > Kent and Medway Medical School |
| Former Institutional Unit: | There are no former institutional units. |
| Funders: | University of Kent (https://ror.org/00xkeyj56) |
| SWORD Depositor: | JISC Publications Router |
| Depositing User: | JISC Publications Router |
| Date Deposited: | 02 Oct 2025 09:05 UTC |
| Last Modified: | 03 Oct 2025 10:49 UTC |
| Resource URI: | https://kar.kent.ac.uk/id/eprint/111418 (The current URI for this page, for reference purposes) |