
Diagnostic accuracy in dry eye: Insights into clinical and artificial intelligence limitations: Limitations of diagnostic accuracy in dry eye

  • Germán Mejía-Salgado
  • William Rojas-Carabali
  • Carlos Cifuentes-González
  • María Andrea Bernal-Valencia
  • Paola Saboya-Galindo
  • Jaime Soto-Ariño
  • Valentina Dumar-Kerguelen
  • Guillermo Marroquín-Gómez
  • Martha Lucía Moreno-Pardo
  • Juliana Tirado-Ángel
  • Anat Galor
  • Alejandra de-la-Torre

Research output: Contribution to Journal; Research Article; peer-review

Abstract

Purpose: To evaluate the agreement and performance of four large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4.0, Leny-ai, and MediSearch, in diagnosing and classifying Dry Eye Disease (DED), compared to clinician judgment and Dry Eye Workshop-II (DEWS-II) criteria.

Methods: A standardized prompt was developed that incorporated retrospective clinical and symptomatic data from patients with suspected DED referred to a dry eye clinic. LLMs were evaluated for diagnosis (DED vs. no DED) and classification (aqueous-deficient, evaporative, mixed-component). Agreement was assessed using Cohen's kappa (Cκ) and Fleiss' kappa (Fκ). Balanced accuracy, sensitivity, specificity, and F1 score were calculated.

Results: Among 338 patients (78.6 % female, mean age 53.2 years), clinicians diagnosed DED in 300, and DEWS-II criteria identified 234. LLMs showed high agreement with clinicians for DED diagnosis (93 %–99 %, Cκ: 0.81–0.86). Subtype agreement was lower (aqueous-deficient: 0 %–18 %, evaporative: 4 %–80 %, mixed-component: 22 %–92 %; Fκ: −0.20 to −0.10). Diagnostic balanced accuracy was 48 %–56 %, with high sensitivity (93 %–99 %) but low specificity (0 %–16 %). Subtype balanced accuracy and F1 score ranged from 33 %–81 % and 0 %–71 %, respectively. Compared to DEWS-II, agreement for DED diagnosis remained high (96 %–99 %) but with weaker Cκ (0.52–0.58). Subtype agreement was again low (aqueous-deficient: 0 %–20 %, evaporative: 9 %–68 %, mixed-component: 16 %–75 %; Fκ: −0.09 to −0.02). Diagnostic balanced accuracy was 49 %–56 %, sensitivity 97 %–99 %, and specificity 5 %–16 %. Subtype balanced accuracy ranged from 43 % to 56 %, and F1 score from 0 % to 68 %.

Conclusion: LLMs showed strong agreement and high sensitivity for DED diagnosis but limited specificity and poor subtype classification, mirroring clinical challenges and highlighting risks of overdiagnosis.
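The performance measures named in the Methods can all be derived from a single 2×2 table of model output against a reference standard. The following is a minimal sketch of those standard definitions, not the authors' analysis code. The `Confusion` class, the function names, and the counts in `example` are hypothetical and chosen only for illustration; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """2x2 table of model output vs. a reference standard (hypothetical helper)."""
    tp: int  # model: DED,    reference: DED
    fp: int  # model: DED,    reference: no DED
    fn: int  # model: no DED, reference: DED
    tn: int  # model: no DED, reference: no DED


def sensitivity(c: Confusion) -> float:
    # Share of reference-positive cases the model labels positive.
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    # Share of reference-negative cases the model labels negative.
    return c.tn / (c.tn + c.fp)


def balanced_accuracy(c: Confusion) -> float:
    # Mean of sensitivity and specificity; robust to class imbalance.
    return (sensitivity(c) + specificity(c)) / 2


def f1_score(c: Confusion) -> float:
    # Harmonic mean of precision and recall for the positive class.
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


def cohen_kappa(c: Confusion) -> float:
    # Observed agreement corrected for agreement expected by chance.
    n = c.tp + c.fp + c.fn + c.tn
    observed = (c.tp + c.tn) / n
    expected = (
        (c.tp + c.fp) * (c.tp + c.fn) + (c.fn + c.tn) * (c.fp + c.tn)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative counts only (assumed, NOT study data): a rater that labels
# almost every patient in an imbalanced sample as positive.
example = Confusion(tp=294, fp=35, fn=6, tn=3)

if __name__ == "__main__":
    print(f"sensitivity       = {sensitivity(example):.3f}")
    print(f"specificity       = {specificity(example):.3f}")
    print(f"balanced accuracy = {balanced_accuracy(example):.3f}")
    print(f"F1 score          = {f1_score(example):.3f}")
    print(f"Cohen's kappa     = {cohen_kappa(example):.3f}")
```

With counts like these, sensitivity is high while specificity is close to zero, so balanced accuracy sits near 50 %. This is the general pattern that arises when a rater rarely returns a negative label in a sample dominated by positive cases.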

Original language: English (US)
Article number: 102509
Journal: Contact Lens and Anterior Eye
State: Accepted/In press - 2025

All Science Journal Classification (ASJC) codes

  • Ophthalmology
  • Optometry
