
Diagnostic accuracy in dry eye: Insights into clinical and artificial intelligence limitations: Limitations of diagnostic accuracy in dry eye

  • Germán Mejía-Salgado
  • William Rojas-Carabali
  • Carlos Cifuentes-González
  • María Andrea Bernal-Valencia
  • Paola Saboya-Galindo
  • Jaime Soto-Ariño
  • Valentina Dumar-Kerguelen
  • Guillermo Marroquín-Gómez
  • Martha Lucía Moreno-Pardo
  • Juliana Tirado-Ángel
  • Anat Galor
  • Alejandra de-la-Torre

Research output: Contribution to journal, Research article, peer-reviewed

Abstract

Purpose: To evaluate the agreement and performance of four large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4.0, Leny-ai, and MediSearch, in diagnosing and classifying Dry Eye Disease (DED), compared to clinician judgment and Dry Eye Workshop-II (DEWS-II) criteria.

Methods: A standardized prompt was developed that incorporated retrospective clinical and symptomatic data from patients with suspected DED referred to a dry eye clinic. LLMs were evaluated for diagnosis (DED vs. no DED) and classification (aqueous-deficient, evaporative, mixed-component). Agreement was assessed using Cohen's kappa (Cκ) and Fleiss' kappa (Fκ). Balanced accuracy, sensitivity, specificity, and F1 score were calculated.

Results: Among 338 patients (78.6 % female, mean age 53.2 years), clinicians diagnosed DED in 300, and DEWS-II criteria identified 234. LLMs showed high agreement with clinicians for DED diagnosis (93 %–99 %, Cκ: 0.81–0.86). Subtype agreement was lower (aqueous-deficient: 0 %–18 %, evaporative: 4 %–80 %, mixed-component: 22 %–92 %; Fκ: −0.20 to −0.10). Diagnostic balanced accuracy was 48 %–56 %, with high sensitivity (93 %–99 %) but low specificity (0 %–16 %). Subtype balanced accuracy and F1 score ranged from 33 %–81 % and 0 %–71 %, respectively. Compared to DEWS-II, agreement for DED diagnosis remained high (96 %–99 %) but with weaker Cκ (0.52–0.58). Subtype agreement was again low (aqueous-deficient: 0 %–20 %, evaporative: 9 %–68 %, mixed-component: 16 %–75 %; Fκ: −0.09 to −0.02). Diagnostic balanced accuracy was 49 %–56 %, sensitivity 97 %–99 %, and specificity 5 %–16 %. Subtype balanced accuracy ranged from 43 % to 56 %, and F1 score from 0 % to 68 %.

Conclusion: LLMs showed strong agreement and high sensitivity for DED diagnosis but limited specificity and poor subtype classification, mirroring clinical challenges and highlighting risks of overdiagnosis.
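
As a reading aid for the measures named in the Methods, the following minimal Python sketch computes sensitivity, specificity, balanced accuracy, F1 score, raw agreement, and Cohen's kappa from a 2×2 table. The function name `binary_metrics` and the counts passed to it are hypothetical illustrations, not the study's data or the authors' code. The example is chosen only to show how high raw agreement and high sensitivity can coexist with a balanced accuracy near 50 % when the reference standard labels most patients as positive.

```python
def binary_metrics(tp, fp, fn, tn):
    """Agreement and performance metrics from a 2x2 table (hypothetical helper).

    tp/fn: reference-positive cases the rater called positive/negative.
    fp/tn: reference-negative cases the rater called positive/negative.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Raw (observed) agreement: share of cases where both labels match.
    observed = (tp + tn) / n
    # Chance-expected agreement from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    # Cohen's kappa: agreement beyond chance, scaled by its maximum.
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "observed_agreement": observed,
        "kappa": kappa,
    }


# Hypothetical counts (NOT from the study): 100 reference-positive and
# 10 reference-negative cases, with a rater that labels almost everyone positive.
m = binary_metrics(tp=95, fp=9, fn=5, tn=1)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, raw agreement is high because positives dominate the sample, while balanced accuracy weights the two classes equally and therefore exposes the low specificity.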

Original language: American English
Article number: 102509
Journal: Contact Lens and Anterior Eye
DOI
Status: In press - 2025

ASJC Scopus subject areas

  • Ophthalmology
  • Optometry
