Abstract
Purpose: To evaluate the agreement and performance of four large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4.0, Leny-ai, and MediSearch, in diagnosing and classifying Dry Eye Disease (DED), compared to clinician judgment and Dry Eye Workshop-II (DEWS-II) criteria.

Methods: A standardized prompt was developed that incorporated retrospective clinical and symptomatic data from patients with suspected DED referred to a dry eye clinic. LLMs were evaluated for diagnosis (DED vs. no DED) and classification (aqueous-deficient, evaporative, mixed-component). Agreement was assessed using Cohen's kappa (Cκ) and Fleiss' kappa (Fκ). Balanced accuracy, sensitivity, specificity, and F1 score were calculated.

Results: Among 338 patients (78.6 % female, mean age 53.2 years), clinicians diagnosed DED in 300, and DEWS-II criteria identified 234. LLMs showed high agreement with clinicians for DED diagnosis (93 %–99 %, Cκ: 0.81–0.86). Subtype agreement was lower (aqueous-deficient: 0 %–18 %, evaporative: 4 %–80 %, mixed-component: 22 %–92 %; Fκ: −0.20 to −0.10). Diagnostic balanced accuracy was 48 %–56 %, with high sensitivity (93 %–99 %) but low specificity (0 %–16 %). Subtype balanced accuracy and F1 score ranged from 33 % to 81 % and from 0 % to 71 %, respectively. Compared to DEWS-II, agreement for DED diagnosis remained high (96 %–99 %) but with weaker Cκ (0.52–0.58). Subtype agreement was again low (aqueous-deficient: 0 %–20 %, evaporative: 9 %–68 %, mixed-component: 16 %–75 %; Fκ: −0.09 to −0.02). Diagnostic balanced accuracy was 49 %–56 %, sensitivity 97 %–99 %, and specificity 5 %–16 %. Subtype balanced accuracy ranged from 43 % to 56 % and F1 score from 0 % to 68 %.

Conclusion: LLMs showed strong agreement and high sensitivity for DED diagnosis but limited specificity and poor subtype classification, mirroring clinical challenges and highlighting risks of overdiagnosis.
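The pattern reported above, high sensitivity with low specificity and a balanced accuracy near 50 %, follows directly from how these metrics are defined on a 2×2 confusion matrix. The sketch below is a minimal illustration using the standard textbook formulas. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the study.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Standard binary-classification metrics from 2x2 confusion-matrix counts.

    tp/fn: reference-positive cases the rater labeled positive/negative.
    fp/tn: reference-negative cases the rater labeled positive/negative.
    Assumes both reference classes are non-empty (no zero-division guard).
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of precision and recall

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (NOT the study's data):
    # a rater that labels almost every case as positive.
    for name, value in binary_metrics(tp=95, fn=5, fp=18, tn=2).items():
        print(f"{name}: {value:.3f}")
```

With these illustrative counts the rater catches nearly every positive case but almost never rules one out, so balanced accuracy sits close to 0.5 even though raw percentage agreement is high.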
| Original language | American English |
|---|---|
| Article number | 102509 |
| Journal | Contact Lens and Anterior Eye |
| DOI | |
| Status | In press - 2025 |
ASJC Scopus subject areas
- Ophthalmology
- Optometry
Fingerprint
Dive into the research topics of 'Diagnostic accuracy in dry eye: Insights into clinical and artificial intelligence limitations: Limitations of diagnostic accuracy in dry eye'. Together they form a unique fingerprint.