Ramacciani Isemann, C., Burresi, S., Innocenti, S., Righi, L. (2026). Can AI improve triage quality? A preliminary assessment of ChatGPT performance in evaluating triage decisions. In International Journal of Health Supplements (p. 43). Roma: iEditore.
Can AI improve triage quality? A preliminary assessment of ChatGPT performance in evaluating triage decisions
Christian Ramacciani Isemann (Writing – Original Draft Preparation); Simona Burresi (Resources); Lorenzo Righi (Writing – Review & Editing)
2026-01-01
Abstract
This study explored the potential role of ChatGPT as a retrospective evaluator of triage appropriateness according to the Tuscan Triage System. Fifty real-world Emergency Department clinical scenarios were independently assessed by certified triage experts and compared with evaluations generated by ChatGPT (GPT-4o). Agreement between the model and the expert-defined reference standard was analyzed using Cohen’s kappa, precision, recall, F1-score, sensitivity, and specificity. Exact agreement was observed in 46% of cases, with under-triage discrepancies occurring more frequently than over-triage. Performance was higher in high-complexity scenarios, while lower agreement was observed in moderate- and low-complexity cases. Although the current results do not support autonomous use of large language models for clinical triage, the findings suggest a potential future role in retrospective quality assurance and audit processes. Further studies are required to improve performance and validate the application of artificial intelligence tools in emergency nursing practice.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11365/1320254
Warning: the data displayed have not been validated by the university.
