Ramacciani Isemann, C., Burresi, S., Innocenti, S., Righi, L. (2026). Can AI improve triage quality? A preliminary assessment of ChatGPT performance in evaluating triage decisions. In International Journal of Health Supplements (p. 43). Roma: iEditore.

Can AI improve triage quality? A preliminary assessment of ChatGPT performance in evaluating triage decisions

Christian Ramacciani Isemann (Writing – Original Draft Preparation);
Simona Burresi (Resources);
Lorenzo Righi (Writing – Review & Editing)
2026-01-01

Abstract

This study explored the potential role of ChatGPT as a retrospective evaluator of triage appropriateness according to the Tuscan Triage System. Fifty real-world Emergency Department clinical scenarios were independently assessed by certified triage experts and compared with evaluations generated by ChatGPT (GPT-4o). Agreement between the model and the expert-defined reference standard was analyzed using Cohen’s kappa, precision, recall, F1-score, sensitivity, and specificity. Exact agreement was observed in 46% of cases, with under-triage discrepancies occurring more frequently than over-triage. Performance was higher in high-complexity scenarios, while lower agreement was observed in moderate- and low-complexity cases. Although the current results do not support autonomous use of large language models for clinical triage, the findings suggest a potential future role in retrospective quality assurance and audit processes. Further studies are required to improve performance and validate the application of artificial intelligence tools in emergency nursing practice.
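
The agreement analysis described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and the triage codes are invented toy values, not study data. It assumes integer priority codes in which a lower number means higher urgency, so under-triage is counted when the model assigns a higher number than the expert reference.

```python
from collections import Counter


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: share of cases with identical labels.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def triage_agreement(reference, predicted):
    """Exact agreement, under-triage and over-triage rates, and kappa.

    Assumption: lower code = more urgent, so predicted > reference
    means the case was rated as less urgent than the experts judged.
    """
    n = len(reference)
    pairs = list(zip(reference, predicted))
    return {
        "exact_agreement": sum(r == p for r, p in pairs) / n,
        "under_triage": sum(p > r for r, p in pairs) / n,
        "over_triage": sum(p < r for r, p in pairs) / n,
        "kappa": cohen_kappa(reference, predicted),
    }


# Invented toy codes for ten scenarios (NOT data from the study).
reference_codes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
model_codes = [1, 2, 3, 3, 4, 3, 4, 5, 5, 4]

result = triage_agreement(reference_codes, model_codes)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because triage codes are ordinal, a weighted kappa that penalizes larger disagreements more heavily could be a reasonable alternative; the abstract does not say which variant was used.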

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1320254