

Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude

Lim, B.; Seth, I.; Maxwell, M.; Cuomo, R.; Ross, R.J.; Rozen, W.M.
2025-01-01

Abstract

Background: Large language models (LLMs) have demonstrated transformative potential in health care. They can enhance clinical and academic medicine by facilitating accurate diagnoses, interpreting laboratory results, and automating documentation processes. This study evaluates the efficacy of LLMs in generating surgical operation reports and discharge summaries, focusing on accuracy, efficiency, and quality.

Methods: This study assessed the effectiveness of three leading LLMs (ChatGPT-4, ChatGPT-4o, and Claude) using six prompts and analyzing their responses for readability and output quality, validated by plastic surgeons. Readability was measured with the Flesch–Kincaid Grade Level, Flesch Reading Ease score, and Coleman–Liau Index, while reliability was evaluated using the DISCERN score. A paired two-tailed t-test (p < 0.05) was used to assess the statistical significance of differences in these metrics, and in the time taken to generate operation reports and discharge summaries, relative to the authors' results.

Results: Table 3 shows statistically significant differences in readability between ChatGPT-4o and Claude across all metrics, while ChatGPT-4 and Claude differ significantly in the Flesch Reading Ease and Coleman–Liau indices. Table 6 reveals extremely low p-values across BL, IS, and MM for all models, with Claude consistently outperforming both ChatGPT-4 and ChatGPT-4o. Additionally, Claude generated documents the fastest, completing tasks in approximately 10 to 14 seconds. These results suggest that Claude not only excels in readability but also demonstrates superior reliability and speed, making it an efficient choice for practical applications.

Conclusion: The study highlights the importance of selecting appropriate LLMs for clinical use. Integrating these LLMs can streamline healthcare documentation, improve efficiency, and enhance patient outcomes through clearer communication and more accurate medical reports.

Level of Evidence V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
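For orientation, the readability metrics and significance test named in the abstract follow standard published formulas. The sketch below is not the authors' code; it assumes the raw counts (words, sentences, syllables, letters) have already been extracted from each generated document, and it computes the paired t-statistic by hand from per-document score differences:

```python
import math
from statistics import mean, stdev

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Standard Flesch Reading Ease formula; higher scores mean easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Flesch-Kincaid Grade Level: approximate US school grade required.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    # Coleman-Liau uses average letters (L) and sentences (S) per 100 words.
    L = letters / words * 100
    S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8

def paired_t_statistic(a: list[float], b: list[float]) -> float:
    # Paired two-tailed t-test statistic on per-document score differences
    # (e.g. one model's readability scores vs. another's, same prompts).
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

In practice the t-statistic would be passed to a t-distribution CDF (e.g. `scipy.stats.ttest_rel`, which also returns the two-tailed p-value) rather than computed manually; the explicit version above only illustrates what the test compares.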
Lim, B., Seth, I., Maxwell, M., Cuomo, R., Ross, R.J., Rozen, W.M. (2025). Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude. AESTHETIC PLASTIC SURGERY [10.1007/s00266-025-04842-8].


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1294414