Luschi, A., Tognetti, L., Cartocci, A., Cinotti, E., Rubegni, G., Calabrese, L., et al. (2025). Design and development of a systematic validation protocol for synthetic melanoma images for responsible use in medical artificial intelligence. Biocybernetics and Biomedical Engineering, 45(4), 608-616 [10.1016/j.bbe.2025.09.001].
Design and development of a systematic validation protocol for synthetic melanoma images for responsible use in medical artificial intelligence
Alessio Luschi; Linda Tognetti; Alessandra Cartocci; Elisa Cinotti; Giovanni Rubegni; Laura Calabrese; Martina D’onghia; Martina Dragotto; Gabriele Cevenini; Pietro Rubegni; Ernesto Iadanza
2025-01-01
Abstract
Malignant melanoma is the deadliest form of skin cancer, and artificial intelligence could help address its diagnostic challenges. Generative Adversarial Networks (GANs) can generate synthetic dermoscopic images to augment limited real datasets, but the lack of standardised validation protocols limits both the reliability of the models and clinicians’ trust in them. This study aims to design and develop a systematic validation protocol that combines quantitative metrics with qualitative expert assessments to evaluate the realism, fidelity, diversity, and usefulness of synthetic dermoscopic melanoma images. A StyleGAN2 model, designed and trained in a previous study, was selected for its superior quantitative performance and used to generate 25 synthetic melanoma images, which were matched with 25 real images. A panel of 17 dermoscopists assessed the images on a 7-point Likert scale across multiple qualitative attributes (real vs. synthetic, skin texture, visual realism, and confidence) and through pattern analysis. Accuracy, sensitivity, specificity, Fleiss’ Kappa, and Krippendorff’s Alpha were calculated to analyse inter-rater agreement and evaluation outcomes. Accuracy in classifying images as real or synthetic was moderate (64 %), with a sensitivity of 73 % and a specificity of 56 %, and inter-rater concordance over the qualitative attributes was poor. Synthetic images obtained superior scores in medium visual and overall realism and in confidence level, while pigment network patterns were recognised at a frequency comparable to that of real images. The proposed holistic validation protocol can effectively estimate the quality level of synthetic dermoscopic images, regardless of the architecture of the generating model, offering an objective and reliable evaluation tool. Qualitative evaluations remain crucial to ensure the safe deployment of such images in clinical settings.
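The rater statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the toy labels, and the choice of which class counts as "positive" are all assumptions made for illustration, and the abstract does not state which class the reported sensitivity refers to. Krippendorff's Alpha is omitted for brevity.

```python
from typing import Sequence


def binary_metrics(truth: Sequence[int], pred: Sequence[int], positive: int = 1):
    """Return (accuracy, sensitivity, specificity) for one rater.

    `positive` is the label treated as the positive class (an assumption
    here). Inputs must contain at least one positive and one negative item.
    """
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    tn = sum(t != positive and p != positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    accuracy = (tp + tn) / len(truth)
    sensitivity = tp / (tp + fn)   # share of positives correctly identified
    specificity = tn / (tn + fp)   # share of negatives correctly identified
    return accuracy, sensitivity, specificity


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of ratings falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy data only: 1 = synthetic, 0 = real (hypothetical coding).
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    rater = [1, 1, 1, 0, 1, 1, 0, 0]
    print(binary_metrics(truth, rater))

    # Three hypothetical raters, two items, categories (real, synthetic).
    print(fleiss_kappa([[2, 1], [1, 2]]))
```

In a study of this shape, `binary_metrics` would be applied per rater or to pooled ratings, while `fleiss_kappa` would take one row per image with the counts of the 17 raters' answers in each category.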
https://hdl.handle.net/11365/1299734
