Microbiome data analysis is essential for understanding the role of microbial communities in human health. However, limited data availability often hinders research progress, and synthetic data generation offers a promising solution to this problem. This study explores the use of machine learning (ML) to enrich an unbalanced dataset of microbial operational taxonomic unit (OTU) counts from 148 samples belonging to 61 patients. In detail, 34 samples are from 16 patients with adenomatous polyps (AP), while 114 samples are from 46 patients with colorectal cancer (CRC). AP and CRC samples were synthesised using the Synthetic Data Vault Python library, employing a Gaussian Copula synthesiser. Subsequently, the quality of the synthesised data was evaluated using a logistic regression model in parallel with an optimised support vector machine (polynomial kernel). Data quality is considered good when neither algorithm can discriminate between real and synthetic data, that is, when both show low accuracy, F1 score, and precision. Additional statistical tests were employed to confirm the similarity between real and synthetic data. After data validation, layer-wise relevance propagation (LRP) was performed on a deep learning classifier to extract, from the generated dataset, the OTU features most important for discriminating between CRC patients and those affected by AP. Using the extracted features, which correspond to unique bacterial taxa, ML classifiers were trained and tested to estimate how well these microorganisms distinguish AP from CRC samples. The resulting simplified version of the original OTU table opens up opportunities for further investigation, especially extensive data synthesis: the condensed data can be explored and augmented to uncover insights and patterns that are not readily apparent in the original, more complex form.
Digging deeper into the simplified data may help clarify the biological or ecological processes reflected in the OTU data. More broadly, the synergy of ML and synthetic data enrichment holds promise for advancing microbiome research: this approach enhances classification accuracy and reveals hidden microbial markers that could prove valuable in clinical practice as diagnostic and prognostic tools.
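To make the synthesis step more concrete, the following is a minimal sketch of the general idea behind a Gaussian copula sampler, written in plain Python. It is not the Synthetic Data Vault API and not the authors' pipeline: the function names (`to_normal_scores`, `correlation_matrix`, `cholesky`, `sample_synthetic`) are hypothetical, the toy OTU counts are invented for illustration, and the marginals are assumed to be modelled by the empirical distribution of each column.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def to_normal_scores(column):
    """Map a column to standard-normal scores through its ranks.

    Ties (common in sparse OTU counts) are broken by original order.
    """
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def correlation_matrix(cols):
    """Pearson correlation between columns of normal scores."""
    d, n = len(cols), len(cols[0])
    means = [sum(c) / n for c in cols]
    cov = [[sum((cols[a][k] - means[a]) * (cols[b][k] - means[b])
                for k in range(n)) / n
            for b in range(d)] for a in range(d)]
    return [[cov[a][b] / (cov[a][a] * cov[b][b]) ** 0.5
             for b in range(d)] for a in range(d)]


def cholesky(m, jitter=1e-9):
    """Lower-triangular factor of m; jitter guards near-singular input."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = max(m[i][i] + jitter - s, jitter) ** 0.5
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def sample_synthetic(rows, n_new, seed=0):
    """Draw n_new synthetic rows that mimic the columns' dependence."""
    rng = random.Random(seed)
    cols = [list(c) for c in zip(*rows)]
    sorted_cols = [sorted(c) for c in cols]
    low = cholesky(correlation_matrix([to_normal_scores(c) for c in cols]))
    d, n = len(cols), len(rows)
    synthetic = []
    for _ in range(n_new):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlate the independent draws with the Cholesky factor.
        latent = [sum(low[i][k] * z[k] for k in range(i + 1))
                  for i in range(d)]
        row = []
        for j, value in enumerate(latent):
            u = STD_NORMAL.cdf(value)          # back to the unit interval
            idx = min(int(u * n), n - 1)       # empirical inverse CDF
            row.append(sorted_cols[j][idx])
        synthetic.append(row)
    return synthetic


# Invented toy table: 5 samples x 3 OTUs (not data from the study).
real = [[120, 3, 0], [95, 10, 1], [40, 25, 0], [10, 60, 7], [5, 80, 12]]
synthetic = sample_synthetic(real, n_new=4, seed=42)
```

Because this sketch inverts the empirical distribution of each column, every synthetic value is one that was observed in the real data; only the combinations across OTUs are new. A production synthesiser would typically fit parametric marginals instead, which allows unseen values.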
Rotelli, A., Salman, A., Di Gloria, L., Nannini, G., Niccolai, E., Luschi, A., et al. (2025). Analysis of Microbiome for AP and CRC Discrimination. BIOENGINEERING, 12(7) [10.3390/bioengineering12070713].
Analysis of Microbiome for AP and CRC Discrimination
Salman, Ali; Luschi, Alessio; Iadanza, Ernesto
2025-01-01
| File | Size | Format | |
|---|---|---|---|
| bioengineering-12-00713-v2_compressed.pdf (open access; Description: Article; Type: Publisher's PDF; License: Creative Commons) | 922.29 kB | Adobe PDF | View/Open |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11365/1297754
