
Deep Learning Across Domains: The Role of Representations in Biochemical and Biological Data

Alessia Lucia Prete
2026-03-20

Abstract

Deep Learning (DL) has fundamentally reshaped the landscape of Artificial Intelligence, moving beyond classic structured tasks to tackle complex, unstructured data in fields like Natural Language Processing and Computer Vision. Recent architectural breakthroughs, such as Transformers and Graph Neural Networks (GNNs), have empowered models to effectively process sequential and relational data, creating opportunities for significant advancements in data-rich disciplines like biochemistry and molecular biology. The challenge in these fields, however, is not merely data volume, but managing its heterogeneity, its high dimensionality, and the need to generalize from limited labeled samples. This thesis addresses a crucial bottleneck in applying DL to the life sciences: the optimal strategy for encoding complex molecular and biological entities. The work is unified by the central theme that computational success is contingent upon the strategic adaptation of both the model architecture and the data representation (structural vs. symbolic) to the specific problem domain. The central contributions of this work are delivered across three distinct, but methodologically interconnected, areas: (i) we introduce and validate novel graph-based neural frameworks for learning structural feature vectors directly from complex molecular graphs, significantly improving the hierarchical classification and automated annotation of natural products; (ii) we systematically validate the transferability of large, pre-trained Transformer models from human language to the SMILES chemical language, demonstrating a powerful, resource-efficient methodology for molecular property prediction; (iii) we develop and apply autoencoder architectures to integrate structural and physicochemical properties, creating a latent feature space that enables the data-driven prioritization of materials for green chemistry and sustainable technologies.
In conclusion, this thesis establishes and validates a robust computational pipeline grounded in the principle of representation-aware deep learning. The findings provide concrete evidence that substantial progress in chemical discovery and biological inference requires a deliberate methodological choice: optimizing the interplay between the data representation, the model, and the specific domain task.
Tutor: Cicaloni Vittoria
Cycle: XVIII
Prete, A.L. (2026). Deep Learning Across Domains: The Role of Representations in Biochemical and Biological Data.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1310736