
Deep Learning Across Domains: The Role of Representations in Biochemical and Biological Data

Alessia Lucia Prete
2026-03-20

Abstract

Deep Learning (DL) has fundamentally reshaped the landscape of Artificial Intelligence, moving beyond classic structured tasks to tackle complex, unstructured data in fields like Natural Language Processing and Computer Vision. Recent architectural breakthroughs, such as Transformers and Graph Neural Networks (GNNs), have empowered models to effectively process sequential and relational data, creating opportunities for significant advancements in data-rich disciplines like biochemistry and molecular biology. The challenge in these fields, however, is not merely data volume, but managing its heterogeneity, its high dimensionality, and the need to generalize from limited labeled samples. This thesis addresses a crucial bottleneck in applying DL to the life sciences: the optimal strategy for encoding complex molecular and biological entities. The work is unified by the central theme that computational success is contingent upon the strategic adaptation of both the model architecture and the data representation (structural vs. symbolic) to the specific problem domain. The central contributions of this work are delivered across three distinct, but methodologically interconnected, areas: (i) we introduce and validate novel graph-based neural frameworks for learning structural feature vectors directly from complex molecular graphs, significantly improving the hierarchical classification and automated annotation of natural products; (ii) we systematically validate the transferability of large, pre-trained Transformer models from human language to the SMILES chemical language, demonstrating a powerful, resource-efficient methodology for molecular property prediction; (iii) we develop and apply autoencoder architectures to integrate structural and physicochemical properties, creating a latent feature space that enables the data-driven prioritization of materials for green chemistry and sustainable technologies.
In conclusion, this thesis establishes and validates a robust computational pipeline grounded in the principle of representation-aware deep learning. The findings provide concrete evidence that substantial progress in chemical discovery and biological inference requires a deliberate methodological choice: optimizing the interplay between the data representation, the model, and the specific domain task.
Tutor: Cicaloni Vittoria
Cycle: XVIII
Prete, A.L. (2026). Deep Learning Across Domains: The Role of Representations in Biochemical and Biological Data.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1310736