Machine learning methods for the prediction of translation speed

Giacomini, Giorgia

doi:10.25434/giorgia-giacomini_phd2021

Ribosomes carry out protein synthesis from mRNA templates by a highly regulated process called translation. Translational control plays a key role in the regulation of gene expression under both physiological and pathological conditions. How translation is regulated under different conditions, and which factors most strongly influence translation speed, remain open questions in molecular biology. In recent years, the ribosome profiling technique (Ribo-seq) has emerged as a powerful method for globally monitoring the translation process in vivo at single-nucleotide resolution [1]. Ribo-seq is based on deep sequencing of the mRNA fragments covered by ribosomes, called Ribosome Protected Fragments (RPFs). Sequencing the RPFs makes it possible to record the precise position of the ribosomes at the moment translation was halted. However, the full power of this technique is limited by notable weaknesses (e.g. a low signal-to-noise ratio) that affect the reproducibility of Ribo-seq experiments [2]. The aim of this thesis is to develop a new statistical approach, integrated with machine learning methodologies, for a comprehensive understanding of the information contained in ribosome profiling data and for the prediction of translation speed. Our data analysis approach consists of a systematic comparison of Ribo-seq profiles from several publicly available Ribo-seq datasets generated in different laboratories and at different times, but under the same experimental conditions. In the E. coli case study, the analysis of 3588 Ribo-seq profiles across eight independent datasets revealed that only 40 profiles are significantly reproducible. Identifying the reproducible Ribo-seq profiles allowed us to build consensus sequences that highlight the nucleotides located within fast and slow regions. The density of RPFs along an mRNA reflects the different amounts of time that ribosomes spend translating each part of the ORF.
Therefore, slow regions, which are densely occupied by ribosomes, and fast regions, which are characterized by few ribosomes, can be readily identified by Ribo-seq. We analysed the occurrences of nucleotides, dinucleotides, and codons in the consensus sequences in order to assess whether the sequence contains signals that could modulate the speed of translation. To this end, we implemented different neural network architectures that classify the translation speed of the previously identified consensus sequences with high accuracy. Despite the limited amount of data, the results clearly demonstrate that the models can extract useful information. Furthermore, we used the significantly reproducible profiles as a reference for comparative analyses aimed at detecting whether changes in experimental conditions (heat shock stress and amino acid starvation) affect the reproducibility of our Ribo-seq workflow and thus influence translational control. A preliminary analysis of human Ribo-seq data suggests that our method provides a rich resource for further in-depth studies of the translational control of gene expression in all kinds of Ribo-seq datasets, including those from highly differentiated organisms such as humans.
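
The occurrence analysis of nucleotides, dinucleotides, and codons described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names, the example sequence, the DNA alphabet, the use of overlapping dinucleotides, and the assumption that the sequence starts in frame are all hypothetical choices made only for illustration.

```python
from collections import Counter


def count_kmers(seq, k, step=1):
    """Count substrings of length k, moving `step` positions at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))


def sequence_features(seq):
    """Return nucleotide, dinucleotide, and codon counts for one sequence.

    Assumptions (not taken from the thesis): dinucleotides are counted
    with overlap, and codons are read in frame from the first position.
    """
    return {
        "nucleotides": count_kmers(seq, 1),
        "dinucleotides": count_kmers(seq, 2),
        "codons": count_kmers(seq, 3, step=3),
    }


# Made-up sequence used only to show the call; it is not real data.
example = "ATGGCTGCTAAA"
features = sequence_features(example)
```

Counts of this kind could then be normalised by sequence length and compared between the fast and slow consensus sequences, or used as input features for a classifier.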

Giacomini, G. (2021). Machine learning methods for the prediction of translation speed [10.25434/giorgia-giacomini_phd2021].