Benetti, E. (2021). Identifying host genetics risk factors for COVID-19 from Exome Sequencing [10.25434/benetti-elisa_phd2021].

Identifying host genetics risk factors for COVID-19 from Exome Sequencing

Benetti, Elisa
2021-01-01

Abstract

The quote “Not everything that can be counted counts and not everything that counts can be counted”, often attributed to Albert Einstein, expresses to some extent the challenges we face when dealing with the human genome. The unprecedented amount of data derived from sequencing experiments forces us to find what counts within an overwhelming number of genetic variants. In the present thesis, we address this issue in the context of coronavirus disease 2019 (COVID-19), an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). While most infected individuals experience only mild or no symptoms, severe cases can rapidly progress to critical respiratory distress syndrome and multiple organ failure. COVID-19 has proved to be a heterogeneous and multifactorial infection with a broad spectrum of clinical presentations influenced by age, gender, comorbidities, ethnicity, and host genetics, including human leukocyte antigen (HLA) genotypes. In this challenging context, our aim was to study host genetic factors associated with COVID-19 severity. A better understanding of the interplay between host genetics and SARS-CoV-2 is, in fact, essential for disease prediction and for supporting the development of targeted therapies. Several efforts have been made worldwide to discover the genetic determinants of COVID-19 susceptibility, severity, and outcomes; the topic has become a highly active research area because of its relevance to the whole community, as shown by the COVID-19 Host Genetics Initiative (HGI) and the COVID Human Genetic Effort (HGE) consortia. This dissertation presents a novel approach to identifying host risk factors predisposing to the disease. The innovation lies in taking into account different aspects of genome variability, from Single Nucleotide Variants (SNVs) to Copy Number Variations (CNVs), through a gene-based approach to representing genetic data.
The gene-based Boolean representations were used as input features for machine learning models; they were tested separately and ultimately all together to improve our ability to predict COVID-19 outcomes and to identify genes and variants predisposing to severe outcomes. Overall, this method led us to identify some important genetic determinants involved in COVID-19 severity, which are discussed in the final chapters of the thesis. Chapter 1 provides an overview of the background and state-of-the-art technologies to guide the reader through the work. Chapter 2 gives an exhaustive description of the bioinformatic pipelines, optimization procedures, and methods adopted in our work. Chapters 3 and 4 present our first findings and introduce the reader to the complexity of the study. The applications of our novel approach, i.e., the Boolean features and machine learning model, are reported in Chapters 5, 6, and 7. The last results chapter, Chapter 8, discusses the challenges and results of applying machine learning methods to Boolean features representing copy number variants. The main stages and discoveries of our research are summarized and discussed in the concluding remarks, which close the dissertation in Chapter 9.
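
As an illustration of what a gene-based Boolean representation might look like, the following minimal sketch marks each gene as True for a sample when that sample carries at least one qualifying variant in the gene, then concatenates SNV-based and CNV-based features into a single vector. The abstract does not specify the encoding actually used in the thesis, so the rule "at least one call in the gene", the function names `boolean_gene_features` and `combine`, and the sample and gene identifiers are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical input format: one (sample_id, gene) pair per qualifying variant call.
Call = Tuple[str, str]


def boolean_gene_features(
    calls: Iterable[Call], samples: List[str], genes: List[str]
) -> Dict[str, List[bool]]:
    """Return one Boolean vector per sample, with one entry per gene.

    Assumed rule (not taken from the thesis): an entry is True when the
    sample carries at least one call in that gene.
    """
    carried = {sample: set() for sample in samples}
    for sample_id, gene in calls:
        if sample_id in carried:
            carried[sample_id].add(gene)
    return {s: [g in carried[s] for g in genes] for s in samples}


def combine(*feature_sets: Dict[str, List[bool]]) -> Dict[str, List[bool]]:
    """Concatenate per-sample Boolean vectors from several feature sets."""
    samples = feature_sets[0].keys()
    return {s: [v for fs in feature_sets for v in fs[s]] for s in samples}


# Toy data with made-up identifiers.
samples = ["P1", "P2", "P3"]
genes = ["GENE_A", "GENE_B"]
snv_calls = [("P1", "GENE_A"), ("P1", "GENE_A"), ("P3", "GENE_B")]
cnv_calls = [("P2", "GENE_B")]

snv_features = boolean_gene_features(snv_calls, samples, genes)
cnv_features = boolean_gene_features(cnv_calls, samples, genes)
all_features = combine(snv_features, cnv_features)
```

Each feature set can be passed to a classifier on its own, or the combined vectors in `all_features` can be used, mirroring the "separately and ultimately all together" testing described above.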
Files in this item:

phd_unisi_085334.pdf (open access)

Description: PhD thesis
Type: Publisher's PDF
License: PUBLIC - Public with Copyright
Size: 16.9 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1160873