Celesti, F., Celesti, A., Galletta, A., Fazio, M., Villari, M. (2019). Optimizing the Research of DNA Sequences in a NoSQL Document Database: A Preliminary Study. In Proceedings - IEEE Symposium on Computers and Communications (pp. 1153-1158). 345 E 47th St, New York, NY 10017, USA: Institute of Electrical and Electronics Engineers Inc. [10.1109/iscc47284.2019.8969697].
Optimizing the Research of DNA Sequences in a NoSQL Document Database: A Preliminary Study
Celesti, Fabrizio;
2019-01-01
Abstract
The study of DNA sequences has become indispensable for basic biological research and for numerous applied fields such as comparative genomics, evolutionary biology, pan-genomics, genetics of disease, regulation of gene expression, oncology and many others, all supported by bioinformatics. In the era of Cloud computing, federating the Cloud systems of different genetics research organisations paves the way towards a new era of data sharing and new mashup services and applications. However, due to the huge amount of genomics data (genomics Big Data) that have to be managed, a parallel distributed NoSQL DataBase Management System (DBMS) approach becomes fundamental. Specifically, due to the textual nature of genomics data, a NoSQL DBMS appears to be the most suitable solution. In this paper, by considering the whole human genome, we present a preliminary study that compares a NoSQL solution, MongoDB, with an SQL-based solution, MySQL, for searching DNA sequences. Moreover, in order to optimize the search for genomic codes, we adopt hash functions that map nucleotide sequences of arbitrary size onto data of a fixed, smaller size. Experiments show that MongoDB, apart from simplifying the management of genomics data, provides better performance.
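The hashing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash function or schema the authors use, so everything here is an assumption: SHA-256 from Python's standard library stands in for the hash function, a plain dictionary stands in for an indexed database field, and the names `sequence_key`, `build_index` and `find` are hypothetical.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Map a nucleotide sequence of any length to a fixed-size hex digest.

    SHA-256 is an assumption for illustration; it always yields 64 hex characters.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_index(records):
    """Group record identifiers by the digest of their sequence.

    The dictionary plays the role of an index on a stored hash field.
    """
    index = {}
    for identifier, sequence in records:
        index.setdefault(sequence_key(sequence), []).append(identifier)
    return index


def find(index, query):
    """Return identifiers whose sequence matches the query exactly."""
    return index.get(sequence_key(query), [])


# Hypothetical toy records, not data from the paper.
records = [("seq1", "ACGTACGT"), ("seq2", "TTGACA"), ("seq3", "acgtacgt")]
index = build_index(records)
matches = find(index, "ACGTACGT")
```

Comparing fixed-size digests instead of full sequences keeps the lookup key small regardless of sequence length. A sketch like this supports exact matches only; substring or approximate searches would need a different scheme.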
https://hdl.handle.net/11365/1278074