Imagine if scientists could look at people’s DNA and tell whether they were prone to a major disease, or if doctors could look at a patient’s DNA and prescribe better medication based on it. Today this is possible with technologies such as DNA sequencing and whole genome sequencing (WGS). However, data storage challenges must be overcome to accelerate and improve genome sequencing workflows. This article explains how.
What is Genome Sequencing?
Genome sequencing can help identify genetic variants that can lead to disease, and it can help us understand how our bodies would react to certain medications. As a result, it can help cure or prevent major diseases, or at least reduce their impact on people.
The goal of genome sequencing is to determine the order, or sequence, of every chemical base that makes up your genome. The first human genome was sequenced in the early 2000s. It required the collaboration of many scientists across the world, took about thirteen years to complete, and cost about three billion dollars. This was the result of the Human Genome Project, whose goal was to sequence and map all human genes.
Today, you can get your genome sequenced for less than one thousand dollars, and you can get the results in about two days.
How Does Genome Sequencing Work?
First, a DNA sample needs to be sent to the lab; DNA can be obtained from blood, saliva, hair with follicles, and other sources. One of the most common ways to sequence a genome is to extract the DNA and break it down into smaller fragments. Lab specialists then make hundreds of thousands of copies of those DNA fragments. Next, a batch of colored DNA bases (A’s, C’s, G’s, and T’s) and enzymes is added to the fragments being read. Each colored base binds to its complementary base on the fragment: A pairs with T, and C pairs with G.
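The pairing rule can be sketched in a few lines of Python. This is only an illustration of which base binds to which; the function name `complement` and the example fragment are made up for this sketch and do not come from any sequencing software.

```python
# Pairing rule used during sequencing: A binds T, and C binds G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(fragment: str) -> str:
    """Return the bases that would bind to each base of the fragment."""
    return "".join(COMPLEMENT[base] for base in fragment.upper())


# Hypothetical six-base fragment, for illustration only.
print(complement("ATGCCG"))  # TACGGC
```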
Then, the pieces of DNA with colored bases are passed through a laser, a detector reads the color of each base, and software matches each color to the corresponding base. In this way the sequence is generated one base at a time. The sequencer outputs raw sequences (often called reads), which are short DNA sequences from the genome. Finally, computer programs combine the millions of raw sequences into a sequence of the entire genome.
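To give a feel for how short raw sequences can be combined, here is a toy sketch that repeatedly merges the two sequences with the longest overlap. It is a simplified greedy approach chosen for illustration, not the method used by any particular sequencer or assembly program, and the three example sequences are invented. Real tools also have to cope with read errors, repeats, and millions of inputs.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three invented, overlapping raw sequences.
reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```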
Once your genome has been sequenced, it is compared to a reference genome. This comparison yields the differences between your genome and the reference genome. Some of these differences may include bases being in different positions or missing from your genome altogether.
The differences between two human genomes account for many of the differences between two people: the way they look, the diseases they are prone to, and how their bodies may react to different medications, among others. This is a great benefit because scientists can now determine whether someone is prone to mental disorders and serious diseases. With the help of other technologies, patients can then receive better treatments.
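As a minimal sketch of this comparison, the snippet below lists the positions where a sample differs from a reference. It assumes the two sequences have already been aligned to the same length and that a missing base is marked with "-"; both sequences and the function name `find_differences` are hypothetical. Real pipelines rely on dedicated alignment and variant-calling tools rather than a position-by-position check.

```python
def find_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences must already be aligned to the
    same length, with "-" marking a base that is missing from the sample.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Invented example: one changed base and one missing base.
reference = "ATGGCGTACG"
sample = "ATGACGT-CG"
print(find_differences(reference, sample))  # [(4, 'G', 'A'), (8, 'A', '-')]
```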
The future for genome sequencing is very promising as it will have a positive impact on people’s lives down the line. However, it is important to address some of the challenges it might face due to the amount of data it generates.
Genome Sequencing Data Storage
Genome sequencing generates billions of small files; therefore, it requires a storage platform with low enough latency to process large numbers of files in a short time. Additionally, high-performance computing is required to process all the files coming from the sequencer.
A very common step in the genome sequencing workflow is staging in and staging out: files are copied to a local drive before a computational job starts, and the results are copied back afterwards. This wastes a lot of time and resources. For this reason, the genome sequencing workflow benefits greatly from a file system that can serve data quickly enough for jobs to read it in place. In other words, fast storage should make staging in and out unnecessary.
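The difference between the two approaches can be sketched as follows. Everything here is a stand-in: a temporary directory plays the role of shared storage, `count_bases` plays the role of a real computational job, and the file names are invented. The point is only that the staged version performs two extra copies to reach the same result.

```python
import shutil
import tempfile
from pathlib import Path


def count_bases(path: Path) -> int:
    """Stand-in for a real computational job: count the bases in one file."""
    return sum(len(line.strip()) for line in path.read_text().splitlines())


def run_with_staging(shared_input: Path, shared_output: Path) -> None:
    """Copy the input to local scratch, compute there, then copy the result back."""
    with tempfile.TemporaryDirectory() as scratch:
        local_input = Path(scratch) / shared_input.name
        local_output = Path(scratch) / shared_output.name
        shutil.copy2(shared_input, local_input)        # stage in
        local_output.write_text(str(count_bases(local_input)))
        shutil.copy2(local_output, shared_output)      # stage out


def run_in_place(shared_input: Path, shared_output: Path) -> None:
    """Read and write directly on shared storage, with no extra copies."""
    shared_output.write_text(str(count_bases(shared_input)))


# A temporary directory stands in for the shared storage platform.
with tempfile.TemporaryDirectory() as shared:
    reads = Path(shared) / "reads.txt"
    reads.write_text("ATGGCGTA\nCGTACGTT\n")
    staged_file = Path(shared) / "staged.count"
    direct_file = Path(shared) / "direct.count"
    run_with_staging(reads, staged_file)
    run_in_place(reads, direct_file)
    staged_result = staged_file.read_text()
    direct_result = direct_file.read_text()

print(staged_result, direct_result)  # 16 16
```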
Also, since different computer programs are needed in the genome sequencing workflow, the storage platform should be accessible from different interfaces and should integrate well with other software.
In addition, since object storage is already used in genome sequencing workflows, the storage platform needs to support it. These requirements add a further element, data management, which should be highly automated to facilitate the workflow.
Because of the large amount of data generated to sequence a single genome, cost-effective storage that provides both capacity and performance is needed. Since flash provides great performance, it is important to be able to use it as much as possible. However, using hard disk drives (HDDs) to store data for long periods of time is just as important.
In the case of genome sequencing, since it is an evolving technology that generates great amounts of data, it is essential for the storage platform to be scalable. Capacity should be scalable because as more people decide to get their genome sequenced, even more data will be generated. Additionally, if new sequencing technologies require more performance, the storage platform should provide an easy way to scale performance. A scalable storage platform provides the flexibility that genome sequencing workloads demand.
The storage should also offer proper security, as genome sequencing handles very sensitive data: information about human beings. It is essential that the storage keeps the data safe from hackers and unauthorized personnel. The storage platform should also make it easier to comply with data regulations.
Quobyte is a distributed scale-out file system that linearly scales capacity and performance.
Originally posted on Quobyte’s blog on September 26, 2022.