How Storage Underpins Tomorrow’s Protein Folding Breakthroughs
Some readers may recall the rise and spread of “mad cow disease” in the early 1990s, more accurately known as bovine spongiform encephalopathy (BSE). With symptoms that resembled dementia and wasting, the fatal disease proved transmissible to humans via ingestion of contaminated tissue and resulted in millions of cows being slaughtered to protect international food supplies. BSE is generally believed to be caused by a misfolded protein, or prion. Misfolded proteins are similarly implicated in Alzheimer’s and Parkinson’s diseases, and perhaps also in multiple sclerosis (MS).
It’s not surprising that BSE introduced much of the world to prions. What is surprising is how little progress we’ve made in understanding their inner workings over the past 30 years, although recent advances in the modeling and economics of protein folding are finally changing that.
Old School Methods
Proteins are the body’s building blocks and nanomachines. They are composed of strings of amino acids. The number and order of amino acids in a string determine how the protein will twist and fold, and the dynamics of the resulting three-dimensional structure then determine how the protein operates. On average, a single cell contains over 40 million proteins. The mechanics and variability of protein folding are so complex that, even today, we remain unable to predictively design proteins that will behave in specific, desired ways. Instead, we use trial and error, cycling through endless, expensive rounds of creation, observation, and assessment.
When I wrote my Master’s thesis on 3D protein structure alignment 15 years ago, the most common method for 3D modeling of protein folding was x-ray crystallography. This technique involves placing a protein within a supersaturated solution, then crystallizing the solution around and through the protein. Technicians then direct an x-ray beam into the crystal, which diffracts the beam much as a prism splits light. As x-rays pass through the protein’s atomic structure, electron clouds scatter them in discernible ways. Sensors gather the exiting x-ray energy, and the resulting density maps form the basis of a 3D atomic model of the protein.
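The physics behind those density maps can be illustrated with a toy calculation. In kinematic diffraction theory, the intensities recorded by the sensors are proportional to the squared magnitude of the Fourier transform of the electron density. The sketch below shows this for a hypothetical 1D density profile (real crystallography works on 3D lattices; the values here are purely illustrative):

```python
import cmath

# Hypothetical 1D "electron density" sampled at N points,
# purely for illustration of the diffraction principle.
density = [0.0, 0.2, 1.0, 0.6, 0.1, 0.0, 0.3, 0.8]
N = len(density)

def diffraction_intensities(rho):
    """Return |F(k)|^2 for each spatial frequency k, where F is the
    discrete Fourier transform of the density. Measured diffraction
    spot intensities are proportional to these values."""
    n = len(rho)
    intensities = []
    for k in range(n):
        F_k = sum(rho[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                  for x in range(n))
        intensities.append(abs(F_k) ** 2)
    return intensities

I = diffraction_intensities(density)
# Because the density is real-valued, the pattern is centrosymmetric
# (Friedel's law): I[k] == I[N - k].
```

Note that only the magnitudes survive: the phases of F(k) are lost at the detector (the famous “phase problem”), which is one reason turning raw diffraction data into an atomic model requires so much downstream computation.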
As you might guess, x-ray crystallography was and remains an arduous, expensive process. Creating the crystallized sample is laborious: getting a protein to crystallize can take weeks to many months, and some proteins won’t crystallize at all. The machines are large, with costs running into the millions of dollars, and they require specialized rooms with anti-seismic motion protection and highly trained staff to operate. The process also generates mountains of data — often over 200TB per run — and the computers that process that data must run around the clock at peak performance. Just as airplanes only make money when they’re in the air, the systems used for protein modeling only deliver acceptable ROI when they keep up with the data load.
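A quick back-of-envelope calculation shows what “keeping up with the data load” means for storage. Taking the 200TB-per-run figure from above and assuming, purely for illustration, a one-day processing window:

```python
# Back-of-envelope: sustained bandwidth needed to stream one run's raw
# data through analysis within a given window. The 200 TB figure is from
# the text; the 24-hour window is an illustrative assumption.
run_size_bytes = 200 * 10**12        # 200 TB (decimal terabytes)
window_seconds = 24 * 60 * 60        # one day of around-the-clock compute

required_bw = run_size_bytes / window_seconds   # bytes per second
print(f"Sustained bandwidth: {required_bw / 10**9:.2f} GB/s")
# -> Sustained bandwidth: 2.31 GB/s
```

That is over 2 GB/s sustained for an entire day, just for one run’s raw data, before accounting for intermediate results or re-analysis passes.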
To reduce that data load, researchers often resort to lossy compression. If you compare the pristine audio from a vinyl record against its streaming MP3 counterpart, you intuitively understand the danger of lossy algorithms. With x-ray crystallography, the loss shows up as reduced precision in the diffraction data, which can sacrifice modeling accuracy and extend the time required for computation. Ideally, you want to retain all raw data, if only because runs are costly, but that requires a high-bandwidth solution integrating high-performance storage for processing with cost-effective bulk storage for long-term retention, all running under a software platform designed for mass-scale, cluster-centric storage. Unfortunately, the NFS architecture used by many life science labs falls short here, and loss of precision remains necessary.
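One simple way to picture the precision cost of lossy compression is coarse quantization: rounding detector values to a fixed step size makes the data far more compressible, but every reading can be off by up to half a step. A minimal sketch, using hypothetical detector values:

```python
# Sketch: lossy compression modeled as coarse quantization.
# Rounding to a step size improves compressibility but introduces a
# worst-case error of step / 2 with round-to-nearest.
def quantize(values, step):
    """Round each value to the nearest multiple of `step` (lossy)."""
    return [round(v / step) * step for v in values]

readings = [0.137, 2.918, 1.004, 0.556, 3.721]  # hypothetical values
step = 0.25
lossy = quantize(readings, step)

max_error = max(abs(a - b) for a, b in zip(readings, lossy))
# max_error is bounded by step / 2 -- but there is no way to recover
# the discarded detail later, which is why retaining raw data matters.
```

The bound on per-value error is reassuring, but the discarded information is gone for good; if a later analysis needs finer detail, the only remedy is another costly run.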
Cryo-EM: Help for Folding Arrives
In the years since x-ray crystallography rose to prominence, other modeling methods have evolved, but all have exhibited similar levels of cost and burden. Cryo-electron microscopy (cryo-EM) may be the first genuine leap in the 3D protein modeling field.
Cryo-EM does not use crystallized samples. Rather, researchers essentially flash-freeze a protein in an aqueous suspension. The process is so fast that ice crystals don’t have time to form. Without those crystals to obstruct the beam, researchers can employ transmission electron microscopy (TEM) to shoot electrons through the sample. The ways in which the protein scatters those electrons create patterns that subsequent analysis then converts into 3D models.
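Each TEM image is, roughly, a 2D projection of the particle’s 3D density along the beam axis, and reconstruction combines thousands of such projections at many orientations. The principle can be sketched one dimension down, projecting a hypothetical 2D “particle” onto 1D lines from two angles:

```python
# Toy sketch: a TEM micrograph is (roughly) a projection of the
# particle's density along the beam axis. Here a 2D grid stands in
# for the particle, and we project it onto 1D from two directions;
# real reconstruction combines thousands of projections.
particle = [
    [0, 1, 0],
    [1, 2, 1],
    [0, 1, 0],
]

# Beam travelling "down" the grid: sum each column.
proj_vertical = [sum(row[j] for row in particle) for j in range(3)]
# Beam rotated 90 degrees: sum each row.
proj_horizontal = [sum(row) for row in particle]

# Every projection conserves the particle's total density ("mass"),
# a basic consistency property of projection-based reconstruction.
total = sum(sum(row) for row in particle)
```

Two projections are nowhere near enough to recover a real structure, of course; the data deluge in cryo-EM comes precisely from needing enormous numbers of high-resolution micrographs of randomly oriented particles.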
Cryo-EM makes producing high-quality samples quicker and simpler, and it applies to a far broader spectrum of proteins. However, cryo-EM machines still cost millions of dollars and carry environmental and staffing requirements similar to those of x-ray crystallography. Fortunately, the cost is counterbalanced by the results. In 2020, two research labs demonstrated that cryo-EM could resolve 3D protein detail down to individual atoms.
Of course, greater resolution means more data. Generating the data for an atomic-resolution cryo-EM scan takes hours to days. The results are likely worth it, though. Combined with new analysis tools, such as the freely available AlphaFold from Google’s DeepMind, cryo-EM brings researchers closer than ever to pairing accurate structure determination with predictions of protein function.
There remain tens of millions of distinct proteins that have yet to be mapped and analyzed. The faster we can map that landscape and combine that knowledge with predictive tools, the sooner we can create remedies for some of the world’s most terrible, destructive diseases. As we’ll discuss in the next post, though, storage infrastructure will prove critical in harnessing this next-gen data deluge and achieving breakthrough scientific success.
Want to Learn More About Quobyte?
Learn how Quobyte provides scalable high-performance storage for Life Sciences
Originally posted on Quobyte’s blog on February 15, 2022.