Summary: Detailed, data-driven analysis of plankton can yield remarkable insights into the state of oceanic health and the world’s food chain. Collecting and crunching that data relies on efforts by Oregon State University’s Hatfield Marine Science Center and the OSU Center for Quantitative Life Sciences. Executing this ongoing big data project requires very powerful IT infrastructure driven, in part, by Quobyte storage. Quobyte’s functionality and cost efficiency help make the project robust, performant, easily managed, and globally available.
By now, most people have at least heard about bee colony collapse disorder and understand its bottom line: If the bees die out, our food chain dies with them. Somewhat less widely understood is the role that plankton plays in the oceanic food chain. Phytoplankton, like plants, use photosynthesis to consume vast amounts of carbon dioxide and produce oxygen throughout the seas. In turn, animal-like zooplankton consume phytoplankton; the zooplankton, famously eaten in massive quantities by whales, also form the diet foundation for countless small fish. A plankton die-off can have massive negative impacts on humanity and world ecology. Thus, it makes sense to keep a close eye on plankton populations around the world and track them over time.
That’s where the difficulty — and fascinating computer science — starts. It’s a journey that begins on the smallest scales in the world’s seas and reaches success in some of the most advanced high-performance computing (HPC) infrastructure available today.
Capturing the Invisible
Most plankton are microscopic, which makes manually counting them maddeningly laborious and infeasible to scale for an extensive study. Seven years ago, Kaggle and Booz Allen Hamilton teamed up to sponsor the first National Data Science Bowl, which offered $175,000 in prize money to teams that could devise the best algorithms for analyzing data provided by Oregon State University’s Hatfield Marine Science Center.
That data originated (and continues to flow) from a sampling project run by the Hatfield Marine Science Center’s Plankton Ecology Lab. The lab sends staff aboard a National Science Foundation ship for week-long collection expeditions, towing an in situ ichthyoplankton imaging system, or ISIIS (pronounced like the Egyptian goddess). An ISIIS is essentially a camera and LED lamp mounted on a submersible frame that uses fins to help control its depth underwater. The LED generates a narrow band of light. As particles and plankton pass through this light, their shadows fall into the field of focus of a line-scanning 8K camera. The camera’s exceptional resolution can capture particles measuring only 1 mm to 13 mm while the ship is zipping along at 5 knots (just under 6 miles per hour).
Not surprisingly, images of cast shadows, especially when projected in moving water, tend to be fuzzy. This complicates the job of isolating plankton from other particles as well as identifying the specific plankton(s) in any given image. This video offers a solid overview of the ISIIS, its operation, and subsequent data analysis.
Putting Data in the Pipeline
According to Christopher Sullivan, Assistant Director for Biocomputing at the OSU Center for Quantitative Life Sciences (CQLS), each plankton imaging excursion involves five days of imaging, resulting in hundreds of terabytes of stored video. When the research vessel docks at the Hatfield Marine Science Center in Newport, Oregon, the data streams from the ship’s onboard server room across a dedicated 200 Gbps network line and into the CQLS, located roughly 50 miles away across the coastal mountains. Once on the CQLS servers, the raw data is processed with artificial intelligence built on one of the winning algorithms from that first National Data Science Bowl. However, despite the algorithm’s genius and utility, the plankton workload represents a towering amount of analysis, even for OSU’s supercomputing resources. Fortunately, the CQLS recently upgraded to a range of specialized compute and storage infrastructure optimized for exactly this type of job, including changes to how GPUs are deployed to accelerate highly parallelized workloads.
“We implemented GPU technologies with IBM that were actually socketed to the motherboard,” says Sullivan. “Normally, one of the biggest problems with using GPUs is getting data in and out of the GPU through the PCIe Gen 3 bus. It’s not designed to move this much data, and it just chokes. But being able to put the GPU on the motherboard, along with the processor and memory, allowed us to move our data directly into the GPU almost 60x faster. We went from processing things in weeks to days.”
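Sullivan’s point about data movement is easy to demonstrate in any GPU workload: the copy from host memory into the GPU is frequently the slowest link in the chain. The following is only a rough illustration of measuring that cost, not CQLS’s code; the use of PyTorch and the 1 GiB batch size are assumptions for the sketch.

```python
# Minimal sketch (not the CQLS pipeline): time how long it takes to move a batch
# of data from host memory into the GPU, the step Sullivan identifies as the
# classic PCIe bottleneck. Requires PyTorch built with CUDA support.
import torch

def time_host_to_device(num_mb: int = 1024) -> float:
    """Copy `num_mb` megabytes of pinned host memory to the GPU; return seconds."""
    # Pinned (page-locked) memory enables faster, asynchronous DMA transfers.
    host_batch = torch.empty(num_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    host_batch.to("cuda", non_blocking=True)   # host -> device copy
    end.record()
    torch.cuda.synchronize()                   # wait for the copy before reading timers
    return start.elapsed_time(end) / 1000.0    # elapsed_time() reports milliseconds

if __name__ == "__main__":
    seconds = time_host_to_device()
    print(f"Moved 1 GiB host -> GPU in {seconds:.3f}s "
          f"({1.0 / seconds:.1f} GiB/s effective bandwidth)")
```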
The ability to capture more data, build larger datasets, and process those datasets in feasible time frames all go hand in hand and collectively yield some subtle benefits. For example, Sullivan notes that working through so much more data allows scientists to remove bias. Larger sample sizes, along with the ability to combine multiple datasets, allow for broader questions that are less predisposed to point toward particular answers.
Producing those massive datasets, though, is no small task. Each plankton dataset begins as roughly 100TB of raw video, which is uploaded onto Quobyte storage at CQLS. Initial processing at CQLS scans the video frame by frame, chops out every speck, and turns each speck into a unique JPEG. Thus, one dataset can result in billions of smaller images, which together amount to about 300TB of pictures. However, water conditions can vary between expeditions, locations, and even depths. Water color, amount of light, amount of detritus, types of life forms present: scores of variables make each dataset different from the others. This requires running each dataset through the project’s neural net and verifying the results.
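To make the frame-by-frame extraction step concrete, here is a simplified sketch of that kind of processing. It is not the CQLS production pipeline; the file names, the OpenCV-based approach, and the minimum-area threshold are illustrative assumptions.

```python
# Simplified sketch: read shadowgraph video, threshold each frame, and write
# every detected particle out as its own JPEG. Requires OpenCV (opencv-python).
import cv2
from pathlib import Path

def extract_particles(video_path: str, out_dir: str, min_area: int = 50) -> int:
    """Segment dark particles from each frame and save each as a separate JPEG."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    saved = 0
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of video
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Particles cast dark shadows on a bright background, so invert the threshold.
        _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            if cv2.contourArea(contour) < min_area:
                continue  # skip specks too small to classify
            x, y, w, h = cv2.boundingRect(contour)
            crop = frame[y:y + h, x:x + w]
            cv2.imwrite(str(out / f"frame{frame_idx:07d}_obj{saved:09d}.jpg"), crop)
            saved += 1
        frame_idx += 1
    capture.release()
    return saved

# Example: extract_particles("expedition_segment.avi", "particles/")
```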
“We have confirmed AI trainings that we know work,” says Sullivan. “We’ll evaluate a new dataset with this training, see what we get back, and then validate by hand. Once we know everything is OK, then the dataset can start feeding into the larger cyber infrastructure. If everything’s not OK, like if all results come back as detritus, then we need to construct a modified training that not only works for the current dataset but also won’t invalidate all the work we’ve done before. This is where we use our IBM AC922 machines. Without the IBMs, a training would take about six weeks. With them, it’s about one week.”
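The sanity check Sullivan describes can be thought of as a distribution test on the model’s output. The sketch below is a hedged illustration of that idea only; the classify_image callback, the “detritus” label, and the 95% threshold are assumptions, not CQLS code.

```python
# Run an already-confirmed model over a sample of the new dataset and flag it
# for manual review (or retraining) if the class distribution looks degenerate,
# e.g. nearly everything labeled as detritus.
from collections import Counter
from typing import Callable, Iterable

def needs_new_training(image_paths: Iterable[str],
                       classify_image: Callable[[str], str],
                       suspect_label: str = "detritus",
                       max_fraction: float = 0.95) -> bool:
    """Return True if one label dominates the sample, suggesting the existing
    training does not transfer to the new water conditions."""
    counts = Counter(classify_image(path) for path in image_paths)
    total = sum(counts.values())
    if total == 0:
        return True  # nothing classified at all: definitely investigate
    dominant_label, dominant_count = counts.most_common(1)[0]
    print(f"Most common label: {dominant_label} "
          f"({dominant_count / total:.1%} of {total} sampled images)")
    return dominant_label == suspect_label and dominant_count / total > max_fraction
```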
Extending this further, CQLS, working with its infrastructure partners, can run through an entire expedition workload, including any necessary training, in two to three weeks. Sullivan says that without his current workflow and infrastructure, it would take several months.
Managing the Load
Once a dataset has been validated against the training, CQLS can divide the workload into pieces, which then either run on-site or are uploaded to infrastructure partners for subsequent analysis. From there, researchers distill the processed data into expansive CSV files, which are shared among CQLS and its partners.
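The appeal of CSV as a hand-off format is that every collaborator can load it without special tooling. A minimal sketch of that export step follows; the column names and record structure are illustrative assumptions, not the project’s actual schema.

```python
# Flatten per-image classification results into a CSV file for sharing.
import csv

def write_results_csv(records: list[dict], csv_path: str) -> None:
    """Write one row per classified particle image."""
    fieldnames = ["image_file", "expedition", "frame", "predicted_class", "confidence"]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

# Example usage:
# write_results_csv(
#     [{"image_file": "frame0000001_obj000000042.jpg", "expedition": "2021-07",
#       "frame": 1, "predicted_class": "copepod", "confidence": 0.97}],
#     "expedition_2021-07_results.csv",
# )
```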
Spotlight on Storage Support
Oregon State’s Christopher Sullivan readily admits that no storage solution is perfect. CQLS could have selected a free storage architecture, and it could have deployed faster SSDs rather than higher-capacity hard drives. At every step, there are trade-offs that must be judged against budgets and objectives.
“Every deployment has pain,” he says. “The question is whether you can keep the pain in check. For example, our initial throughput with Quobyte wasn’t quite where we expected. But Quobyte support stepped in and worked with us to get us where we needed. Their support is amazing.”
This support, combined with Quobyte’s attractive economics, allows Oregon State to fulfill its promise to provide backup services to all its datacenter clients. Customers need only provide CQLS with the drives they wish to use for storage. Sullivan’s group handles the rest, including backup services at no charge.
“We only charge people our server cost, never more than what we pay. That’s the university’s policy.” He adds with a laugh, “I wish I could use it to make money for our group, but backup is truly a gift to the community from the university. It’s our responsibility.”
All “hot” data at CQLS resides on its 2.5PB of Quobyte storage. Sullivan states that one of Quobyte’s greatest advantages is its superior tenancy handling, which is essential when managing many HPC projects, each of which (like the plankton study) may integrate multiple collaborators. In particular, Quobyte enables quality of service (e.g., hot and cold prioritization) based on tenancy, even down to the folder level, while still providing high, cost-efficient storage density per server.
Once analysis completes, CQLS copies its source video data onto commodity hard drives, places those drives into cold storage, then wipes the source content from the Quobyte servers to prepare for the next dataset from Newport.
Another Quobyte advantage is its flexibility with deployment locality. With CFS infrastructure, CQLS must keep servers in close proximity. As the computing group has grown, this proximity requirement has strained how much power the school can bring into a given location; it’s literally constrained by the electrical grid and the university’s power infrastructure. Quobyte’s architecture, by contrast, lets the system operate over a broader area, so CQLS can scale more readily and use its physical resources more efficiently.
Why Petabytes Matter
All that data and infrastructure amounts to far more than an interesting analytics exercise. Plankton conditions offer a unique insight into the health of the oceanic ecosystem and the state of climate change’s impact upon it. This also applies to chemical agents generated through industry and sometimes passed to the oceans through our agricultural and pharmaceutical practices. At a time when nature has been largely politicized, the value of objective facts, trends, and correlations can be immense.
The biggest questions do have answers, but reaching those answers often depends on having the best technologies and analysis methods available. This is why CQLS relies on optimized computing infrastructure, including Quobyte as a vital part of its storage solution.
If CQLS’s compute and storage infrastructure can’t keep up with that weekly dataset cadence, the entire project risks becoming logjammed. The demand is actually much higher than that, because CQLS also manages partial plankton datasets arriving from France, Japan, and Brazil, which handle some of their own compute needs. Strong and innovative as CQLS’s facilities are, cloud processing support from the National Science Foundation proved essential for staying on top of the workloads. All of this depends on Sullivan’s team having designed these workloads for fractional distribution, allowing data subsets to move across resources ranging from desktop PCs to top-end HPC clusters, wherever and whenever needed for optimal processing. Sullivan notes that he can watch where the group’s data is being processed in real time on a spinning globe model. A rough sketch of what such fractional distribution can look like appears below.
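The following sketch shows one simple way to carve a dataset into portable pieces and spread them across compute sites. The real CQLS orchestration is considerably more elaborate; the site names, chunk size, and manifest structure here are illustrative assumptions.

```python
# Split a dataset manifest into fixed-size chunks and round-robin them across
# whatever compute sites are available.
from itertools import islice
from typing import Iterator

def chunked(paths: list[str], chunk_size: int) -> Iterator[list[str]]:
    """Yield successive chunks of the image manifest."""
    iterator = iter(paths)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk

def assign_chunks(paths: list[str], sites: list[str],
                  chunk_size: int = 100_000) -> dict[str, list[list[str]]]:
    """Round-robin chunks of work across compute sites."""
    assignments: dict[str, list[list[str]]] = {site: [] for site in sites}
    for index, chunk in enumerate(chunked(paths, chunk_size)):
        assignments[sites[index % len(sites)]].append(chunk)
    return assignments

# Example: assign_chunks(manifest, ["local-hpc", "nsf-cloud", "partner-site"])
```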
Originally posted on Quobyte’s blog on January 19, 2022.