Autonomous Car Data: Going Driverless? Get Excited About Petabytes

Oct 20, 2021

If you search YouTube for “fully autonomous vehicles,” you’d swear the long-promised future of self-driving cars, a la Total Recall or Batman, was already here. Tesla proclaims we are seeing “the beginning of the end for human-driven cars.” Google-owned Waymo released a video showing that “fully autonomous driving technology is here.”

Surely, after so many years of development, the long-awaited goal of fully automated vehicles must be arriving very soon.

Not quite.

In fact, maybe not even in this decade. Reality is always harder than hype.

How Self-Driving Cars Use Big Data and AI

Fully driverless cars require sophisticated artificial intelligence (AI), all of which is based on massive training efforts with potentially petabytes of source data. Driving around cones on an empty racetrack is one thing; navigating through dense urban streets with random pedestrian traffic and distracted adjacent drivers is something else.

Getting the autonomous driving industry from the former to the latter poses an intellectual and computing challenge spanning from datacenters to edge nodes (cars).

Will we get to fully autonomous driving? Almost certainly. But it’ll likely require years of working through the stages of autonomous driving, refining technology, and adapting infrastructure as we go. Much of making this future come to pass will depend on ever-improving ML training and on efficiently handling the mountains of data that training requires.

Leveling Up — 6 Levels of Self-Driving Cars

On the road from fully manual cars to Total Recall’s Johnny Cabs, the Society of Automotive Engineers has defined six levels:

Level 0

No automation. If you learned to drive before the 1990s, it was probably in a Level 0 vehicle.

Level 1

Driver assistance for either steering or speed, one function at a time. Think of adaptive cruise control or lane-keeping assistance.

Level 2

Partial automation. The car can control multiple functions simultaneously. Level 2 is the highest level now widely available commercially, including from Tesla. Think of it as “hands-free until needed.”

Level 3

Conditional automation. The car has a human driver present, but the vehicle can perform all driving tasks under normal conditions. The Honda Legend is Level 3, but, according to Reuters, only 100 will be sold in Japan. U.S. regulations currently don’t allow Level 3 vehicles.

Level 4

High automation. Essentially, the vehicle can handle everything, and the driver can sleep peacefully, but a driver still needs to be present, just in case. Most autonomous driving R&D now focuses on Level 4.

Level 5

Full autonomy. No driver is needed in the car. For that matter, neither is a steering wheel or a “driver’s seat.” When some vehicle makers talk about redesigning cars from the ground up for autonomy, this is what they mean. Even the shape of the vehicle may change to optimize for sensor placement.

As a point of reference, the Apple Car is slated to be Level 4, and industry rumors don’t expect a launch until 2025 or 2026.

Data Load — How Much Data Will an Autonomous Car Generate?

Not surprisingly, the higher one moves up the autonomous driving level ladder, the larger the data load required. In 2015, a Hitachi paper (now missing from its site) estimated that connected cars equipped with mobile broadband would upload 25GB of data to the cloud every hour. In 2016, Intel predicted that autonomous vehicles would generate (very roughly) 4000GB of data per day divided across the following device types:

Cameras: 20–40MB per second
Radar: 10–100KB per second
Sonar: 10–100KB per second
GPS: 50KB per second
Lidar: 10–70MB per second

These numbers turned out to be underestimates when compared with a Lucid Motors presentation given at Flash Memory Summit the following year.

Intel noted, “Every autonomous car will generate the data equivalent of almost 3,000 people. Extrapolate this further and think about how many cars are on the road.” One estimate from AAA noted that a single car could generate up to 5100TB of data annually.
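To see how the per-sensor rates relate to the 4000GB-per-day figure, here is a rough back-of-the-envelope sketch. It assumes a single data stream per sensor type and decimal units (1GB = 1000MB); real vehicles carry several cameras and radars, so treat the output as illustrative only. The function and variable names are made up for this example.

```python
# Per-sensor data rates in MB/s as (low, high), from the Intel list above.
# Assumption: one stream per sensor type; real cars carry several of each.
SENSOR_RATES_MB_S = {
    "cameras": (20.0, 40.0),
    "radar": (0.01, 0.1),   # 10-100 KB/s
    "sonar": (0.01, 0.1),   # 10-100 KB/s
    "gps": (0.05, 0.05),    # 50 KB/s
    "lidar": (10.0, 70.0),
}

DAILY_ESTIMATE_GB = 4000  # Intel's rough per-day figure


def total_rate_mb_s(rates):
    """Sum the low ends and the high ends of every sensor range."""
    low = sum(lo for lo, _ in rates.values())
    high = sum(hi for _, hi in rates.values())
    return low, high


def gb_per_hour(rate_mb_s):
    """Convert MB/s to GB/hour using decimal units (1 GB = 1000 MB)."""
    return rate_mb_s * 3600 / 1000


low, high = total_rate_mb_s(SENSOR_RATES_MB_S)
print(f"Combined rate: {low:.2f} to {high:.2f} MB/s")
print(f"Per hour: {gb_per_hour(low):.0f} to {gb_per_hour(high):.0f} GB")
print(f"Driving hours to reach {DAILY_ESTIMATE_GB} GB: "
      f"{DAILY_ESTIMATE_GB / gb_per_hour(high):.1f} to "
      f"{DAILY_ESTIMATE_GB / gb_per_hour(low):.1f}")
```

Under these assumptions the combined rate lands at roughly 30 to 110MB per second, which means 4000GB corresponds to about 10 hours of driving at the high end of every range. Cameras and lidar dominate; radar, sonar, and GPS barely register.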

Handling the Load — How Do Self-Driving Cars Collect Data?

The mix of sensors may change. The algorithms may evolve.

But the fundamental idea will remain constant: Autonomous driving requires a lot of data from a lot of sensors, and that data will need to be handled in different ways.

In terms of real-time driving, much of it must be handled in-car, because there’s simply no time to send it to the cloud for processing. However, the AI software that governs those split-second operations must all be trained by absorbing and distilling countless thousands of hours of vehicle footage. Waymo alone collected data on five million driven miles from June 2015 to February 2018. That’s where data centers come in.

The training process is much the same as with conventional computer vision and machine learning: researchers create models to identify objects. Each car collects so much data that it goes into the garage to offload it via 100G Ethernet! Much of this data gets pre-processed at the edge, both to filter out unneeded data and to apply the metadata and object tagging that will help with neural network training.
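Why 100G Ethernet? A quick calculation makes the case. The sketch below estimates how long it would take to move one day of data (using Intel's rough 4000GB figure) over links of different speeds. It assumes the link runs at full line rate with no protocol overhead, which is optimistic; the `efficiency` parameter is a hypothetical knob for modeling a lower usable fraction.

```python
def transfer_seconds(data_gb, link_gbit_s, efficiency=1.0):
    """Seconds to move data_gb gigabytes over a link of link_gbit_s gigabits/s.

    efficiency is the usable fraction of line rate (hypothetical parameter).
    """
    if link_gbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    data_gbit = data_gb * 8  # 8 bits per byte
    return data_gbit / (link_gbit_s * efficiency)


DAILY_DATA_GB = 4000  # Intel's rough per-day estimate quoted earlier

for link in (1, 10, 100):
    seconds = transfer_seconds(DAILY_DATA_GB, link)
    print(f"{link:>3} Gbit/s: {seconds / 60:.1f} minutes")
```

At line rate, a 100 Gbit/s link moves a day of data in a little over five minutes, while a 1 Gbit/s link would need close to nine hours. For a fleet that has to be back on the road the next morning, that difference matters.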

Then researchers perform testing and algorithm refinement with those mountains of data. Obviously, accuracy and reliability are critical for driving applications, and, as with other AI models, the larger the training dataset, the greater the accuracy levels.

Of course, neural network training is not a one-and-done affair. Researchers need to run as many simulations as possible, compare results and tweak parameters and routines in a never-ending quest for improvement.

To make this process feasible, a high degree of compute parallelism is required, as well as ample network and storage bandwidth. Many infrastructures struggle with these loads because they were designed to scale up rather than scale out. Scaling out, when done properly, provides lower latency and a superior ability to handle dataset growth while maintaining performance and affordability.

One issue with autonomous vehicle training is that all data is “hot.”

Researchers don’t want to break the dataset into smaller chunks, because then the model is working with less data per run, which can impair accuracy. The entire dataset can be tens of petabytes, potentially even 100PB.

Pre-processing can minimize some of this, but not much. Image and video data is notoriously difficult to compress without loss. Many storage architectures aim to keep hot data in flash, but at these volumes doing so is startlingly cost-prohibitive.
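To get a feel for what an all-hot dataset of this size demands from storage, consider the sustained read rate needed to stream the whole thing once. The sketch below is illustrative only: the 24-hour pass and the 5GB/s per storage server are hypothetical inputs, not measurements, and it assumes bandwidth adds up linearly as servers are added.

```python
import math

PB = 10**15  # bytes in a petabyte (decimal)
GB = 10**9   # bytes in a gigabyte (decimal)


def required_throughput_gb_s(dataset_pb, hours_per_pass):
    """Sustained read rate (GB/s) needed to stream the whole dataset once."""
    seconds = hours_per_pass * 3600
    return dataset_pb * PB / seconds / GB


def servers_needed(throughput_gb_s, per_server_gb_s):
    """Storage servers required if bandwidth adds up linearly across servers."""
    return math.ceil(throughput_gb_s / per_server_gb_s)


# Hypothetical inputs: a 24-hour pass and 5 GB/s per storage server.
for dataset_pb in (10, 100):
    rate = required_throughput_gb_s(dataset_pb, hours_per_pass=24)
    print(f"{dataset_pb} PB in 24 h: {rate:,.0f} GB/s, "
          f"{servers_needed(rate, per_server_gb_s=5)} servers")
```

With these made-up inputs, a single daily pass over 100PB needs more than 1TB per second of sustained reads, spread across a couple of hundred servers. No single box delivers that, which is why aggregate bandwidth across many nodes is the number to watch.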

Quobyte — A Solution for Efficient Autonomous Cars Data Processing

Even so, it is possible to achieve high-bandwidth, reliable, affordable storage for neural network training at the scale demanded by autonomous driving efforts.

Quobyte is proof.

Quobyte’s scale-out approach eliminates the bandwidth bottlenecks that choke machine learning algorithms crunching massive datasets.

An effective storage platform can significantly improve the ROI on mass-scale neural learning efforts, and that’s why storage can’t be an afterthought when planning such projects. Storage needs to be right there at the beginning with computing because there’s no point in having one without the other.

Fortunately, Quobyte capability grows in a nearly linear fashion with on-demand computing, scaling smoothly from tens to thousands of servers.

With Quobyte, we have the future-proof infrastructure necessary to help the auto industry level up throughout this decade. Soon, maybe sooner than most expect, we look forward to hailing our first Johnny Cab.


Originally posted on Quobyte’s blog on October 21, 2021.