5 simple improvement areas for your weather data storage

Meteorological data can be a bit of a beast to deal with from a storage perspective. Ideally you want your data physically close to your compute resources to reduce network latency during access, and uncompressed to optimize read times and allow you to target just the bytes you need. But at the same time you also don't want to blow your entire IT budget on storage space. For most weather companies, unoptimized data storage practices always rank highly in the "how we piss money away" category, coming in second only to wasted compute resources.

Weather data comes in all shapes and sizes, and there's no one-size-fits-all best solution for storing it, but there are a lot of things you can do to improve your storage situation, with varying levels of complexity and payoff.  We'll kick things off in this post with some of the simplest, lowest-hanging fruit solutions you can employ, and talk about some more involved techniques in future posts.

1. Don't archive what's already archived

No shit, right? While this sounds obvious, duplicated data shows up an alarming amount of the time. And not all duplicated data is bad, such as in the case of a solid backup strategy or the temporary duplication of data while testing an update to a data download process. What I'm talking about here falls into two main camps:

  • Internally duplicated data, i.e. copies of the same data spread across different locations in your internal network. One of the major reasons for this is duplicating data archives across development and production environments, usually as a result of having data scrapers running in both environments. While I'm all for testing in separate environments, once your scraper has passed testing and is deployed in production, it's time to shut down the development version. Any processes in your development environment that need to read data can read that data from the production archive. Grant read-only privileges to the archive from development and sleep better at night knowing your data is protected and you've cut your processing and storage costs in half.
  • Duplication of an external archive. Many people seem to have an obsession with "owning" data. They're not content to just pull data from an external archive when they need it; they have to have their own copy on standby, ready to go at a moment's notice. This makes sense if access to the external archive is unreliable, or if it only holds a sliding window of data and you want to retain the older stuff, but not so much if you're just copying data from their S3 bucket to yours in the same region. NOAA's Big Data Project makes a ton of data available through its joint ventures with Google and Amazon and it doesn't cost you a bean, so why pay extra just to have a copy of it in your own bucket? A quick sketch of pulling data straight from the public bucket follows this list.
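To make that second point concrete, here's a minimal sketch of reading GFS output directly from NOAA's public bucket with anonymous credentials instead of mirroring it into your own storage. The bucket name and key layout are the publicly documented ones at the time of writing; treat them as assumptions and verify against the current listing before relying on them.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client -- NOAA Big Data Project buckets are public, no credentials needed
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List part of one model cycle straight from NOAA's GFS bucket
# (bucket and prefix reflect the documented public layout at the time of writing)
resp = s3.list_objects_v2(
    Bucket="noaa-gfs-bdp-pds",
    Prefix="gfs.20240101/00/atmos/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# When a process actually needs a file, fetch it on demand rather than keeping
# a standing copy in your own bucket:
# body = s3.get_object(Bucket="noaa-gfs-bdp-pds", Key=resp["Contents"][0]["Key"])["Body"].read()
```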

2. Leverage the appropriate object storage tiers

This one is as easy to overlook as it is to fix. When first developing a new data scraper, it's common practice to throw the data into object storage, and the bucket is usually created in the "standard" tier of your preferred cloud services provider due to its responsiveness. Unfortunately we often forget to go back and revisit that choice 6 months or a year later when we have a clearer picture of the access patterns of the data in question.

If you're just stockpiling data that's touched infrequently (if ever), moving the data to a cheaper storage tier can make a big difference in your bottom line. With most cloud providers this is as simple as a few clicks of the mouse in the console or a one-liner CLI command. I suggest putting a recurring meeting on the IT calendar to revisit your storage once every quarter simply to update the storage tiers as necessary - a few hours each quarter can reap major budgetary benefits.
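As one concrete way to do this on AWS, here's a minimal sketch of a lifecycle rule set via boto3 that tiers data down automatically instead of relying on someone remembering to do it. The bucket name, prefix, 90-day threshold, and target tier are all placeholders to tune to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Transition model output under a given prefix to a colder tier after 90 days.
# Bucket, prefix, threshold, and storage class here are illustrative only.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-weather-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-model-output",
                "Filter": {"Prefix": "gfs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
            }
        ]
    },
)
```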

3. Only keep what you need

I know, we're back to the "no shit" response, but I bet you $100 that if we look at the data you're storing right now, you're keeping data you don't need. I'm not even talking about data you might one day need if the planets align. I'm talking about data you absolutely will never, ever touch. Like our earlier entry in the "no shit" offering, this one falls into two categories:

  • Old-ass data. You know this data. It's been sitting in some dusty corner of your storage since the Dark Ages and probably offers a similar degree of intellectual value. More importantly, it's so old that it's basically only useful for climatological studies at this point, only your shop doesn't perform climatological studies, not to mention the same data is readily available through half a dozen online archives (see point #1). Do not be afraid to clean house. Better yet, agree on an internal policy on the maximum age of each dataset and implement an automated time-to-live (TTL) on each data store so you don't even have to worry about spring cleaning or getting gun-shy when it comes time to hit the delete button.
  • Unused fields and forecasts. Did you know the GFS atmos pgrb files contain, at the time of this writing, 743 fields? Did you also know you usually don't need to download the entire GRIB file to get the specific fields you want? No? Well then tune into a future article where we'll talk about that, but in the meantime, if you're one of the poor souls who's downloading the entire 743-field GRIB file and storing it in your archive, listen up. There are undoubtedly a lot of fields in those files you will never touch, and some you probably have no idea what they even are. So why are you paying to store them? In a similar vein, the GFS goes out 16 days. If your shop offers short-term forecasts, then ditch the lead times you don't care about. GFS may be an extreme example, but I've yet to work with a shop that isn't storing that model and usually retaining way more of it than they need. A sketch of trimming a GRIB file down to just the fields you use follows this list.
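Here's the sketch promised above: cutting a full GFS GRIB file down to a handful of fields before it ever hits your archive, using xarray with the cfgrib engine (both assumed to be installed). The file name and field list are illustrative; swap in whatever your products actually consume.

```python
import xarray as xr

path = "gfs.t00z.pgrb2.0p25.f024"   # illustrative local file name
fields = ["2t", "10u", "10v"]       # 2 m temperature and 10 m wind components

# cfgrib decodes one field at a time via filter_by_keys; drop the conflicting
# level coordinate so the 2 m and 10 m fields merge cleanly into one dataset
subset = xr.merge(
    xr.open_dataset(
        path,
        engine="cfgrib",
        backend_kwargs={"filter_by_keys": {"shortName": name}},
    ).drop_vars("heightAboveGround", errors="ignore")
    for name in fields
)

# Archive the trimmed subset instead of the full 700+ field file
subset.to_netcdf("gfs_f024_surface_subset.nc")
```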

4. Choose the correct storage format

I'll admit this can be a fairly subjective topic, as the options for storing meteorological data are as varied as they come, but in the context of optimizing storage costs, there are a few things worth mentioning:

  • Do not use CSV for gridded data. This is far and away #1 on my shit list; nothing else comes close. Yes, CSVs are compressible. No, you won't reach the same degree of compression in a CSV as you would from a well-constructed GRIB or NetCDF file, and even if you do, trying to work with that data in any application is going to absolutely decimate your performance. CSVs are typically used by people who are new to wrangling gridded data because they're human readable, and from a purely short-term investigatory perspective they can be useful, but do yourself a favor and learn one of the myriad data packages that allow you to work with binary files natively.
  • Revisit poorly designed relational databases. You know the kind - they're basically a CSV shoved into one wide database table. Imagine you're storing surface observations in PostgreSQL or MySQL. If you have one table that repeats the latitude and longitude for every observation at every time, then you're not leveraging the "relational" aspect of a relational database. Break data that rarely changes, such as the position of a weather station, out into its own table or tables, store the observations in their own table, and join the two (i.e. normalize your data). Oh, and pay attention to your data types - you don't need a 64-bit data type to store temperature data, or really any other meteorological field, unless you have one baller piece of instrumentation that can accurately measure atmospheric properties to 10 decimal places. And even if you do, do you really need that level of precision?
  • Chunking. Many gridded products can benefit from being stored in a format that supports chunking. If you're not familiar with the term, it simply means subdividing the data into defined blocks (chunks). When a process needs to access a specific subset of the data, it can simply access the chunk or chunks that contain the subset, thereby reducing memory requirements and bandwidth (if loading from a remote location). An additional benefit is that each chunk can be compressed individually, providing a nice balance of byte-range access and minimal storage space, as opposed to compressing the entire dataset into a single blob, which would then require decompressing the whole thing when reading it in. The Zarr format is one excellent example, and most of the major coding languages have libraries that support chunked formats. A short sketch follows this list.
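To illustrate the chunking point, here's a minimal sketch using xarray (with dask and zarr assumed to be installed). The grid size and chunk shape are illustrative; choose chunks to match your dominant access pattern.

```python
import numpy as np
import xarray as xr

# Toy gridded field: 24 forecast hours on a 0.25-degree global grid
ds = xr.Dataset(
    {
        "t2m": (
            ("time", "latitude", "longitude"),
            np.random.rand(24, 721, 1440).astype("float32"),
        )
    }
)

# Write it as Zarr, chunked so a reader can pull one forecast hour and a
# regional window without touching the rest of the store. Each chunk is
# compressed individually by zarr's default compressor, giving the balance
# of byte-range access and small footprint described above.
ds.chunk({"time": 1, "latitude": 256, "longitude": 256}).to_zarr(
    "t2m_example.zarr", mode="w"
)
```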

5. Don't compress what's already compressed

We end with, you guessed it, a third entry in the "no shit" arena, and fortunately this one shows up a lot less frequently than the first two, but it does bear mentioning. Unless you are absolutely scraping the bottom of the barrel trying to shave every last byte you can off your storage usage, it's generally a bad idea to zip (or gzip) GRIBs, NetCDFs (v4), cloud-optimized GeoTIFFs, and other already compressed data types. The amount of additional size reduction is typically minimal and, more importantly, you're making these files much more difficult to work with because now they have to be decompressed prior to reading. For chunked data or files with multiple data bands, you're also preventing code from reading specific byte ranges, which is a huge benefit of these formats. So do everyone a favor and leave the zipping for uncompressed data formats.
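To see what you'd be giving up, here's a minimal sketch of a byte-range read against a GRIB object stored as-is (no extra gzip wrapper) in NOAA's public bucket, using anonymous access; the bucket and key are illustrative of the documented layout at the time of writing. Wrap the same object in gzip and this kind of partial read is off the table.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Fetch only the first 4 KB of a GRIB file with an HTTP range request --
# possible precisely because the object isn't wrapped in an extra gzip layer
resp = s3.get_object(
    Bucket="noaa-gfs-bdp-pds",
    Key="gfs.20240101/00/atmos/gfs.t00z.pgrb2.0p25.f024",  # illustrative key
    Range="bytes=0-4095",
)
first_chunk = resp["Body"].read()
print(len(first_chunk), "bytes fetched without downloading the whole file")
```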


Most shops can benefit from reviewing the above 5 areas in their data storage periodically and it doesn't take long at all to spot issues and clean things up. Spending just a few hours each quarter on maintaining your storage can make a huge difference.

In future posts we'll get more down and dirty with the data formats themselves and look at how to leverage characteristics of the data to help select the optimal storage solution.  Sign up below to be notified when those posts drop.