Ever promise someone the moon? If so, it’s unlikely you knew the price tag in advance.
But promise someone a cloud, and you can calculate your costs down to a thousandth of a cent.
Amazon, Azure, and Google offer cloud data storage cost calculators that will make your head spin with their specificity: How many TiB of data do you need for streaming reads on Google BigQuery? Do you want ra3.4xlarge or ra3.xlplus instances on Amazon Redshift—and how many nodes?
While storing data in the cloud is often billed as being more cost-efficient than using on-premises data storage, in truth, reducing your cloud storage costs requires investigation, elimination, and optimization. Let’s take it step by step.
One of the simplest ways of reducing data storage costs is to store less data. Obvious, yes. Easy, no.
There’s a reason why you have all that data. Sometimes a good reason—like for operational, administrative, and business processes—but sometimes the reason isn’t all that great, such as “we haven’t gotten rid of it yet.”
In every data ecosystem, there’s outdated, redundant, and poor-quality data that you can, and should, get rid of. But how do you locate it?
The answer is automated data lineage: the data housekeeper’s faithful sidekick.
Imagine that you have a magic wand that helps with spring cleaning. This wand tells you where each item in your household was bought, when it was last used, what shape it’s in, if you have any other items that serve the same function, and so on.
This is what automated data lineage does for your data ecosystem. Let it loose, and within minutes you’ll have a complete mapping of your data flow: which data assets feed which reports, and which sources they trace back to. Comprehensive data lineage shows this both at a zoomed-out, source-system level and at a zoomed-in, column-to-column level. It can even get into the ETL processes and show exactly what transformations were performed on the data as it moved.
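To make the idea concrete, a lineage mapping can be modeled as a directed graph from each asset to the assets it feeds. This is only a toy sketch, not any particular tool's data model, and all the table and report names are hypothetical:

```python
# Minimal sketch of a lineage map as a directed graph.
# All table/report names are hypothetical examples.
LINEAGE = {
    "crm.customers":          ["warehouse.dim_customer"],
    "erp.orders":             ["warehouse.fact_orders"],
    "warehouse.dim_customer": ["reports.churn_dashboard"],
    "warehouse.fact_orders":  ["reports.churn_dashboard", "reports.revenue"],
}

def downstream(asset, graph):
    """Return every asset reachable from `asset` (all the reports it ultimately feeds)."""
    seen, stack = set(), [asset]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(downstream("erp.orders", LINEAGE)))
# ['reports.churn_dashboard', 'reports.revenue', 'warehouse.fact_orders']
```

Real lineage tools build this graph automatically by scanning your databases, ETL jobs, and BI tools; the payoff is exactly this kind of traversal, at ecosystem scale.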
Once you have the complete picture mapped out, you can move on to the second stage: elimination.
Take a close look at your data lineage, and ask the following questions:

- Is this data asset outdated, with no downstream report still using it?
- Is it redundant, duplicating another asset that serves the same function?
- Is its quality too poor for it to be trusted?
Answering “yes” points you to data that can be offloaded, directly reducing cloud-based storage costs. But offload wisely! Even if you’ve identified two data assets that are effectively duplicates, if they are both being used by downstream reports, you can’t just go and delete one of them before you line up its replacement.
Leveraging your data lineage for impact analysis empowers you to foresee the impact of changing a business process and take proper advance action to prevent issues.
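As a simple sketch of that impact analysis (hypothetical asset names, not any particular tool's API): before offloading an asset, check whether anything downstream still consumes it.

```python
# Hypothetical lineage edges: asset -> assets it feeds.
LINEAGE = {
    "staging.orders_v1": [],                   # old copy, nothing reads it
    "staging.orders_v2": ["reports.revenue"],  # live copy feeding a report
}

def safe_to_offload(asset, graph):
    """An asset is a safe offload candidate only if nothing downstream consumes it."""
    return not graph.get(asset)

print(safe_to_offload("staging.orders_v1", LINEAGE))  # True  -> offload candidate
print(safe_to_offload("staging.orders_v2", LINEAGE))  # False -> still feeding a report
```

In the duplicate-asset scenario above, `staging.orders_v1` can go today, while `staging.orders_v2` has to wait until its downstream report is repointed.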
Now that you’ve identified and eliminated data you don’t need (outdated, redundant, bad quality), it’s time to move on to data that you do need to keep around, but you could store more efficiently.
Take another look at your data lineage mapping, and ask the following questions about the data you are storing:

- How often is this data actually accessed?
- When it is accessed, how quickly does it need to be retrieved?
- Is it live operational data, or archive and backup data that can wait?
Cloud-based data storage providers usually offer a range of storage levels that vary by their accessibility. For example, Amazon S3 offers:

- Standard storage for frequently accessed data ($0.023 per GB)
- Standard – Infrequent Access storage for data that’s accessed infrequently but should be retrieved in milliseconds when needed ($0.0125 per GB)
- Glacier Flexible Retrieval storage for archive and backup data that can be retrieved in anywhere from 1 minute to 12 hours ($0.0036 per GB)
- Glacier Deep Archive storage for archive data that’s accessed only once or twice a year and takes 12 hours to retrieve ($0.00099 per GB)
Storing 1 TB of data in Standard storage would cost $23 a month. Storing the same 1 TB of data in Glacier Deep Archive storage would cost $0.99 a month! If your organization currently stuffs all of its data into standard cloud storage without differentiating based on access needs, optimizing your storage can significantly reduce your storage costs.
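The arithmetic behind that comparison is simple enough to sketch. The per-GB prices come from the tiers listed above; note that real bills also include request and retrieval charges, which this ignores:

```python
# S3 storage prices per GB-month, from the tiers listed above.
PRICE_PER_GB = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier_flexible": 0.0036,
    "glacier_deep_archive": 0.00099,
}

def monthly_cost(gb, tier):
    """Storage-only monthly cost; ignores request and retrieval charges."""
    return gb * PRICE_PER_GB[tier]

tb = 1000  # 1 TB expressed as 1,000 GB
print(round(monthly_cost(tb, "standard"), 2))              # 23.0
print(round(monthly_cost(tb, "glacier_deep_archive"), 2))  # 0.99
```

Multiply that gap across hundreds of terabytes of archive data sitting in Standard storage, and the savings stop looking like rounding errors.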
Data lineage can reduce your data storage costs by showing you both:

- which data you can eliminate entirely, and
- which data you can move from standard storage to a cheaper, less accessible tier.
But that's not all! While less data reduces cloud storage costs, it can also reduce compute costs. Cloud-based data warehouses like Snowflake and Amazon Redshift usually have a pay-per-usage model on compute, charging for the time it takes to run queries across the datasets. The more data you include in your query, the longer it will take to run, and the higher your charge will be.
Reducing the amount of data you’re storing (or keeping in standard storage) will usually mean less data included in your queries, indirectly reducing compute costs. But data lineage also provides you with a direct way to decrease your compute costs: restricting exploration queries.
Exploration queries tend to use a lot of computing power. With a clear data lineage map, your data team can see exactly where the relevant data is, enabling them to run much more targeted queries across the platform, and eliminating or reducing the need for general exploration queries.
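For instance (a sketch with hypothetical table and column names), column-level lineage can tell an analyst exactly which source columns feed a given report, so they can query just those columns instead of exploring whole tables:

```python
# Hypothetical column-level lineage: report column -> (source table, source column).
COLUMN_LINEAGE = {
    "reports.revenue.total":  ("warehouse.fact_orders", "amount"),
    "reports.revenue.month":  ("warehouse.fact_orders", "order_date"),
    "reports.revenue.region": ("warehouse.dim_customer", "region"),
}

def targeted_columns(report_prefix, lineage):
    """Group the source columns feeding a report by table, for targeted queries."""
    by_table = {}
    for col, (table, source_col) in lineage.items():
        if col.startswith(report_prefix):
            by_table.setdefault(table, []).append(source_col)
    return by_table

print(targeted_columns("reports.revenue", COLUMN_LINEAGE))
```

Instead of a `SELECT *` sweep across the warehouse, the analyst now knows to scan just two tables and three columns, which is exactly the kind of query a pay-per-usage warehouse bills kindly for.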
If cloud data storage costs are getting you down, it’s time to turn the tables and get them down instead. Just pull out your automated data lineage magic wand and follow these steps: Investigate! Eliminate! Optimize!
See those data storage costs shrink!? Okay, it may take a wee bit more work than that. But when your enterprise gets its next, lower bill from its cloud data services provider, it will still feel magical.
Want to learn more? Request a demo to get started with Cloudera Octopai Data Lineage—an automated data lineage solution that can help you implement these steps and reduce your cloud storage costs today.