For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges with Hive tables because of the format’s antiquated directory-based design. Common issues include constrained schema evolution, static partitioning of data, and long planning times caused by S3 directory listings.
Apache Iceberg is a modern table format that not only addresses these problems but also adds features such as time travel, partition evolution, table versioning, schema evolution, strong consistency guarantees, an object store file layout (the ability to distribute the files of one logical partition across many prefixes to avoid object store throttling), hidden partitioning (users don’t have to be intimately aware of how a table is partitioned), and more. The Apache Iceberg table format is therefore poised to replace the traditional Hive table format in the coming years.
However, with 25 million terabytes of data already stored in the Hive table format, migrating existing Hive tables to the Iceberg table format is necessary to realize those performance and cost benefits. Depending on the size and usage patterns of the data, several different strategies can be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases. While these instructions are carried out for Cloudera Data Platform (CDP), Cloudera Data Engineering, and Cloudera Data Warehouse, they can easily be extrapolated to other services and other use cases as well.
There are a few scenarios that you might encounter. One or more of these use cases might fit your workload, and you may be able to mix and match the potential solutions provided to suit your needs. They are meant as a general guide. In all the use cases we are trying to migrate a table named “events.”
You have the ability to stop your clients from writing to the respective Hive table for the duration of the migration. This is ideal because it may mean you don’t have to modify any of your client code. Sometimes it is also the only practical choice: if you have hundreds of clients that can potentially write to a table, it can be much easier to simply stop all those jobs than to let them continue running during the migration process.
In-place table migration
Iceberg’s Spark extensions provide a built-in procedure called “migrate” to convert an existing table from the Hive table format to the Iceberg table format. They also provide a “snapshot” procedure that creates an Iceberg table with a different name on top of the same underlying data. You could first create a snapshot table, run sanity checks on it, and ensure that everything is in order.
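For example, a snapshot table might be created like this (a minimal Spark SQL sketch; the catalog name “spark_catalog,” the database “db,” and the snapshot table name are illustrative):

  -- Create an Iceberg table over the same data files as the Hive table "db.events",
  -- without touching the original table, then run your sanity checks against it.
  CALL spark_catalog.system.snapshot('db.events', 'db.iceberg_events_snapshot');
  SELECT COUNT(*) FROM db.iceberg_events_snapshot;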
Once you are satisfied, you can drop the snapshot table and proceed with the migration using the migrate procedure. Keep in mind that the migrate procedure creates a backup table named “events__BACKUP__.” As of this writing, the “__BACKUP__” suffix is hardcoded; there is an effort underway to let the user pass a custom backup suffix in the future.
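A sketch of that step, with the same illustrative names as above:

  -- Drop the snapshot table once validation is done (check how DROP TABLE behaves
  -- in your Iceberg/Spark version so that shared data files are not purged).
  DROP TABLE db.iceberg_events_snapshot;

  -- Convert "db.events" to Iceberg in place; the original Hive table
  -- is kept as "events__BACKUP__".
  CALL spark_catalog.system.migrate('db.events');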
Keep in mind that neither the migrate nor the snapshot procedure modifies the underlying data: they perform an in-place migration. They simply read the underlying data (not even a full read; they only read the Parquet headers) and create the corresponding Iceberg metadata files. Since the underlying data files are not changed, you may not be able to take full advantage of the benefits offered by Iceberg right away. You can optimize your table now or at a later stage using the “rewrite_data_files” procedure, which will be discussed in a later blog. Now let’s discuss the pros and cons of this approach.
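As a preview, a compaction run might look something like this (table name illustrative):

  -- Rewrite small or poorly laid-out data files into larger, optimized ones.
  CALL spark_catalog.system.rewrite_data_files(table => 'db.events');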
PROS:
- The migration is metadata-only: the existing data files are read (just their Parquet headers) but never rewritten, so it is fast and inexpensive even for very large tables.
- There is no data duplication; the Iceberg table points at the same underlying files, and the original table is preserved as a backup.

CONS:
- Because the data files are left as they are, you may not see the full benefits of Iceberg until you compact the table with “rewrite_data_files.”
- Writers must be stopped for the duration of the migration.
Note: There is also a SparkAction in the Java API.
Cloudera implemented an easy way to do the migration in Hive. All you have to do is alter the table properties to set the storage handler to “HiveIcebergStorageHandler.”
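Assuming the table is named “events,” the conversion might look like the following (a sketch based on Cloudera’s Hive Iceberg integration; the fully qualified handler class is shown for clarity):

  ALTER TABLE events
  SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');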
The pros and cons of this approach are essentially the same as those of the Spark-based in-place migration above. The migration is done in place and the underlying data files are not changed; Hive simply creates Iceberg’s metadata files for the exact same table.
This solution, creating a brand-new Iceberg table and copying the data into it with a CREATE TABLE AS SELECT (CTAS) statement, is the most generic one: it can potentially be used with any processing engine (Spark/Hive/Impala) that supports SQL-like syntax.
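A minimal CTAS sketch in Spark SQL, assuming a database named “db” and an illustrative S3 location:

  -- Create a new Iceberg table with an explicit location and copy the data into it.
  -- Specifying LOCATION up front avoids data movement if the table is renamed later.
  CREATE TABLE db.iceberg_events
  USING iceberg
  LOCATION 's3://bucket/warehouse/iceberg_events'
  AS SELECT * FROM db.events;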
You can run basic sanity checks on the data to see if the newly created table is sound.
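For example (the event_ts column is hypothetical; use whatever columns make sense for your data):

  -- Compare row counts between the original table and the new table.
  SELECT COUNT(*) FROM db.events;
  SELECT COUNT(*) FROM db.iceberg_events;

  -- Spot-check some recent rows.
  SELECT * FROM db.iceberg_events ORDER BY event_ts DESC LIMIT 10;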
Once you are satisfied with your sanity checks, you can rename your “events” table to “backup_events” and then rename “iceberg_events” to “events.” Keep in mind that in some cases the rename operation can trigger a directory rename of the underlying data directory. If that happens and your underlying data store is an object store like S3, it will trigger a full copy of your data and could be very expensive. If the location clause was specified when the Iceberg table was created, then renaming the Iceberg table will not cause the underlying data files to move; the name changes only in the Hive metastore. The same applies to Hive tables as well: if your original Hive table was not created with the location clause specified, then the rename to backup will trigger a directory rename, and if your filesystem is object store based, it might be best to drop the original table altogether rather than rename it. Given the nuances around table renames, it is critical to test with dummy tables in your system and confirm that you see the desired behavior before performing these operations on critical tables.
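The swap itself is just a pair of renames, for example:

  -- Keep the original table around as a backup.
  ALTER TABLE db.events RENAME TO db.backup_events;

  -- Promote the new Iceberg table to the original name.
  ALTER TABLE db.iceberg_events RENAME TO db.events;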
You can drop your “backup_events” if you wish.
Your clients can now resume their read/write operations on “events,” and they don’t even need to know that the underlying table format has changed. Now let’s discuss the pros and cons of this approach.
PROS:
- It works with any engine that supports SQL-like syntax, not just Spark.
- The data is rewritten into new files owned by the Iceberg table, so you can take full advantage of Iceberg’s features immediately.
- The original table is kept around as a backup until you decide to drop it.

CONS:
- The entire dataset is read and rewritten, which can be slow and expensive for large tables.
- Clients must stop writing to the original table during the migration, or the two copies will diverge.
- The table renames can have costly side effects on object stores, as described above.
You don’t have the luxury of a long downtime window for your migration. You want to let your clients or jobs continue writing data to the table. This requires some planning and testing, but it is possible with some caveats. Here is one way you can do it with Spark; you can potentially extrapolate the ideas presented here to other engines.
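The details depend on your workload, but one illustrative pattern (not the only one) is to bulk-copy the historical data while writers keep running, then catch up on the most recent data during a short pause before swapping the table names. A rough sketch, assuming an event_ts column and the illustrative names used earlier:

  -- 1. Bulk-copy everything older than a chosen cutoff while writers keep running.
  CREATE TABLE db.iceberg_events
  USING iceberg
  LOCATION 's3://bucket/warehouse/iceberg_events'
  AS SELECT * FROM db.events WHERE event_ts < TIMESTAMP '2023-01-01 00:00:00';

  -- 2. Briefly pause the writers, then copy the remaining tail.
  INSERT INTO db.iceberg_events
  SELECT * FROM db.events WHERE event_ts >= TIMESTAMP '2023-01-01 00:00:00';

  -- 3. Swap the table names as described above and point clients back at "events".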
Any large migration is tough and has to be thought through carefully. Thankfully, as discussed above, there are multiple strategies at our disposal to do it effectively depending on your use case. If you can stop all your jobs while the migration is happening, it is relatively straightforward; if you want to migrate with minimal to no downtime, that requires some planning and careful thinking about your data layout. You can use a combination of the above approaches to best suit your needs.
To learn more: