Using the Parquet File Format with Impala Tables
Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the "row group"). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
See the following sections for details about using Parquet data files with Impala tables:
Creating Parquet Tables in Impala
To create a table named PARQUET_TABLE that uses the Parquet format, you would use a command like the following, substituting your own table name, column names, and data types:
[impala-host:21000] > create table parquet_table_name (x INT, y STRING) STORED AS PARQUETFILE;
Or, to clone the column names and data types of an existing table:
[impala-host:21000] > create table parquet_table_name LIKE other_table_name STORED AS PARQUETFILE;
Once you have created a table, to insert data into that table, use a command similar to the following, again with your own table names:
[impala-host:21000] > insert overwrite table parquet_table_name select * from other_table_name;
If the Parquet table has a different number of columns or different column names than the other table, specify the names of columns from the other table rather than * in the SELECT statement.
Loading Data into Parquet Tables
Choose from the following techniques for loading data into Parquet tables, depending on whether the original data is already in an Impala table, or exists as raw data files outside Impala.
If you already have data in an Impala table, perhaps in a different file format or partitioning scheme, you can transfer the data to a Parquet table using the INSERT...SELECT syntax. You can convert, filter, repartition, and do other things to the data as part of this same INSERT statement. See Snappy and GZip Compression for Parquet Data Files for some examples showing how to insert data into Parquet tables.
Avoid the INSERT...VALUES syntax for Parquet tables, because INSERT...VALUES produces a separate tiny data file for each INSERT...VALUES statement, and the strength of Parquet is in its handling of data (compressing, parallelizing, and so on) in 1GB chunks.
If you have one or more Parquet data files produced outside of Impala, you can quickly make the data queryable through Impala by one of the following methods:
- The LOAD DATA statement moves a single data file or a directory full of data files into the data directory for an Impala table. It does no validation or conversion of the data. The original data files must be somewhere in HDFS, not the local filesystem.
- The CREATE TABLE statement with the LOCATION clause creates a table where the data continues to reside outside the Impala data directory. The original data files must be somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived and reused by other applications, you can use the CREATE EXTERNAL TABLE syntax so that the data files are not deleted by an Impala DROP TABLE statement.
- If the Parquet table already exists, you can copy Parquet data files directly into it, then use the REFRESH statement to make Impala recognize the newly added data. Remember to preserve the 1GB block size of the Parquet data files by using the hdfs distcp -pb command rather than a -put or -cp operation on the Parquet files. See Example of Copying Parquet Data Files for an example of this kind of operation.
If the data exists outside Impala and is in some other format, combine both of the preceding techniques. First, use a LOAD DATA or CREATE EXTERNAL TABLE ... LOCATION statement to bring the data into an Impala table that uses the appropriate file format. Then, use an INSERT...SELECT statement to copy the data to the Parquet table, converting to Parquet format as part of the process.
Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered until it reaches 1GB in size, then that chunk of data is organized and compressed in memory before being written out. The memory consumption can be larger when inserting data into partitioned Parquet tables, because a separate data file is written for each combination of partition key column values, potentially requiring several 1GB chunks to be manipulated in memory at once.
When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the insert operation, or break up the load operation into several INSERT statements, or both.
Query Performance for Impala Parquet Tables
Query performance for Parquet tables depends on the number of columns needed to process the SELECT list and WHERE clauses of the query, the way data is divided into 1GB data files ("row groups"), the reduction in I/O by reading the data for each column in compressed format, which data files can be skipped (for partitioned tables), and the CPU overhead of decompressing the data for each column.
select avg(income) from census_data where state = 'CA';The query processes only 2 columns out of a large number of total columns. If the table is partitioned by the STATE column, it is even more efficient because the query only has to read and decode 1 column from each data file, and it can read only the data files in the partition directory for the state 'CA', skipping the data files for all the other states, which will be physically located in other directories.
select * from census_data;Impala would have to read the entire contents of each 1GB data file, and decompress the contents of each column for each row group, negating the I/O optimizations of the column-oriented format. This query might still be faster for a Parquet table than a table with some other file format, but it does not take advantage of the unique strengths of Parquet data files.
Partitioning for Parquet Tables
As explained in Partitioning, partitioning is an important performance technique for Impala generally. This section explains some of the performance considerations for partitioned Parquet tables.
The Parquet file format is ideal for tables containing many columns, where most queries only refer to a small subset of the columns. As explained in How Parquet Data Files Are Organized, the physical layout of Parquet data files lets Impala read only a small fraction of the data for many queries. The performance benefits of this approach are amplified when you use Parquet tables in combination with partitioning. Impala can skip the data files for certain partitions entirely, based on the comparisons in the WHERE clause that refer to the partition key columns. For example, queries on partitioned tables often analyze data for time intervals based on columns such as YEAR, MONTH, and/or DAY, or for geographic regions. Remember that Parquet data files use a 1GB block size, so when deciding how finely to partition the data, try to find a granularity where each partition contains 1GB or more of data, rather than creating a large number of smaller files split among many partitions.
Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala node could potentially be writing a separate data file to HDFS for each combination of different values for the partition key columns. The large number of simultaneous open files could exceed the HDFS "transceivers" limit. To avoid exceeding this limit, consider the following techniques:
- Load different subsets of data using separate INSERT statements with specific values for the PARTITION clause, such as PARTITION (year=2010).
- Increase the "transceivers" value for HDFS, sometimes spelled "xcievers" (sic). The property value in the hdfs-site.xml configuration file is dfs.datanode.max.xcievers. For example, if you were loading 12 years of data partitioned by year, month, and day, even a value of 4096 might not be high enough. This blog post explores the considerations for setting this value higher or lower, using HBase examples for illustration.
- Collect column statistics on the source table from which data is being copied, so that the Impala query can estimate the number of different values in the partition key columns and distribute the work accordingly.
Snappy and GZip Compression for Parquet Data Files
When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the PARQUET_COMPRESSION_CODEC query option. The allowed values for this query option are snappy (the default), gzip, and none. The option value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option setting, not just queries involving Parquet tables.
Example of Parquet Table with Snappy Compression
By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of fast compression and decompression makes it a good choice for many data sets. To ensure Snappy compression is used, for example after experimenting with other compression codecs, set the PARQUET_COMPRESSION_CODEC query option to snappy before inserting the data:
[localhost:21000] > create database parquet_compression; [localhost:21000] > use parquet_compression; [localhost:21000] > create table parquet_snappy like raw_text_data; [localhost:21000] > set PARQUET_COMPRESSION_CODEC=snappy; [localhost:21000] > insert into parquet_snappy select * from raw_text_data; Inserted 1000000000 rows in 181.98s
Example of Parquet Table with GZip Compression
If you need more intensive compression (at the expense of more CPU cycles for uncompressing during queries), set the PARQUET_COMPRESSION_CODEC query option to gzip before inserting the data:
[localhost:21000] > create table parquet_gzip like raw_text_data; [localhost:21000] > set PARQUET_COMPRESSION_CODEC=gzip; [localhost:21000] > insert into parquet_gzip select * from raw_text_data; Inserted 1000000000 rows in 1418.24s
Example of Uncompressed Parquet Table
If your data compresses very poorly, or you want to avoid the CPU overhead of compression and decompression entirely, set the PARQUET_COMPRESSION_CODEC query option to none before inserting the data:
[localhost:21000] > create table parquet_none like raw_text_data; [localhost:21000] > insert into parquet_none select * from raw_text_data; Inserted 1000000000 rows in 146.90s
Examples of Sizes and Speeds for Compressed Parquet Tables
Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your own. The actual compression ratios, and relative insert and query speeds, will vary depending on the characteristics of the actual data.
In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so, while switching from Snappy compression to no compression expands the data also by about 40%:
$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db 23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy 13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip 32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none
Because Parquet data files are typically sized at about 1GB, each directory will have a different number of data files and the row groups will be arranged differently.
At the same time, the less agressive the compression, the faster the data can be decompressed. In this case using a table with a billion rows, a query that evaluates all the values for a particular column runs faster with no compression than with Snappy compression, and faster with Snappy compression than with Gzip compression. Query performance depends on several other factors, so as always, run your own benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and speed of insert and query operations.
[localhost:21000] > desc parquet_snappy; Query finished, fetching results ... +-----------+---------+---------+ | name | type | comment | +-----------+---------+---------+ | id | int | | | val | int | | | zfill | string | | | name | string | | | assertion | boolean | | +-----------+---------+---------+ Returned 5 row(s) in 0.14s [localhost:21000] > select avg(val) from parquet_snappy; Query finished, fetching results ... +-----------------+ | _c0 | +-----------------+ | 250000.93577915 | +-----------------+ Returned 1 row(s) in 4.29s [localhost:21000] > select avg(val) from parquet_gzip; Query finished, fetching results ... +-----------------+ | _c0 | +-----------------+ | 250000.93577915 | +-----------------+ Returned 1 row(s) in 6.97s [localhost:21000] > select avg(val) from parquet_none; Query finished, fetching results ... +-----------------+ | _c0 | +-----------------+ | 250000.93577915 | +-----------------+ Returned 1 row(s) in 3.67s
Example of Copying Parquet Data Files
Here is a final example, to illustrate how the data files using the various compression codecs are all compatible with each other for read operations. The metadata about the compression format is written into each data file, and can be decoded during queries regardless of the PARQUET_COMPRESSION_CODEC setting in effect at the time. In this example, we copy data files from the PARQUET_SNAPPY, PARQUET_GZIP, and PARQUET_NONE tables used in the previous examples, each containing 1 billion rows, all to the data directory of a new table PARQUET_EVERYTHING. A couple of sample queries demonstrate that the new table now contains 3 billion rows featuring a variety of compression codecs for the data files.
First, we create the table in Impala so that there is a destination directory in HDFS to put the data files:
[localhost:21000] > create table parquet_everything like parquet_snappy; Query: create table parquet_everything like parquet_snappy
Then in the shell, we copy the relevant data files into the data directory for this new table. Rather than using hdfs dfs -cp as with typical files, we use hdfs distcp -pb to ensure that the special 1GB block size of the Parquet data files is preserved.
$ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \ /user/hive/warehouse/parquet_compression.db/parquet_everything ...MapReduce output... $ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \ /user/hive/warehouse/parquet_compression.db/parquet_everything ...MapReduce output... $ hdfs distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \ /user/hive/warehouse/parquet_compression.db/parquet_everything ...MapReduce output...
Back in the impala-shell interpreter, we use the REFRESH statement to alert the Impala server to the new data files for this table, then we can run queries demonstrating that the data files represent 3 billion rows, and the values for one of the numeric columns match what was in the original smaller tables:
[localhost:21000] > refresh parquet_everything; Query finished, fetching results ... Returned 0 row(s) in 0.32s [localhost:21000] > select count(*) from parquet_everything; Query finished, fetching results ... +------------+ | _c0 | +------------+ | 3000000000 | +------------+ Returned 1 row(s) in 8.18s [localhost:21000] > select avg(val) from parquet_everything; Query finished, fetching results ... +-----------------+ | _c0 | +-----------------+ | 250000.93577915 | +-----------------+ Returned 1 row(s) in 13.35s
Exchanging Parquet Data Files with Other Hadoop Components
Starting in CDH 4.5, you can read and write Parquet data files from Hive, Pig, and MapReduce. See the CDH 4 Installation Guide for details.
Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now that Parquet support is available for Hive in CDH 4.5, reusing existing Impala Parquet data files in Hive requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or higher:
ALTER TABLE table_name SET FILEFORMAT PARQUETFILE;
If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
ALTER TABLE table_name SET SERDE 'parquet.hive.serde.ParquetHiveSerDe'; ALTER TABLE table_name SET FILEFORMAT INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat" OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or nested types such as maps or arrays. If any column of a table uses such an unsupported data type, Impala cannot access that table.
If you copy Parquet data files between nodes, or even between different directories on the same node, make sure to preserve the block size by using the command hadoop distcp -pb. To verify that the block size was preserved, issue the command hdfs fsck -blocks HDFS_path_of_impala_table_dir and check that the average block size is at or near 1GB. (The hadoop distcp operation typically leaves some directories behind, with names matching _distcp_logs_*, that you can delete from the destination directory afterward.) See the Hadoop DiscCP Guide for details.
How Parquet Data Files Are Organized
Although Parquet is a column-oriented file format, do not expect to find one data file for each column. Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are always available on the same node for processing. What Parquet does is to set an HDFS block size and a maximum data file size of 1GB, to ensure that I/O and network transfer requests apply to large batches of data.
Within that gigabyte of space, the data for a set of rows is rearranged so that all the values from the first column are organized in one contiguous block, then all the values from the second column, and so on. Putting the values from the same column next to each other lets Impala use effective compression techniques on the values in that column.
The Parquet data files have an HDFS block size of 1GB, the same as the maximum Parquet data file size, to ensure that each data file is represented by a single HDFS block, and the entire file can be processed on a single node without requiring any remote reads. If the block size is reset to a lower value during a file copy, you will see lower performance for queries involving those files, and the PROFILE statement will reveal that some I/O is being done suboptimally, through remote reads. See Example of Copying Parquet Data Files for an example showing how to preserve the block size when copying Parquet data files.
When Impala retrieves or tests the data for a particular column, it opens all the data files, but only reads the portion of each file where the values for that column are stored consecutively. If other columns are named in the SELECT list or WHERE clauses, the data for all columns in the same row is available within that same data file.
If an INSERT statement brings in less than 1GB of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL job to use multiple INSERT statements, try to keep the volume of data for each INSERT statement to approximately 1GB, or a multiple of 1GB.
RLE and Dictionary Encoding for Parquet Data Files
Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary encoding, based on analysis of the actual data values. Once the data values are encoded in a compact form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO compression, but currently Impala does not support LZO-compressed Parquet files.
RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files. These automatic optimizations can save you time and planning that are normally needed for a traditional data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations for longer string values.
Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows all contain the same value for a country code, those repeating values can be represented by the value followed by a count of how many times it appears consecutively.
Dictionary encoding takes the different values present in a column, and represents each one in compact 2-byte form rather than the original value, which could be several bytes. (Additional compression is applied to the compacted values, for extra space savings.) This type of encoding applies when the number of different values for a column is less than 2**16 (16,384). It does not apply to columns of data type BOOLEAN, which are already very short. TIMESTAMP columns sometimes have a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values. The 2**16 limit on different values within a column is reset for each data file, so if several different data files each contained 10,000 different city names, the city name column in each data file could still be condensed using dictionary encoding.
|<< Previous: Using Text Data Files with Impala Tables||Next: Using the Avro File Format with Impala Tables >>|