Using Text Data Files with Impala Tables
Cloudera Impala supports using text files as the storage format for input and output. Text files are a convenient format to use for interchange with other applications or scripts that produce or read delimited text files, such as CSV or TSV with commas or tabs for delimiters. Text files are also very flexible in their column definitions. For example, a text file could have more fields than the Impala table, and those extra fields are ignored during queries; or it could have fewer fields than the Impala table, and those missing fields are treated as NULL values in queries. You could have fields that were treated as numbers or timestamps in a table, then use ALTER TABLE ... REPLACE COLUMNS to switch them to strings, or the reverse.
Query Performance for Impala Text Tables
Data stored in text format is relatively bulky, and not as efficient to query as binary formats such as Parquet. You typically use text tables with Impala if that is the format you receive the data and you do not have control over that process, or if you are a relatively new Hadoop user and not familiar with techniques to generate files in other formats. (Because the default format for CREATE TABLE is text, you might create your first Impala tables as text without giving performance much thought.) Either way, look for opportunities to use more efficient file formats for the tables used in your most performance-critical queries.
For frequently queried data, you might load the original text data files into one Impala table, then use an INSERT statement to transfer the data to another table that uses the Parquet file format; the data is converted automatically as it is stored in the destination table.
For more compact data, consider using LZO compression for the text files. LZO is the only compression codec that Impala supports for text data, because the "splittable" nature of LZO data files lets different nodes work on different parts of the same file in parallel. See Using LZO-Compressed Text Files for details.
Creating Text Tables
To create a table using text data files:
If the exact format of the text data files (such as the delimiter character) is not significant, use the CREATE TABLE statement with no extra clauses at the end to create a text-format table. For example:
create table my_table(id int, s string, n int, t timestamp, b boolean);
The data files created by any INSERT statements will use the Ctrl-A character (hex 01) as a separator between each column value.
A common use case is to import existing text files into an Impala table. The syntax is more verbose, with several clauses at the end; the significant part is the FIELDS TERMINATED BY clause. For example:
create table csv(id int, s string, n int, t timestamp, b boolean) stored as textfile fields terminated by ','; create table tsv(id int, s string, n int, t timestamp, b boolean) stored as textfile fields terminated by '\t'; create table pipe_separated(id int, s string, n int, t timestamp, b boolean) stored as textfile fields terminated by '|';
You can create tables with specific separator characters to import text files in familiar formats such as CSV, TSV, or pipe-separated. You can also use these tables to produce output data files, by copying data into them through the INSERT ... SELECT syntax and then extracting the data files from the Impala data directory.
Do not surround string values with quotation marks in text data files that you construct. If you need to include the separator character inside a field value, for example to put a string value with a comma inside a CSV-format data file, specify an escape character on the CREATE TABLE statement with the ESCAPED BY clause, and insert that character immediately before any separator characters that need escaping.
Issue a DESCRIBE FORMATTED table_name statement to see the details of how each table is represented internally in Impala.
Data Files for Text Tables
When Impala queries a table with data in text format, it consults all the data files in the data directory for that table. Impala ignores any hidden files, that is, files whose names start with a dot. Otherwise, the file names are not significant.
Filenames for data produced through Impala INSERT statements are given unique names to avoid filename conflicts.
An INSERT ... SELECT statement produces one data file from each node that processes the SELECT part of the statement. An INSERT ... VALUES statement produces a separate data file for each statement; because Impala is more efficient querying a small number of huge files than a large number of tiny files, the INSERT ... VALUES syntax is not recommended for loading a substantial volume of data. If you find yourself with a table that is inefficient due to too many small data files, reorganize the data into a few large files by doing INSERT ... SELECT to transfer the data to a new table.
Loading Data into Impala Text Tables
To load an existing text file into an Impala text table, use the LOAD DATA statement and specify the path of the file in HDFS. That file is moved into the appropriate Impala data directory.
To load multiple existing text files into an Impala text table, use the LOAD DATA statement and specify the HDFS path of the directory containing the files. All non-hidden files are moved into the appropriate Impala data directory.
If a text file has fewer fields than the columns in the corresponding Impala table, all the corresponding columns are set to NULL when the data in that file is read by an Impala query.
If a text file has more fields than the columns in the corresponding Impala table, the extra fields are ignored when the data in that file is read by an Impala query.
You can also use manual HDFS operations such as hdfs dfs -put or hdfs dfs -cp to put data files in the data directory for an Impala table. When you copy or move new data files into the HDFS directory for the Impala table, issue a REFRESH table_name statement in impala-shell before issuing the next query against that table, to make Impala recognize the newly added files.
Using LZO-Compressed Text Files
Cloudera Impala supports using text data files that employ LZO compression. Cloudera recommends compressing text data files when practical. Impala queries are usually I/O-bound; reducing the amount of data read from disk typically speeds up a query, despite the extra CPU work to uncompress the data in memory.
Impala can work with LZO-compressed text files but not GZip-compressed text. LZO-compressed files are "splittable", meaning that different portions of a file can be uncompressed and processed independently by different nodes. GZip-compressed files are not splittable, making them unsuitable for Impala-style distributed queries.
Because Impala can query LZO-compressed files but currently cannot write them, you use Hive to do the initial CREATE TABLE and load the data, then switch back to Impala to run queries. For instructions on setting up LZO compression for Hive CREATE TABLE and INSERT statements, see the LZO page on the Hive wiki. Once you have created an LZO text table, you can also manually add LZO-compressed text files to it, produced by the lzop command or similar method.
Preparing to Use LZO-Compressed Text Files
Before using LZO-compressed tables in Impala, do the following one-time setup for each machine in the cluster. Install the necessary packages using either the Cloudera public repository, a private repository you establish, or by using packages. You must do these steps manually, whether or not the cluster is managed by the Cloudera Manager product.
Prepare your systems to work with LZO using Cloudera repositories:
On systems managed by Cloudera Manager, using parcels:
See the setup instructions for the LZO parcel in the Cloudera Manager documentation.
On systems managed by Cloudera Manager, using pacakges, or not managed by Cloudera Manager:
Download and install the appropriate file to each machine on which you intend to use LZO with Impala. These files all come from the Cloudera GPL extras download site. Install the:
Configure Impala to use LZO:
Use one of the following sets of commands to refresh your package management system's repository information, install the base LZO support for Hadoop, and install the LZO support for Impala.Note: The name of the Hadoop LZO package changes between CDH 4 and CDH 5. In CDH 4, the package name is hadoop-lzo-cdh4. In CDH 5, the package name is hadoop-lzo. Use the appropriate package name depending on the level of CDH in your cluster.
For RHEL/CentOS systems:
$ sudo yum update $ sudo yum install hadoop-lzo-cdh4 # For clusters running CDH 4. $ sudo yum install hadoop-lzo # For clusters running CDH 5 or higher. $ sudo yum install impala-lzo
For SUSE systems:
$ sudo apt-get update $ sudo zypper install hadoop-lzo-cdh4 # For clusters running CDH 4. $ sudo zypper install hadoop-lzo # For clusters running CDH 5 or higher. $ sudo zypper install impala-lzo
For Debian/Ubuntu systems:
$ sudo zypper update $ sudo apt-get install hadoop-lzo-cdh4 # For clusters running CDH 4. $ sudo apt-get install hadoop-lzo # For clusters running CDH 5 or higher. $ sudo apt-get install impala-lzoNote:
The level of the impala-lzo-cdh4 package is closely tied to the version of Impala you use. Any time you upgrade Impala, re-do the installation command for impala-lzo on each applicable machine to make sure you have the appropriate version of that package.
For core-site.xml on the client
and server (that is, in the configuration directories
for both Impala and Hadoop), append
com.hadoop.compression.lzo.LzopCodec to the comma-separated
list of codecs.
<property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value> </property>Note:
If this is the first time you have edited the Hadoop core-site.xml file, note that the /etc/hadoop/conf directory is typically a symbolic link, so the canonical core-site.xml might reside in a different directory:
$ ls -l /etc/hadoop total 8 lrwxrwxrwx. 1 root root 29 Feb 26 2013 conf -> /etc/alternatives/hadoop-conf lrwxrwxrwx. 1 root root 10 Feb 26 2013 conf.dist -> conf.empty drwxr-xr-x. 2 root root 4096 Feb 26 2013 conf.empty drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo
If the io.compression.codecs property is missing from core-site.xml, only add com.hadoop.compression.lzo.LzopCodec to the new property value, not all the names from the preceding example.
- Restart the MapReduce and Impala services.
Creating LZO Compressed Text Tables
A table containing LZO-compressed text files must be created in Hive with the following storage clause:
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Also, certain Hive settings need to be in effect. For example:
hive> SET mapreduce.output.fileoutputformat.compress=true; hive> SET hive.exec.compress.output=true; hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; hive> CREATE TABLE lzo_t (s string) STORED AS > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; hive> INSERT INTO TABLE lzo_t SELECT col1, col2 FROM uncompressed_text_table;
Once you have created LZO-compressed text tables, you can convert data stored in other tables (regardless of file format) by using the INSERT ... SELECT statement in Hive.
Files in an LZO-compressed table must use the .lzo extension. Examine the files in the HDFS data directory after doing the INSERT in Hive, to make sure the files have the right extension. If the required settings are not in place, you end up with regular uncompressed files, and Impala cannot access the table because it finds data files with the wrong (uncompressed) format.
After loading data into an LZO-compressed text table, index the files so that they can be split. You index the files by running a Java class, com.hadoop.compression.lzo.DistributedLzoIndexer, through the Linux command line. This Java class is included in the hadoop-lzo package.
Run the indexer using a command like the following:
$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/
Indexed files have the same name as the file they index, with the .index extension. If the data files are not indexed, Impala queries still work, but the queries read the data from remote DataNodes, which is very inefficient.
Once the LZO-compressed tables are created, and data is loaded and indexed, you can query them through Impala. As always, the first time you start impala-shell after creating a table in Hive, issue an INVALIDATE METADATA statement so that Impala recognizes the new table. (In Impala 1.2 and higher, you only have to run INVALIDATE METADATA on one node, rather than on all the Impala nodes.)
|<< Previous: How Impala Works with Hadoop File Formats||Next: Using the Parquet File Format with Impala Tables >>|