Impala Upgrade Considerations
- Converting Legacy UDFs During Upgrade to CDH 5.12 or Higher
- Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher
- Change Impala catalogd Heap when Upgrading from CDH 5.6 or Lower
- List of Reserved Words Updated in CDH 6.0 / Impala 3.0
- Decimal V2 Used by Default in CDH 6.0 / Impala 3.0
- Behavior of Column Aliases Changed in CDH 6.0 / Impala 3.0
- Default PARQUET_ARRAY_RESOLUTION Changed in CDH 6.0 / Impala 3.0
- Enable Clustering Hint for Inserts
- Deprecated Query Options Removed in CDH 6.0 / Impala 3.0
- refresh_after_connect Impala Shell Option Removed in CDH 6.0 / Impala 3.0
- Return Type Changed for EXTRACT and DATE_PART Functions in CDH 6.0 / Impala 3.0
- Impala Roles with SELECT or INSERT Privilege Receive REFRESH Privilege During the Upgrade to CDH 5.16
- Default Setting Changes
Converting Legacy UDFs During Upgrade to CDH 5.12 or Higher
In CDH 5.7 / Impala 2.5 and higher, the CREATE FUNCTION Statement is available for creating Java-based UDFs. UDFs created with the new syntax persist across Impala restarts, and are more compatible with Hive UDFs. Because the replication features in CDH 5.12 and higher only work with the new-style syntax, convert any older Java UDFs to use the new syntax at the same time you upgrade to CDH 5.12 or higher.
Follow these steps to convert old-style Java UDFs to the new persistent kind:
- Use SHOW FUNCTIONS to identify all UDFs and UDAs.
- For each function, use SHOW CREATE FUNCTION and save the statement in a script file.
- For Java UDFs, change the output of SHOW CREATE FUNCTION to use the new CREATE FUNCTION syntax (without argument types), which makes the UDF persistent.
- For each function, drop it and re-create it, using the new CREATE FUNCTION syntax for all Java UDFs.
Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher
In CDH 5.13 / Impala 2.10 and higher, the handling of memory management for large column values is different than in previous releases. Some queries that succeeded previously might now fail immediately with an error message. The --read_size option no longer needs to be increased from its default of 8 MB for queries against tables with huge column values. Instead, the query option MAX_ROW_SIZE lets you fine-tune this value at the level of individual queries or sessions. The default for MAX_ROW_SIZE is 512 KB. If your queries process rows with column values totalling more than 512 KB, you might need to take action to avoid problems after upgrading.
Follow these steps to verify if your deployment needs any special setup to deal with the new way of dealing with large rows:
- Check if your impalad daemons are already running with a larger-than-normal value for the --read_size configuration setting.
- Examine all tables to find if any have STRING values that are hundreds of kilobytes or more in length. This information is available under the
Max Size column in the output from the SHOW TABLE STATS statement, after the COMPUTE STATS statement has
been run on the table. In the following example, the S1 column with a maximum length of 700006 could cause an issue by itself, or if a
combination of values from the S1, S2, and S3 columns exceeded the 512 KB MAX_ROW_SIZE value.
show column stats big_strings; +--------+--------+------------------+--------+----------+-------------------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+--------+------------------+--------+----------+-------------------+ | x | BIGINT | 30000 | -1 | 8 | 8 | | s1 | STRING | 30000 | -1 | 700006 | 392625 | | s2 | STRING | 30000 | -1 | 10532 | 9232.6669921875 | | s3 | STRING | 30000 | -1 | 103 | 87.66670227050781 | +--------+--------+------------------+--------+----------+-------------------+
- For each candidate table, run a query to materialize the largest string values from the largest columns all at once. Check if the query fails with a message suggesting to set the
MAX_ROW_SIZE query option.
select count(distinct s1, s2, s3) from little_strings; +----------------------------+ | count(distinct s1, s2, s3) | +----------------------------+ | 30000 | +----------------------------+ select count(distinct s1, s2, s3) from big_strings; WARNINGS: Row of size 692.13 KB could not be materialized in plan node with id 1. Increase the max_row_size query option (currently 512.00 KB) to process larger rows.
If any of your tables are affected, make sure the MAX_ROW_SIZE is set large enough to allow all queries against the affected tables to deal with the large column values:
In SQL scripts run by impala-shell with the -q or -f options, or in interactive impala-shell sessions, issue a statement SET MAX_ROW_SIZE=large_enough_size before the relevant queries:
$ impala-shell -i localhost -q \ 'set max_row_size=1mb; select count(distinct s1, s2, s3) from big_strings'
If large column values are common to many of your tables and it is not practical to set MAX_ROW_SIZE only for a limited number of queries or scripts, use the --default_query_options configuration setting for all your impalad daemons, and include the larger MAX_ROW_SIZE setting as part of the argument to that setting. For example:
If your deployment uses a non-default value for the --read_size configuration setting, remove that setting and let Impala use the default. A high value for --read_size could cause higher memory consumption in CDH 5.13 / Impala 2.10 and higher than in previous versions. The --read_size setting still controls the HDFS I/O read size (which is rarely if ever necessary to change), but no longer affects the spill-to-disk buffer size.
Change Impala catalogd Heap when Upgrading from CDH 5.6 or Lower
The default heap size for Impala catalogd has changed in CDH 5.7 / Impala 2.5 and higher:
- Before 5.7, by default catalogd was using the JVM's default heap size, which is the smaller of 1/4th of the physical memory or 32 GB.
- Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.
For example, on a host with 128 GB physical memory this will result in catalogd heap decreasing from 32 GB to 4 GB.
Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:
jcmd catalogd_pid VM.flags jmap -heap catalogd_pid
Decide on a large enough value for the catalogd heap. You use the JAVA_TOOL_OPTIONS environment variable to set the maximum heap size. For example, the following environment variable setting specifies the maximum heap size of 8 GB.
On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.
On systems not managed by Cloudera Manager, put the JAVA_TOOL_OPTIONS environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.
Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.
List of Reserved Words Updated in CDH 6.0 / Impala 3.0
The list of Impala Reserved Words in Impala was updated in CDH 6.0 / Impala 3.0. If you need to use a reserved word as an identifier, e.g. a table name, enclose the word in back-ticks.
If you need to use the reserved words from previous versions of CDH, set the impalad and catalogd startup option, reserved_words_version, to "2.11.0".
Decimal V2 Used by Default in CDH 6.0 / Impala 3.0
In Impala, two different behaviors of DECIMAL types are supported. In CDH 6.0 / Impala 3.0, DECIMAL V2 is used by default. See DECIMAL Data Type for detail information.
Behavior of Column Aliases Changed in CDH 6.0 / Impala 3.0
To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY. See Overview of Impala Aliases for examples of supported and unsupported aliases syntax.
Default PARQUET_ARRAY_RESOLUTION Changed in CDH 6.0 / Impala 3.0
The default value for the PARQUET_ARRAY_RESOLUTION was changed to THREE_LEVEL in CDH 6.0 / Impala 3.0, to match the Parquet standard 3-level encoding.
See PARQUET_RESOLUTION Query option for the information about the query option.
Enable Clustering Hint for Inserts
In CDH 6.0 / Impala 3.0, the clustered hint is enabled by default. The hint adds a local sort by the partitioning columns to a query plan.
The clustered hint is only effective for HDFS and Kudu tables.
As in previous versions, the noclustered hint prevents clustering. If a table has ordering columns defined, the noclustered hint is ignored with a warning.
Deprecated Query Options Removed in CDH 6.0 / Impala 3.0
refresh_after_connect Impala Shell Option Removed in CDH 6.0 / Impala 3.0
The deprecated refresh_after_connect option was removed from Impala Shell in CDH 6.0 / Impala 3.0
Return Type Changed for EXTRACT and DATE_PART Functions in CDH 6.0 / Impala 3.0
- The output type of the EXTRACT and DATE_PART functions was changed to BIGINT.
Extracting the millisecond part from a TIMESTAMP returns the seconds component and the milliseconds component. For example, EXTRACT (CAST('2006-05-12 18:27:28.123456789' AS TIMESTAMP), 'MILLISECOND') will return 28123.
Impala Roles with SELECT or INSERT Privilege Receive REFRESH Privilege During the Upgrade to CDH 5.16
Due to the Sentry and Impala fine grained privileges feature in CDH 5.16.0 / CDH 6.1, if a role has the SELECT or INSERT privilege on an object in Impala before upgrading to CDH 5.16.0, that role will automatically get the REFRESH privilege during the upgrade.
Default Setting Changes
|Release Changed||Setting||Default Value|
|CDH 5.15 / Impala 2.12||compact_catalog_topicimpalad flag||true|
|Impala 2.12||max_cached_file_handlesimpalad flag||20000|
|CDH 6.0 / Impala 3.0||PARQUET_ARRAY_RESOLUTION query option||THREE_LEVEL|