Impala Upgrade Considerations

Converting Legacy UDFs During Upgrade to CDH 5.12 or Higher

In and higher, new syntax is available for creating Java-based UDFs. UDFs created with the new syntax persist across Impala restarts, and are more compatible with Hive UDFs. Because the replication features in CDH 5.12 and higher only work with the new-style syntax, convert any older Java UDFs to use the new syntax at the same time you upgrade to CDH 5.12 or higher.

Follow these steps to convert old-style Java UDFs to the new persistent kind:

  • Use SHOW FUNCTIONS to identify all UDFs and UDAs.
  • For each function, use SHOW CREATE FUNCTION and save the statement in a script file.
  • For Java UDFs, change the output of SHOW CREATE FUNCTION to use the new CREATE FUNCTION syntax (without argument types), which makes the UDF persistent.
  • For each function, drop it and re-create it, using the new CREATE FUNCTION syntax for all Java UDFs.

Handling Large Rows During Upgrade to or Higher

In and higher, the handling of memory management for large column values is different than in previous releases. Some queries that succeeded previously might now fail immediately with an error message. The --read_size option no longer needs to be increased from its default of 8 MB for queries against tables with huge column values. Instead, the query option MAX_ROW_SIZE lets you fine-tune this value at the level of individual queries or sessions. The default for MAX_ROW_SIZE is 512 KB. If your queries process rows with column values totalling more than 512 KB, you might need to take action to avoid problems after upgrading.

Follow these steps to verify if your deployment needs any special setup to deal with the new way of dealing with large rows:

  1. Check if your impalad daemons are already running with a larger-than-normal value for the --read_size configuration setting.
  2. Examine all tables to find if any have STRING values that are hundreds of kilobytes or more in length. This information is available under the Max Size column in the output from the SHOW TABLE STATS statement, after the COMPUTE STATS statement has been run on the table. In the following example, the S1 column with a maximum length of 700006 could cause an issue by itself, or if a combination of values from the S1, S2, and S3 columns exceeded the 512 KB MAX_ROW_SIZE value.
    show column stats big_strings;
    +--------+--------+------------------+--------+----------+-------------------+
    | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
    +--------+--------+------------------+--------+----------+-------------------+
    | x      | BIGINT | 30000            | -1     | 8        | 8                 |
    | s1     | STRING | 30000            | -1     | 700006   | 392625            |
    | s2     | STRING | 30000            | -1     | 10532    | 9232.6669921875   |
    | s3     | STRING | 30000            | -1     | 103      | 87.66670227050781 |
    +--------+--------+------------------+--------+----------+-------------------+
    
  3. For each candidate table, run a query to materialize the largest string values from the largest columns all at once. Check if the query fails with a message suggesting to set the MAX_ROW_SIZE query option.
    select count(distinct s1, s2, s3) from little_strings;
    +----------------------------+
    | count(distinct s1, s2, s3) |
    +----------------------------+
    | 30000                      |
    +----------------------------+
    
    select count(distinct s1, s2, s3) from big_strings;
    WARNINGS: Row of size 692.13 KB could not be materialized in plan node with id 1.
      Increase the max_row_size query option (currently 512.00 KB) to process larger rows.
    

If any of your tables are affected, make sure the MAX_ROW_SIZE is set large enough to allow all queries against the affected tables to deal with the large column values:

  • In SQL scripts run by impala-shell with the -q or -f options, or in interactive impala-shell sessions, issue a statement SET MAX_ROW_SIZE=large_enough_size before the relevant queries:

    $ impala-shell -i localhost -q \
      'set max_row_size=1mb; select count(distinct s1, s2, s3) from big_strings'
    
  • If large column values are common to many of your tables and it is not practical to set MAX_ROW_SIZE only for a limited number of queries or scripts, use the --default_query_options configuration setting for all your impalad daemons, and include the larger MAX_ROW_SIZE setting as part of the argument to that setting. For example:

    impalad --default_query_options='max_row_size=1gb;appx_count_distinct=true'
    
  • If your deployment uses a non-default value for the --read_size configuration setting, remove that setting and let Impala use the default. A high value for --read_size could cause higher memory consumption in and higher than in previous versions. The --read_size setting still controls the HDFS I/O read size (which is rarely if ever necessary to change), but no longer affects the spill-to-disk buffer size.

Change Impala catalogd Heap when Upgrading from CDH 5.6 or Lower

The default heap size for Impala catalogd has changed in and higher:

  • Before 5.7, by default catalogd was using the JVM's default heap size, which is the smaller of 1/4th of the physical memory or 32 GB.
  • Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.

For example, on a host with 128GB physical memory this will result in catalogd heap decreasing from 32GB to 4GB.

For schemas with large numbers of tables, partitions, and data files, the catalogd daemon might encounter an out-of-memory error. To prevent the error, increase the memory limit for the catalogd daemon:
  1. Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:

      jcmd catalogd_pid VM.flags
      jmap -heap catalogd_pid
      
  2. Decide on a large enough value for the catalogd heap. You express it as an environment variable value as follows:

      JAVA_TOOL_OPTIONS="-Xmx8g"
      
  3. On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.

  4. On systems not managed by Cloudera Manager, put this environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.

  5. Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.

List of Reserved Words Updated in

The list of reserved words in Impala was updated in . If you need to use a reserved word as an identifier, e.g. a table name, enclose the word in back-ticks.

If you need to use the reserved words from previous versions of CDH, set the impalad and catalogd startup option, reserved_words_version, to "2.11.0".

Decimal V2 Used by Default in

In Impala, two different behaviors of DECIMAL types are supported. In , DECIMAL V2 is used by default. See DECIMAL Type for detail information.

If you need to continue using the first version of the DECIMAL type for the backward compatibility of your queries, set the DECIMAL_V2 query option to FALSE:
SET DECIMAL_V2=FALSE;

Behavior of Column Aliases Changed in

To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY. See for examples of supported and unsupported aliases syntax.

Default PARQUET_ARRAY_RESOLUTION Changed in

The default value for the PARQUET_ARRAY_RESOLUTION was changed to THREE_LEVEL in , to match the Parquet standard 3-level encoding.

See for the information about the query option.

Enable Clustering Hint for Inserts

In , the clustered hint is enabled by default. The hint adds a local sort by the partitioning columns to a query plan.

The clustered hint is only effective for HDFS and Kudu tables.

As in previous versions, the noclustered hint prevents clustering. If a table has ordering columns defined, the noclustered hint is ignored with a warning.

Deprecated Query Options Removed in

The following query options have been deprecated for several releases and removed:
  • DEFAULT_ORDER_BY_LIMIT
  • ABORT_ON_DEFAULT_LIMIT_EXCEEDED
  • V_CPU_CORES
  • RESERVATION_REQUEST_TIMEOUT
  • RM_INITIAL_MEM
  • SCAN_NODE_CODEGEN_THRESHOLD
  • MAX_IO_BUFFERS
  • RM_INITIAL_MEM
  • DISABLE_CACHED_READS

refresh_after_connect Impala Shell Option Removed in

The deprecated refresh_after_connect option was removed from Impala Shell in

Return Type Changed for EXTRACT and DATE_PART Functions in

The following changes were made to the EXTRACT and DATE_PART functions:
  • The output type of the EXTRACT and DATE_PART functions was changed to BIGINT.
  • Extracting the millisecond part from a TIMESTAMP returns the seconds component and the milliseconds component. For example, EXTRACT (CAST('2006-05-12 18:27:28.123456789' AS TIMESTAMP), 'MILLISECOND') will return 28123.

Default Setting Changes

Release Changed Setting Default Value
CDH 5.15 & CDH 6.1 / Impala 2.12 compact_catalog_topicimpalad flag true
CDH 5.15 / Impala 2.12 max_cached_file_handlesimpalad flag 20000
CDH 6.0 / Impala 3.0 PARQUET_ARRAY_RESOLUTION query option THREE_LEVEL