Upgrading Impala

Upgrading Impala involves stopping Impala services, using your operating system's package management tool to upgrade Impala to the latest version, and then restarting Impala services.

Upgrading Impala through Cloudera Manager - Parcels

Parcels are an alternative binary distribution format available in Cloudera Manager 4.5 and higher.

To upgrade Impala for CDH 4 in a Cloudera Managed environment, using parcels:

  1. If you originally installed using packages and now are switching to parcels, remove all the Impala-related packages first. You can check which packages are installed using one of the following commands, depending on your operating system:

    rpm -qa               # RHEL, Oracle Linux, CentOS, SLES
    dpkg --get-selections # Ubuntu, Debian

    Then remove the packages using one of the following commands:

    sudo yum remove pkg_names    # RHEL, Oracle Linux, CentOS
    sudo zypper remove pkg_names # SLES
    sudo apt-get purge pkg_names # Ubuntu, Debian

  2. Connect to the Cloudera Manager Admin Console.

  3. Go to the Hosts > Parcels tab. You should see a parcel with a newer version of Impala that you can upgrade to.

  4. Click Download, then Distribute. (The button changes as each step completes.)

  5. Click Activate.

  6. When prompted, click Restart to restart the Impala service.
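
For step 1, the full package listing can be long. The following is a minimal sketch of narrowing it to Impala-related packages. The function names and the "impala" search pattern are illustrative assumptions, not part of any Cloudera tool, and the sketch assumes only one of rpm or dpkg is present on the host.

```shell
#!/bin/sh
# Print only the lines of a package listing (read from stdin) that mention Impala.
# The pattern is an assumption; adjust it to match the packages on your hosts.
filter_impala_packages() {
  grep -i 'impala' || true   # "|| true": finding no match is not an error here
}

# List installed Impala-related packages with whichever tool is present.
# Assumes a host has either rpm or dpkg as its package database, not both.
list_impala_packages() {
  if command -v rpm >/dev/null 2>&1; then
    rpm -qa | filter_impala_packages
  elif command -v dpkg >/dev/null 2>&1; then
    dpkg --get-selections | filter_impala_packages
  else
    echo "No rpm or dpkg command found" >&2
    return 1
  fi
}
```

Calling list_impala_packages on a host prints the candidates to pass as pkg_names to the remove command for that operating system. Review the list before removing anything.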

Upgrading Impala through Cloudera Manager - Packages

To upgrade Impala in a Cloudera Managed environment, using packages:

  1. Connect to the Cloudera Manager Admin Console.
  2. In the Services tab, click the Impala service.
  3. Click Actions and click Stop.
  4. Use one of the following sets of commands to update Impala on each Impala node in your cluster:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala
    $ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed
    

    For SUSE systems:

    $ sudo zypper update impala
    $ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed
    

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala
    $ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed
    
  5. Use one of the following sets of commands to update Impala shell on each node on which it is installed:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-shell

    For SUSE systems:

    $ sudo zypper update impala-shell

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-shell
  6. Connect to the Cloudera Manager Admin Console.
  7. In the Services tab, click the Impala service.
  8. Click Actions and click Start.

Upgrading Impala from the Command Line

To upgrade Impala on a cluster by using the command line, run these Linux commands on the appropriate hosts in your cluster:

  1. Stop Impala services.
    1. Stop impalad on each Impala node in your cluster:
      $ sudo service impala-server stop
    2. Stop any instances of the state store in your cluster:
      $ sudo service impala-state-store stop
    3. Stop any instances of the catalog service in your cluster:
      $ sudo service impala-catalog stop
  2. Check if there are new recommended or required configuration settings to put into place in the configuration files, typically under /etc/impala/conf. See Post-Installation Configuration for Impala for settings related to performance and scalability.
  3. Use one of the following sets of commands to update Impala on each Impala node in your cluster:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-server
    $ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo yum update impala-catalog # New in Impala 1.2; do yum install when upgrading from 1.1.
    

    For SUSE systems:

    $ sudo zypper update impala-server
    $ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo zypper update impala-catalog # New in Impala 1.2; do zypper install when upgrading from 1.1.
    

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-server
    $ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed
    $ sudo apt-get install impala-catalog # New in Impala 1.2.
    
  4. Use one of the following sets of commands to update Impala shell on each node on which it is installed:

    For RHEL, Oracle Linux, or CentOS systems:

    $ sudo yum update impala-shell

    For SUSE systems:

    $ sudo zypper update impala-shell

    For Debian or Ubuntu systems:

    $ sudo apt-get install impala-shell
  5. Depending on which release of Impala you are upgrading from, you might find that the symbolic links /etc/impala/conf and /usr/lib/impala/sbin are missing. If so, see Apache Impala Known Issues for the procedure to work around this problem.
  6. Restart Impala services:
    1. Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process named statestored if the service started successfully.
      $ sudo service impala-state-store start
      $ ps ax | grep [s]tatestored
       6819 ?        Sl     0:07 /usr/lib/impala/sbin/statestored -log_dir=/var/log/impala -state_store_port=24000
      

      Restart the state store service before the Impala server service to avoid "Not connected" errors when you run impala-shell.

    2. Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a process named catalogd if the service started successfully.
      $ sudo service impala-catalog start
      $ ps ax | grep [c]atalogd
       6068 ?        Sl     4:06 /usr/lib/impala/sbin/catalogd
      
    3. Restart the Impala daemon service on each node in your cluster. Expect to see a process named impalad if the service started successfully.
      $ sudo service impala-server start
      $ ps ax | grep [i]mpalad
       7936 ?        Sl     0:12 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -state_store_port=24000
       -state_store_host=127.0.0.1 -be_port=22000
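
The start order in step 6 can be captured in a small script. This is a sketch only: it assumes a single host that runs all three services (on a real cluster, run each command only on the hosts where that service is installed), and the DRY_RUN variable and function names are hypothetical conveniences for previewing the commands.

```shell
#!/bin/sh
# Start order from the steps above: state store, then catalog service, then impalad.
impala_start_order() {
  printf '%s\n' impala-state-store impala-catalog impala-server
}

# Start each service in order and stop at the first failure.
# With DRY_RUN=1 (a hypothetical switch) the commands are printed, not run.
start_impala_services() {
  for svc in $(impala_start_order); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "sudo service $svc start"
    else
      sudo service "$svc" start || return 1
    fi
  done
}
```

Running DRY_RUN=1 start_impala_services shows the commands without executing them. After a real run, the ps checks shown in step 6 still apply for confirming that statestored, catalogd, and impalad are up.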
      

Converting Legacy UDFs During Upgrade to CDH 5.12 or Higher

In CDH 5.7 / Impala 2.5 and higher, new syntax is available for creating Java-based UDFs. UDFs created with the new syntax persist across Impala restarts, and are more compatible with Hive UDFs. Because the replication features in CDH 5.12 and higher only work with the new-style syntax, convert any older Java UDFs to use the new syntax at the same time you upgrade to CDH 5.12 or higher.

Follow these steps to convert old-style Java UDFs to the new persistent kind:

  • Use SHOW FUNCTIONS to identify all UDFs and UDAs.

  • For each function, use SHOW CREATE FUNCTION and save the statement in a script file.

  • For Java UDFs, change the output of SHOW CREATE FUNCTION to use the new CREATE FUNCTION syntax (without argument types), which makes the UDF persistent.

  • For each function, drop it and re-create it, using the new CREATE FUNCTION syntax for all Java UDFs.
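
As an illustration of the last two steps, the sketch below generates the DROP and CREATE statements for one Java UDF. The helper name and every argument value in the usage note (database, function name, JAR path, class name) are hypothetical; substitute the values reported by SHOW CREATE FUNCTION on your system.

```shell
#!/bin/sh
# Print statements that drop an old-style Java UDF and re-create it with the
# persistent syntax (no argument types and no RETURNS clause).
# Arguments: function name, old argument types, HDFS path to the JAR, Java class.
convert_java_udf_sql() {
  fn="$1"; old_args="$2"; jar="$3"; class="$4"
  # The old-style function is identified by its argument signature.
  printf 'DROP FUNCTION IF EXISTS %s(%s);\n' "$fn" "$old_args"
  # The new-style definition omits the signature, which makes the UDF persistent.
  printf "CREATE FUNCTION %s LOCATION '%s' SYMBOL='%s';\n" "$fn" "$jar" "$class"
}
```

For example, convert_java_udf_sql udfs.my_lower STRING /user/impala/udfs/my-udfs.jar com.example.MyLower > convert_udfs.sql writes a script that you can review and then run with impala-shell -f convert_udfs.sql.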

Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher

In CDH 5.13 / Impala 2.10 and higher, memory management for large column values differs from previous releases. Some queries that succeeded previously might now fail immediately with an error message. You no longer need to increase the --read_size option from its default of 8 MB for queries against tables with huge column values. Instead, the query option MAX_ROW_SIZE lets you set the maximum row size at the level of individual queries or sessions. The default for MAX_ROW_SIZE is 512 KB. If your queries process rows with column values totalling more than 512 KB, you might need to take action to avoid problems after upgrading.

Follow these steps to check whether your deployment needs any special setup for the new handling of large rows:

  1. Check if your impalad daemons are already running with a larger-than-normal value for the --read_size configuration setting.

  2. Examine all tables to find whether any have STRING values that are hundreds of kilobytes or more in length. This information is available under the Max Size column in the output from the SHOW COLUMN STATS statement, after the COMPUTE STATS statement has been run on the table. In the following example, the S1 column with a maximum length of 700006 could cause an issue by itself. A combination of values from the S1, S2, and S3 columns could also cause an issue if together they exceed the 512 KB MAX_ROW_SIZE value.

    show column stats big_strings;
    +--------+--------+------------------+--------+----------+-------------------+
    | Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
    +--------+--------+------------------+--------+----------+-------------------+
    | x      | BIGINT | 30000            | -1     | 8        | 8                 |
    | s1     | STRING | 30000            | -1     | 700006   | 392625            |
    | s2     | STRING | 30000            | -1     | 10532    | 9232.6669921875   |
    | s3     | STRING | 30000            | -1     | 103      | 87.66670227050781 |
    +--------+--------+------------------+--------+----------+-------------------+
    
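  3. For each candidate table, run a query to materialize the largest string values from the largest columns all at once. Check whether the query fails with a message suggesting that you set the MAX_ROW_SIZE query option. In the following example, the query against little_strings succeeds, while the query against big_strings fails.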

    select count(distinct s1, s2, s3) from little_strings;
    +----------------------------+
    | count(distinct s1, s2, s3) |
    +----------------------------+
    | 30000                      |
    +----------------------------+
    
    select count(distinct s1, s2, s3) from big_strings;
    WARNINGS: Row of size 692.13 KB could not be materialized in plan node with id 1.
      Increase the max_row_size query option (currently 512.00 KB) to process larger rows.
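
The check in step 2 can be scripted when there are many tables. The sketch below reads SHOW COLUMN STATS output in the table layout shown above and flags columns whose Max Size exceeds a limit (512 KB by default). The function name is hypothetical, and the sum of maximum sizes is only a rough upper bound for comparison, not the row size Impala computes internally.

```shell
#!/bin/sh
# Read SHOW COLUMN STATS output on stdin and report columns whose Max Size
# exceeds the limit in bytes (default 524288, the 512 KB MAX_ROW_SIZE default).
# Also prints the sum of all Max Size values as a rough upper bound.
flag_large_columns() {
  limit="${1:-524288}"
  awk -F'|' -v limit="$limit" '
    # Data rows have at least 7 pipe-separated fields; skip the header row.
    NF >= 7 && $2 !~ /Column/ {
      name = $2; size = $6          # field 2 is Column, field 6 is Max Size
      gsub(/ /, "", name); gsub(/ /, "", size)
      if (size ~ /^[0-9]+$/) {
        total += size
        if (size + 0 > limit) print name " max_size=" size
      }
    }
    END { printf "sum_of_max_sizes=%d\n", total }
  '
}
```

One way to use it is to pipe query output into the function, for example impala-shell -i localhost -q 'show column stats big_strings' | flag_large_columns. Any column it reports, or a sum well above the limit, marks the table as a candidate for the test query in step 3.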
    

If any of your tables are affected, make sure MAX_ROW_SIZE is set large enough for all queries against the affected tables to handle the large column values:

  • In SQL scripts run by impala-shell with the -q or -f options, or in interactive impala-shell sessions, issue a statement SET MAX_ROW_SIZE=large_enough_size before the relevant queries:

    $ impala-shell -i localhost -q \
      'set max_row_size=1mb; select count(distinct s1, s2, s3) from big_strings'
    
  • If large column values are common to many of your tables and it is not practical to set MAX_ROW_SIZE only for a limited number of queries or scripts, use the --default_query_options configuration setting for all your impalad daemons, and include the larger MAX_ROW_SIZE setting as part of the argument to that setting. For example:

    impalad --default_query_options='max_row_size=1gb;appx_count_distinct=true'
    
  • If your deployment uses a non-default value for the --read_size configuration setting, remove that setting and let Impala use the default. A high value for --read_size could cause higher memory consumption in CDH 5.13 / Impala 2.10 and higher than in previous versions. The --read_size setting still controls the HDFS I/O read size (which is rarely if ever necessary to change), but no longer affects the spill-to-disk buffer size.