This is the documentation for Cloudera Manager 4.8.5.
Documentation for other versions is available at Cloudera Documentation.

Installing Impala with Cloudera Manager

To use Cloudera Impala, you must install CDH and Impala (Hive is required and is installed with CDH). Install CDH and Impala on the nodes that will run Impala. Use only one of the following ways to deploy CDH and Impala:

  • CDH and Impala -
    • Installation Path A - Automated Installation by Cloudera Manager: Installs Cloudera Manager, CDH, and Impala as part of the process of using the pre-packaged installer. This method installs all the software necessary to run Cloudera Impala, sets up the Hive metastore using the default PostgreSQL database, and starts the Impala Service along with the other CDH and Cloudera Manager services. Within the installation wizard you can install Impala using either packages or parcels.
    • Installation Path B - Installation Using Your Own Method: Installs Cloudera Manager, CDH, and Impala, specifying each package individually using package management tools. If you follow this method, you prepare for the installation by installing the Oracle JDK, creating databases, and determining how you will download packages. Once you have completed these prerequisites, you install CDH and Cloudera Manager Server using packages. Next, you configure a database for Cloudera Manager and then install Cloudera Manager Agents. Finally, you start the Cloudera Manager Server and Agents and then configure services using the Cloudera Manager Admin console.
  • Impala - If you already have CDH installed, download, distribute, and activate an Impala parcel as described in Managing Parcels. Cloudera Manager is configured with a default version of Impala. If you want to choose a different version, configure the Impala parcel repository (a subdirectory of http://mirror.infra.cloudera.com/archive/impala/parcels/) as described in Parcel Configuration Settings.
  Warning: Cloudera Manager 4.8 supports only Impala 1.2; it does not support Impala 1.1.1 or earlier. (For more information, see http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Release-Notes/cmrn_incompat_changes.html.)

Managing Resources for Impala

Once you have installed Impala, you can coordinate its use of cluster resources in relation to MapReduce needs for the same resources. See Setting up a Multi-tenant Cluster for Impala and MapReduce below, as well as Resource Management in Managing Clusters with Cloudera Manager.

Running Impala with CDH 4.1

  Note: If you are running CDH 4.1, and the Bypass Hive Metastore Server option is enabled, you must add the following to the Impala Safety Valve for hive-site.xml, replacing <hive_metastore_server_host> with the name of your Hive metastore server host:
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<hive_metastore_server_host>:9083</value>
</property>
Otherwise, Impala queries will fail.
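
As an illustration only, the following Python sketch generates the two properties from the note above for a given metastore host, which can help avoid stray whitespace or typos when pasting into the Safety Valve. The function name metastore_properties and the host metastore01.example.com are hypothetical; only the property names, the value false, and port 9083 come from the note.

```python
import xml.etree.ElementTree as ET


def metastore_properties(host, port=9083):
    """Return hive-site.xml <property> snippets for a remote Hive metastore.

    `host` is the Hive metastore server host; 9083 is the port shown in
    the note above. This helper is a hypothetical convenience, not part
    of Cloudera Manager.
    """
    settings = [
        ("hive.metastore.local", "false"),
        ("hive.metastore.uris", f"thrift://{host}:{port}"),
    ]
    snippets = []
    for name, value in settings:
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # Serialize each property on a single line, with no padding
        # inside the <name> or <value> elements.
        snippets.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(snippets)


if __name__ == "__main__":
    # Example host name is an assumption; substitute your own.
    print(metastore_properties("metastore01.example.com"))
```

The printed text can then be pasted into the Impala Safety Valve for hive-site.xml.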

Configuring Hive Table Statistics

Configuring Hive table statistics is highly recommended when using Impala. Statistics allow Impala to make optimizations that can result in significant (over 10x) performance improvements for some joins. If statistics are not available, Impala still functions, but with lower performance.

Configuring Hive to Store Statistics in MySQL

By default, Hive writes statistics to a Derby database backed by a file named /var/lib/hive/TempStatsStore. However, in production systems Cloudera recommends that you store statistics in a separate database server. Hive table statistics are not supported for PostgreSQL or Oracle. To configure Hive to store statistics in MySQL:
  1. Set up a MySQL server. For instructions on setting up MySQL, see Installing and Configuring a MySQL Database.

    This database will be heavily loaded, so it should not be installed on the same host as anything critical such as the Hive Metastore Server, the database hosting the Hive Metastore, or Cloudera Manager Server. When collecting statistics on a large table and/or in a large cluster, this host may become slow or unresponsive.

  2. Create a statistics database in MySQL:
    mysql> create database stats_db_name DEFAULT CHARACTER SET utf8;
    Query OK, 1 row affected (0.00 sec)
    
    mysql> grant all on stats_db_name.* TO 'stats_user'@'%' IDENTIFIED BY 'stats_password';
    Query OK, 0 rows affected (0.00 sec)
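  3. Add the following to the HiveServer2 Configuration Safety Valve for hive-site.xml, replacing <stats_mysql_host>, <stats_db_name>, <stats_user>, and <stats_password> with the values for your statistics database: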
    <property>
      <name>hive.stats.dbclass</name>
      <value>jdbc:mysql</value>
    </property>
    <property>
      <name>hive.stats.jdbcdriver</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>hive.stats.dbconnectionstring</name>
      <value>jdbc:mysql://<stats_mysql_host>:3306/<stats_db_name>?useUnicode=true&amp;characterEncoding=UTF-8&amp;user=<stats_user>&amp;password=<stats_password></value>
    </property>
    <property>
      <name>hive.aux.jars.path</name>
      <value>file:///usr/share/java/mysql-connector-java.jar</value>
    </property>
  4. Restart HiveServer2.
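
The hive.stats.dbconnectionstring value in step 3 is easy to get wrong by hand: it should be one unbroken string, and each & separator must be written as &amp; inside the XML. The Python sketch below shows one way to assemble it, under stated assumptions. The function names stats_connection_string and as_xml_value, and the example host, database, user, and password, are hypothetical; the URL layout, port 3306, and query parameters are taken from step 3. The sketch does not handle passwords containing characters that are special in URLs.

```python
from xml.sax.saxutils import escape


def stats_connection_string(host, db, user, password, port=3306):
    """Build the JDBC URL for the Hive statistics database as one line."""
    params = [
        ("useUnicode", "true"),
        ("characterEncoding", "UTF-8"),
        ("user", user),
        ("password", password),
    ]
    # Join parameters with a plain '&'; XML escaping happens separately.
    query = "&".join(f"{key}={value}" for key, value in params)
    return f"jdbc:mysql://{host}:{port}/{db}?{query}"


def as_xml_value(text):
    """Wrap text in a <value> element, escaping &, <, and > for XML."""
    return f"<value>{escape(text)}</value>"


if __name__ == "__main__":
    # All four arguments are placeholder values; substitute your own.
    url = stats_connection_string(
        "db01.example.com", "stats_db", "stats_user", "secret"
    )
    print(as_xml_value(url))
```

Keeping URL assembly and XML escaping as separate steps means the same unescaped URL can also be used directly with a MySQL client when checking connectivity.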

Configuring Secure Access for the Impala Web Server

Cloudera Manager supports two methods of authentication for secure access to the Impala web server interfaces: password-based authentication and SSL certificate support. Both of these can be configured through properties of the Impala and Impala StateStore daemons. See Configuring Secure Access for the Impala Web Server.