Cloudera Impala Requirements
To perform as expected, Impala depends on the availability of the software, hardware, and configurations described in the following sections.
Product Compatibility Matrix
Supported Operating Systems
Supported 64-bit operating systems:
- Red Hat Enterprise Linux (RHEL), Oracle Linux, or CentOS.
On version 5 of Red Hat Enterprise Linux and comparable distributions, some additional setup is needed for the impala-shell interpreter to connect to a Kerberos-enabled Impala cluster:
sudo yum install python-devel openssl-devel python-pip
sudo pip-python install ssl
- Ubuntu or Debian.
- For the relevant supported OS versions see the Supported Operating Systems page for CDH 4 or CDH 5.
Supported CDH Versions
Cloudera supports Impala only under the CDH Hadoop distribution. The following list shows the mapping between Impala versions and the earliest corresponding CDH version:
- CDH 4.1 or later 4.x, or CDH 5.1.0 or later 5.1.x, for Impala 1.4.0.
- CDH 4.1 or later 4.x, or CDH 5.0.1 or later 5.0.x, for Impala 1.3.1.
- CDH 5.0 or later for Impala 1.3.0. (The first Impala 1.3.x release for use with CDH 4 is 1.3.1.)
- CDH 4.1 or later 4.x for Impala 1.2.4.
- CDH 4.1 or later 4.x, or CDH 5 beta 2, for Impala 1.2.3.
- CDH 4.1 or later 4.x for Impala 1.2.2 and 1.2.1. (Impala 1.2.2 and 1.2.1 do not work with CDH 5 beta due to differences with packaging and dependencies.)
- CDH 5 beta 1 for Impala 1.2.0.
- CDH 4.1 or later for Impala 1.1.
- CDH 4.1 or later for Impala 1.0.
- CDH 4.1 or later for Impala 0.7.
- CDH 4.2 or later for Impala 0.6.
- CDH 4.1 for Impala 0.5 and earlier. This combination is only supported on RHEL/CentOS.
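The mapping above can be captured in a small lookup for use in install scripts. This is a minimal sketch: the function name cdh_for_impala is hypothetical, and the strings are copied from the list rather than inferred. Impala 0.5 and earlier are left out because the list does not name each of those versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CDH requirement listed above for a given
# Impala version. Returns non-zero for versions the list does not name.
cdh_for_impala() {
    case "$1" in
        1.4.0)        echo "CDH 4.1 or later 4.x, or CDH 5.1.0 or later 5.1.x" ;;
        1.3.1)        echo "CDH 4.1 or later 4.x, or CDH 5.0.1 or later 5.0.x" ;;
        1.3.0)        echo "CDH 5.0 or later" ;;
        1.2.4)        echo "CDH 4.1 or later 4.x" ;;
        1.2.3)        echo "CDH 4.1 or later 4.x, or CDH 5 beta 2" ;;
        1.2.2|1.2.1)  echo "CDH 4.1 or later 4.x" ;;
        1.2.0)        echo "CDH 5 beta 1" ;;
        1.1|1.0|0.7)  echo "CDH 4.1 or later" ;;
        0.6)          echo "CDH 4.2 or later" ;;
        *)            echo "unknown"; return 1 ;;
    esac
}

cdh_for_impala 1.3.0
```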
Hive Metastore and Related Configuration
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata about schema objects such as tables and columns. The following components are prerequisites for Impala:
- MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without the metastore database. For the process of installing and configuring the metastore, see Installing Impala.
Always configure a Hive metastore service rather than connecting directly to the metastore database. The metastore service is required to interoperate between possibly different levels of metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the metastore database. The metastore service is set up for you by default if you install through Cloudera Manager 4.5 or later.
A summary of the metastore installation process is as follows:
- Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
- Download the MySQL connector or the PostgreSQL connector and place it in the /usr/share/java/ directory.
- Use the appropriate command line tool for your database to create the metastore database.
- Use the appropriate command line tool for your database to grant privileges for the metastore database to the hive user.
- Modify hive-site.xml to include information matching your particular database: its URL, user name, and password. You will copy the hive-site.xml file to the Impala Configuration Directory later in the Impala installation process.
- Optional: Hive. Although only the Hive metastore database is required for Impala to function, you might install Hive on some client machines to create and load data into tables that use certain file formats. See How Impala Works with Hadoop File Formats for details. Hive does not need to be installed on the same data nodes as Impala; it just needs access to the same metastore database.
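As an illustration of the hive-site.xml step in the metastore summary, the following sketch prints the JDBC connection properties for a MySQL-backed metastore. The function name, host name, database name, and password are hypothetical placeholders. The property names are the standard Hive javax.jdo.option settings, and the driver class assumes the MySQL connector. Values containing XML special characters would need escaping, which this sketch does not do.

```shell
#!/usr/bin/env bash
# Print the JDBC connection properties to place inside <configuration> in
# hive-site.xml. Arguments (example values only): host, database, user, password.
metastore_properties() {
    local host="$1" db="$2" user="$3" password="$4"
    cat <<EOF
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://${host}/${db}</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>${user}</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>${password}</value>
</property>
EOF
}

# Hypothetical values; substitute those of your own metastore database.
metastore_properties metastore-host.example.com metastore hive 'example-password'
```

For PostgreSQL, the URL prefix and driver class differ; consult the connector documentation for the exact values.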
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop components:
- The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically resulting in a failure at impalad startup. In particular, the JamVM used by default on certain levels of Ubuntu systems can cause impalad to fail to start.
- Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java libraries. Make sure the impalad service is not run from an environment with an incorrect setting for this variable.
- All Java dependencies are packaged in the impala-dependencies.jar file, which is located at /usr/lib/impala/lib/. These map to everything that is built under fe/target/dependency.
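Because impalad depends on JAVA_HOME, a quick sanity check before starting the service can catch a bad setting early. The sketch below uses a hypothetical helper, check_java_home, which only verifies that the directory contains an executable bin/java. It does not confirm that the JVM is the Oracle JVM.

```shell
#!/usr/bin/env bash
# Return success if the given directory looks like a usable Java home,
# that is, it is non-empty and contains an executable bin/java.
check_java_home() {
    local dir="$1"
    [ -n "$dir" ] && [ -x "$dir/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no executable bin/java: '${JAVA_HOME:-}'"
fi
```

To see which JVM a given setting selects, running "$JAVA_HOME/bin/java" -version prints the vendor and version.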
Packages and Repositories
Impala requires installation packages or properly configured repositories. You can install Impala manually using packages, through the Cloudera Impala public repositories, or from your own custom repository. To install using the Cloudera Impala repository, download the appropriate repo or list file and install it on each machine on which you intend to install Impala or Impala Shell, as follows:
- Red Hat 5 repo file (http://archive.cloudera.com/impala/redhat/5/x86_64/impala/cloudera-impala.repo) in /etc/yum.repos.d/.
- Red Hat 6 repo file (http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo) in /etc/yum.repos.d/.
- SUSE repo file (http://archive.cloudera.com/impala/sles/11/x86_64/impala/cloudera-impala.repo) in /etc/zypp/repos.d/.
- Ubuntu 10.04 list file (http://archive.cloudera.com/impala/ubuntu/lucid/amd64/impala/cloudera.list) in /etc/apt/sources.list.d/.
- Ubuntu 12.04 list file (http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/cloudera.list) in /etc/apt/sources.list.d/.
- Debian list file (http://archive.cloudera.com/impala/debian/squeeze/amd64/impala/cloudera.list) in /etc/apt/sources.list.d/.
For example, on a Red Hat 6 system, you might run a sequence of commands like the following:
$ cd /etc/yum.repos.d
$ sudo wget http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo
$ ls
CentOS-Base.repo CentOS-Debuginfo.repo CentOS-Media.repo Cloudera-cdh.repo cloudera-impala.repo
You can retrieve files from the archive.cloudera.com site through wget or curl, but not through rsync.
Optionally, you can install and manage Impala through the Cloudera Manager product. Impala 1.4.0 requires Cloudera Manager 4.8 or higher, although Cloudera Manager 5.1 or higher is recommended to take advantage of configuration settings for new Impala features such as admission control, load-balancing support on Kerberos clusters, and the Impala Best Practices page in Cloudera Manager.
In a Cloudera Manager environment, the catalog service is not recognized or managed by Cloudera Manager versions prior to 4.8. Cloudera Manager 4.8 and higher require the catalog service to be present for Impala. Therefore, if you upgrade to Cloudera Manager 4.8 or higher, you must also upgrade Impala to 1.2.1 or higher. Likewise, if you upgrade Impala to 1.2.1 or higher, you must also upgrade Cloudera Manager to 4.8 or higher.
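The paired upgrade rule can be expressed as a simple check: Cloudera Manager 4.8 or higher and Impala 1.2.1 or higher must go together. The sketch below is an assumption-laden illustration, not a Cloudera tool. The function names are hypothetical, the comparison relies on GNU sort -V, and releases the text does not address (such as Impala 1.2.0) are treated as older than 1.2.1.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (requires GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cm_impala_compatible CM_VERSION IMPALA_VERSION
# Succeeds when both versions are on the same side of the catalog-service
# boundary: Cloudera Manager >= 4.8 together with Impala >= 1.2.1, or both older.
cm_impala_compatible() {
    local cm_new="no" impala_new="no"
    if version_ge "$1" "4.8";   then cm_new="yes";     fi
    if version_ge "$2" "1.2.1"; then impala_new="yes"; fi
    [ "$cm_new" = "$impala_new" ]
}

if cm_impala_compatible "4.8" "1.1"; then
    echo "compatible"
else
    echo "upgrade Cloudera Manager and Impala together"
fi
```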
When you install through Cloudera Manager, you can install either with the OS-specific
Networking Configuration Requirements
As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using network connections to work with remote data. To support this goal, Impala matches the hostname provided to each Impala daemon with the IP address of each datanode by resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface for the datanode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag resolves to the IP address of the datanode. For single-homed machines, this is usually automatic, but for multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log in a message of the form:
Using hostname: impala-daemon-1.cloudera.com
In the majority of cases, this automatic detection works correctly. If you need to explicitly set the hostname, do so by setting the --hostname flag.
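On a multi-homed machine, it can help to confirm that the name given to the Impala daemon resolves to the datanode's address before starting the service. The following sketch assumes a Linux system with glibc's getent available. The function name resolves_to and the address 192.0.2.10 are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# resolves_to HOSTNAME IP: succeed if HOSTNAME resolves to the IPv4 address IP.
resolves_to() {
    getent ahostsv4 "$1" 2>/dev/null | awk '{print $1}' | grep -qxF "$2"
}

# Example values only; substitute the datanode address for this machine.
impalad_hostname="$(hostname -f 2>/dev/null || hostname 2>/dev/null || echo localhost)"
datanode_ip="192.0.2.10"

if resolves_to "$impalad_hostname" "$datanode_ip"; then
    echo "OK: $impalad_hostname resolves to $datanode_ip"
else
    echo "Mismatch: set --hostname to a name that resolves to $datanode_ip"
fi
```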
During join operations, portions of data from each joined table are loaded into memory. Data sets can be very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following is generally recommended:
- CPU - Impala uses the SSE4.2 instruction set, which is included in newer processors. Impala can use older processors as long as they support the SSE3 instruction set, but for best performance use:
- Intel - Nehalem (released 2008) or later processors.
- AMD - Bulldozer (released 2011) or later processors.
- Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query is cancelled. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
- Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.
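To see which of the instruction sets mentioned above a machine offers, you can inspect the CPU flags that Linux reports. This sketch assumes an x86 Linux host with /proc/cpuinfo. The function name cpu_sse_level is hypothetical. Linux lists SSE3 under the flag name pni and SSE4.2 as sse4_2.

```shell
#!/usr/bin/env bash
# cpu_sse_level [CPUINFO_FILE]: print "sse4_2", "sse3", or "none" based on the
# first "flags" line of a /proc/cpuinfo-style file (Linux on x86 only).
cpu_sse_level() {
    local file="${1:-/proc/cpuinfo}" flags
    flags="$(grep -m 1 '^flags' "$file" 2>/dev/null || true)"
    if echo "$flags" | grep -qw 'sse4_2'; then
        echo "sse4_2"
    elif echo "$flags" | grep -qw 'pni'; then
        echo "sse3"    # Linux reports SSE3 as "pni"
    else
        echo "none"
    fi
}

cpu_sse_level
```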
User Account Requirements
Impala creates and uses a user and group named impala. Do not delete this account or group and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in a white-list, add these accounts to the list of permitted accounts.
For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components), the impala user must be a member of the hdfs group. This setup is performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala 1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the impala user to the hdfs group. For Llama installation instructions, see Llama installation.
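After an upgrade, the group membership described above can be checked and, if needed, corrected by hand. The sketch below uses a hypothetical helper, in_group, and prints the usermod command rather than running it. It assumes a Linux system where local accounts are managed with the standard id and usermod tools.

```shell
#!/usr/bin/env bash
# in_group USER GROUP: succeed if USER is a member of GROUP.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Only report when the impala account exists but lacks the hdfs group.
if id impala >/dev/null 2>&1 && ! in_group impala hdfs; then
    echo "impala is not in the hdfs group; to add it, run:"
    echo "  sudo usermod -a -G hdfs impala"
fi
```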
For correct file deletion during DROP TABLE operations, Impala must be able to move files to the HDFS trashcan. You might need to create an HDFS directory /user/impala, writeable by the impala user, so that the trashcan can be created. Otherwise, data files might remain behind after a DROP TABLE statement.
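One way to create that directory is with the hadoop fs commands, run as the HDFS superuser. The sketch below assumes the superuser account is named hdfs and that sudo is available. The run and setup_impala_home helpers are hypothetical, and by default they only print the commands so you can review them first.

```shell
#!/usr/bin/env bash
# run CMD...: print the command when DRY_RUN is 1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create /user/impala in HDFS and make the impala user its owner,
# so that the trashcan can be created under it.
setup_impala_home() {
    run sudo -u hdfs hadoop fs -mkdir /user/impala
    run sudo -u hdfs hadoop fs -chown impala /user/impala
}

setup_impala_home
```

Set DRY_RUN=0 in the environment to execute the commands instead of printing them.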
Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not permitted to use direct reads. Therefore, running Impala as root negatively affects performance.
By default, any user can connect to Impala and access all the associated databases and tables. You can enable authorization and authentication based on the Linux OS user who connects to the Impala server, and the associated groups for that user. See Impala Security for details. These security features do not change the underlying file permission requirements; the impala user still needs to be able to access the data files.