Testing Impala Performance
Test to ensure that Impala is configured for optimal performance. If you have installed Impala without Cloudera Manager, complete the processes described in this topic to help ensure a proper configuration. Even if you installed Impala with Cloudera Manager, which automatically applies appropriate configurations, these procedures can be used to verify that Impala is set up correctly.
Checking Impala Configuration Values
You can inspect Impala configuration values by connecting to your Impala server using a browser.
To check Impala configuration values:
- Use a browser to connect to one of the hosts running impalad in your environment. Connect using an address of the form http://hostname:port/varz.
Note: In the preceding example, replace hostname and port with the name and port of your Impala server. The default port is 25000.
- Review the configured values.
For example, to check that your system is configured to use block locality tracking information, you would check that the value for dfs.datanode.hdfs-blocks-metadata.enabled is true.
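The same check can be scripted instead of done in a browser. The following is a minimal sketch, not an official tool: `fetch_varz` and `find_setting` are hypothetical helper names, and because the layout of the /varz page can differ between releases, the sketch only does a loose text search for lines that mention the setting name.

```python
import re
import urllib.request


def fetch_varz(hostname, port=25000, timeout=10):
    """Fetch the raw /varz page from an impalad debug web server."""
    url = f"http://{hostname}:{port}/varz"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_setting(page_text, name):
    """Return every line of the page that mentions `name`, with markup removed.

    Assumption: the setting name and its value appear on the same line.
    """
    matches = []
    for line in page_text.splitlines():
        if name in line:
            # Replace any HTML tags with spaces, then collapse whitespace.
            text = re.sub(r"<[^>]+>", " ", line)
            matches.append(" ".join(text.split()))
    return matches
```

For example, `find_setting(fetch_varz("impalad-host"), "dfs.datanode.hdfs-blocks-metadata.enabled")` returns the matching lines for manual review, where `impalad-host` is a placeholder for one of your own hosts.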
To check data locality:
- Execute a query on a dataset that is available across multiple nodes. For example, for a table named MyTable that has a reasonable chance of being spread across multiple DataNodes:
[impalad-host:21000] > SELECT COUNT(*) FROM MyTable;
- After the query completes, review the contents of the Impala logs. You should find a recent message similar to the following:
Total remote scan volume = 0
A nonzero remote scan volume may indicate that impalad is not running on the correct nodes. This can happen because some DataNodes do not have impalad running, or because the impalad instance that starts the query cannot contact one or more of the other impalad instances.
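Searching the log for this message can be automated. The sketch below assumes the log lines have already been read into a list of strings and that the message has the form shown above, with a number after the equals sign; `remote_scan_volumes` and `has_remote_scans` are hypothetical helper names, and the location of your log files is not assumed here.

```python
import re

# Matches the message shown above and captures the leading number of the value.
REMOTE_SCAN = re.compile(r"Total remote scan volume = (\d+(?:\.\d+)?)")


def remote_scan_volumes(log_lines):
    """Return the remote scan volume reported by each matching log line."""
    volumes = []
    for line in log_lines:
        match = REMOTE_SCAN.search(line)
        if match:
            volumes.append(float(match.group(1)))
    return volumes


def has_remote_scans(log_lines):
    """Return True if any logged remote scan volume is nonzero."""
    return any(volume != 0 for volume in remote_scan_volumes(log_lines))
```

If `has_remote_scans` returns True for the lines logged by your test query, continue with the troubleshooting steps that follow.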
To understand the causes of this issue:
- Connect to the debugging web server, which runs on port 25000 by default. The server lists all impalad instances running in your cluster. If it shows fewer instances than you expect, some DataNodes are probably not running impalad. Ensure impalad is started on all DataNodes.
- If you are using multi-homed hosts, ensure that the Impala daemon's hostname resolves to the interface on which impalad is running. The hostname Impala is using is displayed when impalad starts. To explicitly set the hostname, use the --hostname flag.
- Check that statestored is running as expected. Review the contents of the state store log to ensure all instances of impalad are listed as having connected to the state store.
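Comparing the impalad instances you find against your DataNodes is a simple set difference. The sketch below assumes you have collected both host lists by hand (for example, from the debugging web server and from your cluster inventory); `missing_impalad_hosts` is a hypothetical helper, and hostnames are compared case-insensitively.

```python
def missing_impalad_hosts(datanode_hosts, impalad_hosts):
    """Return the DataNode hosts that have no impalad instance, sorted by name."""
    running = {host.strip().lower() for host in impalad_hosts}
    datanodes = {host.strip().lower() for host in datanode_hosts}
    return sorted(datanodes - running)
```

An empty result means every DataNode in your list has a matching impalad instance. Any hosts returned are candidates for starting impalad or for checking hostname resolution.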
Reviewing Impala Logs
You can review the Impala logs for signs that short-circuit reads or block location tracking are not functioning. Before checking the logs, execute a simple query against a small HDFS dataset. Running a query generates log messages that reflect the current settings. For information on starting Impala and executing queries, see Starting Impala and Using the Impala Shell (impala-shell Command). For information on logging, see Using Impala Logging. Log messages and their interpretations are as follows:
- Log message: Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
  Interpretation: Tracking block locality is not enabled.
- Log message: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Interpretation: Native checksumming is not enabled.
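The two messages above can be looked up mechanically. This is a minimal sketch that assumes the log lines are already in a list of strings; `KNOWN_MESSAGES` and `interpret_log` are hypothetical names, and the table covers only the two messages documented in this topic.

```python
# Distinctive fragment of each documented log message, mapped to its interpretation.
KNOWN_MESSAGES = {
    "Unknown disk id": "Tracking block locality is not enabled.",
    "Unable to load native-hadoop library": "Native checksumming is not enabled.",
}


def interpret_log(log_lines):
    """Return the interpretation of each known message found, without duplicates."""
    findings = []
    for line in log_lines:
        for fragment, meaning in KNOWN_MESSAGES.items():
            if fragment in line and meaning not in findings:
                findings.append(meaning)
    return findings
```

An empty list means that neither message appeared in the lines you supplied. It does not by itself prove that both features are working.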
©2016 Cloudera, Inc. All rights reserved