Data Cache for Remote Reads

When Impala compute nodes and its storage are not co-located, the network bandwidth requirement goes up as the network traffic includes the data fetch as well as the shuffling exchange traffic of intermediate results.

To mitigate the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as, remote HDFS data node, S3, ABFS, ADLS.

To enable remote data cache:
  1. In Cloudera Manager, navigate to Clusters > Impala Service > .
  2. In the Configuration tab, select Impala Daemon in Scope, and select Advanced in Category.
  3. Enter the following in the Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve) field, set the --data_cache Impala Daemon start-up flag as below:
    --data_cache=dir1,dir2,dir3,...:quota

    The flag is set to a list of directories, separated by ,, followed by a :, and a capacity quota per directory.

    The directories must be specified in absolute path.

  4. Click Save Changes and restart the Impala service.

If set to an empty string, data caching is disabled.

Cached data is stored in the specified directories.

The specified directories must exist in the local filesystem of each Impala Daemon, or Impala will fail to start.

In addition, the filesystem which the directory resides in must support hole punching.

The cache can consume up to the quota bytes for each of the directories specified.

The default setting for --data_cache is an empty string.

For example, with the following setting, the data cache may use up to 2 TB, with 1 TB max in /data/0 and /data/1 respectively.

--data_cache=/data/0,/data/1:1TB