Optimizing Performance in CDH

This section provides solutions to some performance problems, and describes configuration best practices.

Disabling Transparent Hugepage Compaction

Most Linux platforms supported by CDH 5 include a feature called transparent hugepage compaction which interacts poorly with Hadoop workloads and can seriously degrade performance.

Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.

What to do:
  1. To see whether transparent hugepage compaction is enabled, run the following command and check the output:
    $ cat defrag_file_pathname
    • [always] never means that transparent hugepage compaction is enabled.
    • always [never] means that transparent hugepage compaction is disabled.
  2. To disable transparent hugepage compaction, add the following command to /etc/rc.local:
     echo never > defrag_file_pathname

You can also disable transparent hugepage compaction interactively (but remember this will not survive a reboot).

To disable transparent hugepage compaction temporarily as root:
# echo 'never' > defrag_file_pathname 
To disable transparent hugepage compaction temporarily using sudo:
$ sudo sh -c "echo 'never' > defrag_file_pathname" 

Setting the vm.swappiness Linux Kernel Parameter

vm.swappiness is a Linux kernel parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value between 0-100; the higher the value, the more aggressive the kernel is in seeking out inactive memory pages and swapping them to disk.

You can see what value vm.swappiness is currently set to by looking at /proc/sys/vm; for example:

cat /proc/sys/vm/swappiness

On most systems, it is set to 60 by default. This is not suitable for Hadoop cluster nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 10 or less; for example:

# sysctl -w vm.swappiness=10 

Cloudera previously recommended a setting of 0, but in recent kernels (such as those included with RedHat 6.4 and higher, and Ubuntu 12.04 LTS and higher) a setting of 0 might lead to out of memory issues per this blog post: http://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/.

Improving Performance in Shuffle Handler and IFile Reader

The MapReduce shuffle handler and IFile reader use native Linux calls (posix_fadvise(2) and sync_data_range) on Linux systems with Hadoop native libraries installed. The subsections that follow provide details.

Shuffle Handler

You can improve MapReduce shuffle handler performance by enabling shuffle readahead. This causes the TaskTracker or Node Manager to pre-fetch map output before sending it over the socket to the reducer.

  • To enable this feature for YARN, set the mapreduce.shuffle.manage.os.cache property to true (default). To further tune performance, adjust the value of the mapreduce.shuffle.readahead.bytes property. The default value is 4MB.
  • To enable this feature for MRv1, set the mapred.tasktracker.shuffle.fadvise property to true (default). To further tune performance, adjust the value of the mapred.tasktracker.shuffle.readahead.bytes property. The default value is 4MB.

IFile Reader

Enabling IFile readahead increases the performance of merge operations. To enable this feature for either MRv1 or YARN, set the mapreduce.ifile.readahead property to true (default). To further tune the performance, adjust the value of the mapreduce.ifile.readahead.bytes property. The default value is 4MB.

Best Practices for MapReduce Configuration

The configuration settings described below can reduce inherent latencies in MapReduce execution. You set these values in mapred-site.xml.

Send a heartbeat as soon as a task finishes

Set the mapreduce.tasktracker.outofband.heartbeat property to true to let the TaskTracker send an out-of-band heartbeat on task completion to reduce latency; the default value is false:

<property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
</property>

Reduce the interval for JobClient status reports on single node systems

The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.

<property>
    <name>jobclient.progress.monitor.poll.interval</name>
    <value>10</value>
</property>

Tune the JobTracker heartbeat interval

Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.

<property>
    <name>mapreduce.jobtracker.heartbeat.interval.min</name>
    <value>10</value>
</property>

Start MapReduce JVMs immediately

The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.

<property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0</value>
</property>

Tips and Best Practices for Jobs

This section describes changes you can make at the job level.

Use the Distributed Cache to Transfer the Job JAR

Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() methods.

To add JARs to the classpath, use -libjars jar1,jar2, which will copy the local JAR files to HDFS and then use the distributed cache mechanism to make sure they are available on the task nodes and are added to the task classpath.

The advantage of this over JobConf.setJar is that if the JAR is on a task node it won't need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.

For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.

Changing the Logging Level on a Job (MRv1)

You can change the logging level for an individual job. You do this by setting the following properties in the job configuration (JobConf):

  • mapreduce.map.log.level
  • mapreduce.reduce.log.level

Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.

Example:

JobConf conf = new JobConf();
...

conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...