Using Snappy for MapReduce Compression

It's very common to enable MapReduce intermediate compression, since this can make jobs run faster without you having to make any application changes. Only the temporary intermediate files created by Hadoop for the shuffle phase are compressed (the final output may or may not be compressed). Snappy is ideal in this case because it compresses and decompresses very fast compared to other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing a Data Compression Format.

To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:

  • For MRv1:
<property>
  <name>mapred.compress.map.output</name>  
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>  
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
  • For YARN:
<property>
  <name>mapreduce.map.output.compress</name>  
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>  
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

You can also set these properties on a per-job basis.

Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.

MRv1 Property

YARN Property

Description

mapred.output.compress

mapreduce.output.fileoutputformat.compress

Whether to compress the final job outputs (true or false)

mapred.output.compression.codec

mapreduce.output.fileoutputformat.compress.codec

If the final job outputs are to be compressed, which codec should be used. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.

mapred.output.compression.type

mapreduce.output.fileoutputformat.compress.type

For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended.