Using Avro with Pig

CDH provides AvroStorage for Avro integration in Pig.

To use it, first register the piggybank JAR file and supporting libraries:

REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-1.0.4.1.jar

Then you can load Avro data files as follows:

a = LOAD 'my_file.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

Pig maps the Avro schema to a corresponding Pig schema.

You can store data in Avro data files with:

store b into 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

In the case of store, Pig generates an Avro schema from the Pig schema. It is possible to override the Avro schema, either by specifying it literally as a parameter to AvroStorage, or by using the same schema as an existing Avro data file. See the Pig wiki for details.

To store two relations in one script, specify an index to each store function. Here is an example:

set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');

For more information, see the AvroStorage wiki; look for "index".

To enable Snappy compression on output files do the following before issuing the STORE statement:

SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy

There is some additional documentation on the Pig wiki. Note, however, that the version numbers of the JAR files to register are different on that page, so you should adjust them as shown above.