Using Avro with Hive

The following example demonstrates how to create a Hive table that is backed by Avro data files:

CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name":"number",
      "type":"int",
      "doc":"Order of playing the role"
    },
    {
      "name":"first_name",
      "type":"string",
      "doc":"first name of actor playing role"
    },
    {
      "name":"last_name",
      "type":"string",
      "doc":"last name of actor playing role"
    },
    {
      "name":"extra_field",
      "type":"string",
      "doc:":"an extra field not in the original file",
      "default":"fishfingers and custard"
    }
  ]
}');

LOAD DATA LOCAL INPATH '/usr/share/doc/hive-0.7.1+42.55/examples/files/doctors.avro' INTO TABLE doctors;

You could also create an Avro backed Hive table by using an Avro schema file:

CREATE TABLE my_avro_table(notused INT)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///tmp/schema.avsc')
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

The avro.schema.url is a URL (here a file:// URL) pointing to an Avro schema file that is used for reading and writing, it could also be an hdfs URL, eg. hdfs://hadoop-namenode-uri/examplefile

To enable Snappy compression on output files, run the following before writing to the table:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

You will also need to include the snappy-java JAR in --auxpath. The snappy-java JAR is located at:

/usr/lib/hive/lib/snappy-java-1.0.4.1.jar

Haivvreo SerDe has been merged into Hive as AvroSerDe, and it is no longer supported in its original form. schema.url and schema.literal have been changed to avro.schema.url and avro.schema.literal as a result of the merge. If you were you using Haivvreo SerDe, you can use the new Hive AvroSerDe with tables created with the Haivvreo SerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe, you can do the following to make the table use the new AvroSerDe:

ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';

ALTER TABLE my_avro_table SET FILEFORMAT
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Categories: Avro | File Formats | Hive | All Categories

Using Avro with Pig

Using the Parquet File Format with Impala, Hive, Pig, and MapReduce