Cloudera Blog · Avro Posts

Tracing with Avro

Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.

In my summer internship project at Cloudera, I added RPC tracing as a first-class feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC functionality.

Avro 1.3.0

Avro was added to the Hadoop family last April, and last year saw three Apache Avro releases: 1.0.0 in July, 1.1.0 in September, and 1.2.0 in October.  After the 1.2.0 release, Doug Cutting introduced “Avro: a New Format for Data Interchange” on this blog, and the Avro team went right to work building the next release of Avro.

It’s a new year and there’s a new Avro: 1.3.0.

Starting with Avro 1.3.0, the Avro team is releasing packages specially tailored to consumers of each language.  For example, Python users can download an egg, Java users can manage jars using Maven and C/C++ users can grab an autotools package ready to `./configure; make`.  Speaking of languages, we’re thrilled to announce that there’s a Ruby implementation for Avro now!

Avro: a New Format for Data Interchange

Avro is a recent addition to Apache’s Hadoop family of projects.  Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.

Background

We’d like data-driven applications to be dynamic: folks should be able to rapidly combine datasets from different sources.  We want to facilitate novel, innovative exploration of data.  Someone should, for example, ideally be able to easily correlate point-of-sale transactions, web site visits, and externally provided demographic data, without a lot of preparatory work.  This should be possible on-the-fly, using scripting and interactive tools.

Current data formats often don’t work well for this.  XML and JSON are expressive, but they’re big, and slow to process.  When you’re processing petabytes of data, size and speed matter a lot.
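
To make the size point concrete, here is a rough sketch in Python that encodes the same record as JSON text and as a fixed binary layout.  The record and its field names are made up for illustration, and the binary side uses the standard library's `struct` module, which is not Avro's actual encoding.  It only shows what you save when field names live in a separate schema instead of in every record.

```python
import json
import struct

# Hypothetical point-of-sale record; the field names are illustrative only.
record = {"store_id": 1042, "item_id": 778123, "quantity": 3, "price_cents": 1999}

# Text encoding: every record repeats its field names and spells numbers as digits.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: four little-endian 32-bit integers. The field names are kept
# once, in a schema, rather than in each record.
# Assumption: this is plain struct packing, not Avro's wire format.
SCHEMA = ("store_id", "item_id", "quantity", "price_cents")
binary_bytes = struct.pack("<4i", *(record[name] for name in SCHEMA))

# The schema is all a reader needs to rebuild the original record.
decoded = dict(zip(SCHEMA, struct.unpack("<4i", binary_bytes)))

print(len(json_bytes), len(binary_bytes))
```

The gap is small for one record, but it is paid again on every record, which is what hurts at petabyte scale.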