Developer Center
Cloudera Blog

Rackspace Upgrades to Cloudera’s Distribution for Hadoop
Parallel LZO: Splittable Compression for Hadoop


Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe

So at Digg, we have been working our own Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how can we split our large compressed data and run them in parallel on Hadoop? One of the biggest drawbacks from compression algorithms like Gzip is that you can’t split them into multiple mappers. This is where LZO comes in.

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.

A Great Week for Hadoop: Summit Roundup

On June 10th, more than 750 people from around the world descended on the Santa Clara Marriott to share their love for a little stuffed elephant named Hadoop. It was a good week to be part of this exploding community, and I want to extend Cloudera’s heartfelt thanks to everyone who made it possible, especially our friends at Yahoo! who organized this Summit. Most importantly, I want to thank all of you who were able to participate. I know many of you couldn’t make it to California this time, so I hope to see you at the Hadoop Summit East in October.

For those of you who couldn’t join us, I thought I would post my notes on a few of the highlights.

Hadoop Goes Mainstream:
About 300 developers attended last year’s summit, primarily from web companies and research labs. They were joined by a few forward-thinking venture capitalists. This year’s audience was both larger and different. In addition to the vibrant developer community, there was a flood of users of Hadoop. Though the audience was still dominated by web companies, attendees included traditional enterprise users with applications ranging from finance to biotech. There were technology previews from IBM and Sun. Major companies like Amazon joined our commercial efforts around Hadoop. VCs had also stepped up to sponsor status. Take-away? You ain’t seen nothing yet.

Analyzing Apache logs with Pig

(guest blog post by Dmitriy Ryaboy)

A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.

In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.

The Smart Grid: Hadoop at the Tennessee Valley Authority (TVA)
Introducing Sqoop

In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities: