Cloudera Blog · Community Posts

Flume community update: September 2010

The past month has been exciting and productive for the community using and developing Cloudera’s Flume! This young system is a core part of Cloudera’s Distribution for Hadoop (CDH) and is responsible for streaming data ingest. There has been a great influx of interest and many contributions, and in this post we provide a quick summary of this month’s new developments. First, we’re happy to announce the availability of Flume v0.9.1, and we describe some of its updates. Second, we’ll talk about some of the exciting new integration features coming down the pipeline. Finally, we briefly mention some community growth statistics, as well as some recent and upcoming talks about Flume.

Flume v0.9.1

Flume v0.9.1 is now available in both tarball and packaged forms. This version resolves 63 issues and contains several key improvements and bug fixes. Much of this release is focused on improving the stability of Flume’s internals to help users quickly get Flume up and running and to help developers build extensions to Flume.

HBase User Group #9: HBase and HDFS

Cloudera’s Hadoop Training Programs Expand Internationally

It’s been over a year now since we started offering Hadoop training in the Bay Area. Since then, we’ve put many of our introductory materials online (for free), and we now offer in-person public classes in cities around the US (click here for a full list of sessions). The response has been incredible, but one thing is painfully obvious: we’re not doing enough to meet the needs of the growing worldwide Hadoop community.

To that end, we’ve invested in translating our materials into new languages and in thinking about how to scale our training programs internationally.

As a first step, we’ll offer our three-day developer training session outside the US later this spring. We’ll announce cities and dates in the EU soon, but we’re happy to announce our first two sessions in Asia now:

HBase Available in CDH2

One of the more common requests we receive from the community is to package HBase with Cloudera’s Distribution for Hadoop. Lately, I’ve been doing a lot of work on making Cloudera’s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We’re pretty excited about this, and we’re looking forward to your feedback. A big thanks to Andrew Purtell, a Senior Architect at TrendMicro and HBase Contributor, for leading this packaging project and providing this guest blog post. -Chad Metcalf

What is HBase?
HBase is an open-source, distributed, column-oriented store modeled after Bigtable, Google’s large-scale structured data storage system. You can read Google’s Bigtable paper here.

“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from back end bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.”

Grouping Related Trends with Hadoop and Hive

(guest blog post by Pete Skomoroch)

In a previous post, I outlined how to build a basic trend tracking site called trendingtopics.org with Cloudera’s Distribution for Hadoop and Hive.  TrendingTopics uses Hadoop to identify the top articles trending on Wikipedia and displays related news stories and charts.  The data powering the site was pulled from an Amazon EBS Wikipedia Public Dataset containing 8 months of hourly pageview logfiles.  In addition to the pageview logs, the EBS data volume also includes the full text content and link graph for all articles.  This post will use that link graph data to build a new feature for our site: grouping related articles together under a single “lead trend” to ensure the homepage isn’t dominated by a single news story.

Finding Related Trends Using Wikipedia Link Graph Data
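
The grouping idea can be pictured with a minimal Python sketch. This is not the code behind trendingtopics.org: the function name `group_related_trends`, the greedy strategy, and the sample titles are all assumptions made for illustration. It assumes the trending articles are already ranked and that the link graph is available as a dictionary mapping each title to the set of titles it links to.

```python
def group_related_trends(ranked_articles, links):
    """Group trending articles under "lead trends" (hypothetical helper).

    ranked_articles: titles ordered from most to least trending.
    links: dict mapping a title to the set of titles it links to.
    Returns a dict of lead title -> list of related titles.
    """
    groups = {}  # dicts keep insertion order, so leads stay in rank order
    for title in ranked_articles:
        for lead in groups:
            # Assumption: a link in either direction counts as "related".
            links_to_lead = lead in links.get(title, set())
            linked_from_lead = title in links.get(lead, set())
            if links_to_lead or linked_from_lead:
                groups[lead].append(title)
                break
        else:
            # No existing lead is linked to this article: it becomes a lead.
            groups[title] = []
    return groups


# Made-up titles, purely for illustration.
ranked = ["Story A", "Person in Story A", "Story B", "Place in Story A"]
links = {
    "Person in Story A": {"Story A"},
    "Story A": {"Place in Story A"},
}
grouped = group_related_trends(ranked, links)
```

With this sample input, the two articles linked to "Story A" are folded under it, so "Story B" still gets its own slot instead of being crowded out by a single news story.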

CDH2: Cloudera’s Distribution for Hadoop 2

In March of this year, we released our distribution for Hadoop. Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time: 0.18.3. We packaged up Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier. For the first time ever, Hadoop cluster managers were able to bring up a deployment by running one of the following commands, depending on their Linux distribution:

# yum install hadoop
# apt-get install hadoop

As proof of this, our easy-to-use Hadoop Amazon Machine Images (AMIs) use these commands at boot to install the latest release of CDH1 whenever a Hadoop cluster is launched on EC2.

Hadoop World: NYC 2009: Speakers Announced

It’s been a crazy few weeks here at Cloudera, and while there is no sign of things letting up before Hadoop World: NYC 2009 on October 2nd, we wanted to take a minute to share the latest details about the speakers, and to say thanks to our sponsors who have recently come on board.

We’re absolutely thrilled to have such a wide variety of organizations sharing their experiences with Apache Hadoop. In addition to a deeply technical track headlined by Cloudera, Yahoo! and Facebook, those new to Hadoop will appreciate an entire track focused on applications. There is also a track for extensions to highlight cool projects like HadoopDB, a mashup of Hadoop and a relational database, from Yale University.

We’re excited about every talk on the schedule, but we wanted to call out just a few that highlight how Hadoop is being used by more traditional enterprise users beyond the web space.

Hadoop World: NYC 2009

To say we were surprised by the quality and quantity of submissions we received for Hadoop World: NYC 2009 would be an understatement. We were amazed at how many “normal” companies have come to use Hadoop for everything ranging from business intelligence to protein alignment. It’s truly exciting to see how a system originally designed to process and index the web has evolved to support the data-driven workloads of so many industries.

It’s with great joy that we invite you to come learn about what the following companies have done with Hadoop: About.com, Booz Allen Hamilton, China Mobile, ContextWeb, eBay, Facebook, IBM, Intel, JPMC, Microsoft, The New York Times, NexR, Rackspace, Vertica, Visa, Visible Measures, Yale, and Yahoo!

If you have ever wondered what Hadoop might be able to do for you, this is your chance to learn both from leaders in the web space and from those within your own industry.

Tracking Trends with Hadoop and Hive on EC2


At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe

Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website. This post gives an overview of how trendingtopics.org was put together and shows some basic approaches for finding trends in log data with Hive. The source code for trendingtopics is available on GitHub, and a tutorial on the Cloudera site describes many of the data processing steps in greater detail.

The trendingtopics Rails application identifies recent trends on the web by periodically launching an EC2 cluster running Cloudera’s Distribution for Hadoop to process Wikipedia log files.  The cluster runs a Hive batch job that analyzes hourly pageview statistics for millions of Wikipedia articles, and then loads the resulting trend parameters into the application’s MySQL database.
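
As a rough illustration of what finding trends in hourly pageview counts can look like, here is a small Python stand-in. It is not the Hive job used by trendingtopics, and the excerpt does not say which trend parameters that job computes. The scoring rule (the latest window's total minus the previous window's total), the function names, and the sample data are all assumptions.

```python
def trend_score(hourly_views, window=24):
    """Change in pageviews between the latest window and the one before it."""
    if len(hourly_views) < 2 * window:
        raise ValueError("need at least two full windows of hourly counts")
    recent = sum(hourly_views[-window:])
    previous = sum(hourly_views[-2 * window:-window])
    return recent - previous


def top_trending(pageviews, window=24, limit=10):
    """Return up to `limit` titles, highest trend score first."""
    scores = {title: trend_score(views, window)
              for title, views in pageviews.items()}
    return sorted(scores, key=scores.get, reverse=True)[:limit]


# Made-up hourly counts: "Article X" jumps in the last day, "Article Y" is flat.
pageviews = {
    "Article X": [10] * 24 + [50] * 24,
    "Article Y": [100] * 48,
}
leaders = top_trending(pageviews, limit=1)
```

Under this rule, a heavily visited but flat article such as "Article Y" scores zero, while a smaller article with a sudden jump ranks first. That is the general behavior a trend tracker needs, whatever exact formula it uses.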

The Project Split

Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?

The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.

My presentation slides cover which mailing lists to subscribe to, where the source repositories are located, and how to compile and run the development version of Hadoop.