Developer Center
Cloudera Blog

Hadoop and the Cloudera Data Platform.

Tracking Trends with Hadoop and Hive on EC2


At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe

Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website.  This post will give an overview of how trendingtopics.org was put together and show some basic approaches for finding trends in log data with Hive.  The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.

The trendingtopics Rails application identifies recent trends on the web by periodically launching an EC2 cluster running Cloudera’s Distribution for Hadoop to process Wikipedia log files.  The cluster runs a Hive batch job that analyzes hourly pageview statistics for millions of Wikipedia articles, and then loads the resulting trend parameters into the application’s MySQL database.

accg8khj3q9b_76cvh9b7h9_b

Application Features

Advice on QA Testing Your MapReduce Jobs

As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs–either invoked by developers before they commit a change or by a continuous integration server such as hudson–an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.

MR jobs are particularly difficult to test thoroughly because they run in a distributed environment.  This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.

As is the case with most testing scenarios, there are certain practices one can follow that have a low barrier to entry; such practices might do a fairly sufficient job of testing. There are also practices one can follow that are more complicated but perhaps result in more thorough testing. Let’s walk through some good QA practices, starting with the easiest and ending with the most complicated.

Traditional Unit Tests – JUnit, PyUnit, Etc.

Running the Cloudera Training VM in VirtualBox

Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on Virtual Box, and has written a step-by-step guide for the community. Thanks Thomas! – Christophe

I was quite pleased when I discovered that Cloudera had created a virtual machine image that could be used while working through their training material. It would make the process simpler, and it looked like a potentially useful environment for general Hadoop experimentation. However, their VM is built for VMware, which I stopped using a while back. However, as a heavy VirtualBox user, I knew that it would not be hard to get it running in my preferred desktop virtualization environment.

Here’s a step-by-step guide for getting Cloudera’s virtual machine image up and running. I’ll include screenshots for most of the steps to make it as clear as possible. I’ll assume you already have at least some familiarity with running VirtualBox (if not, there are plenty of good tutorials and references available online) and some experience with Ubuntu or some other fairly modern Linux desktop system.

Hadoop HA Configuration

One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe

Here at ContextWeb, our Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single-point-of-failure issue that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval were of greater importance. We’ve leveraged some existing, well tested components that are available and commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution, which addresses limitations that currently exist.

While one could conceivably expand the use of these two projects to much deeper levels of protection, the goal of this post is to provide a basic working configuration as a starting point for further experimentation and tuning. There may be variations with regards to what works with your distribution or which requirements your organization has for SLAs and HA standards. These instructions are most relevant to CentOS 5.3 combined with Cloudera’s Distribution for Hadoop, since that’s what we run in our production environment.

Hadoop Environment

The Project Split

Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?

The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.

My presentation slides cover which mailing lists to subscribe to, where the source repositories are located, and how to compile and run the development version of Hadoop.

File Appends in HDFS

There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS capability for supporting file appends.

Background

Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.

A file didn’t exist until it had been successfully closed (by calling FSDataOutputStream’s close() method). If the client failed before it closed the file, or if the close() method failed by throwing an exception, then (to other clients at least), it was as if the file had never been written. The only way to recover the file was to rewrite it from the beginning. MapReduce worked well with this behavior, since it would simply rerun the task that had failed from the beginning.

First Steps Toward Append

Hadoop Graphing with Cacti

An important part of making sure Hadoop works well for all users is developing and maintaining strong relationships with the folks who run Hadoop day in and day out. Edward Capriolo keeps About.com’s Hadoop cluster happy, and we frequently chew the fat with Ed on issues ranging from administrative best practices to monitoring. Ed’s been an invaluable resource as we beta test our distribution and chase down bugs before our official releases. Today’s article looks at some of Ed’s tricks for monitoring Hadoop with Cacti through JMX. -Christophe

You may have already read Philip’s Hadoop Metrics post, which provides a general overview of the Hadoop Metrics system. Here, we’ll examine Hadoop monitoring with Cacti through JMX.

What is Cacti?

Cacti is an RRD front end. You can learn more about it on the Cacti website.

Debugging MapReduce Programs With MRUnit

The distributed nature of MapReduce programs makes debugging a challenge. Attaching a debugger to a remote process is cumbersome, and the lack of a single console makes it difficult to inspect what is occurring when several distributed copies of a mapper or reducer are running concurrently. Furthermore, operations that work on small amounts of input (e.g., saving the inputs to a reducer in an array) fail when running at scale, causing out-of-memory exceptions or other unintended effects.

A full discussion of how to debug MapReduce programs is beyond the scope of a single blog post, but I’d like to introduce you to a tool we designed at Cloudera to assist you with MapReduce debugging: MRUnit.

MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.