Developer Center
Cloudera Blog · MapReduce Posts
Apache MRUnit Is Now A Top Level Project

This posted was originally posted to the Apache Software Foundation MRUnit blog.

The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they're deployed to a production system.

In its monthly meeting in May of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache MRUnit, thus graduating it from the Incubator. This is a significant milestone in the life of MRUnit, which has come a long way since its inception as a Hadoop Contrib project in HADOOP-5518 contributed by Aaron Kimball.

MapReduce 2.0 in Hadoop 0.23

In Building and Deploying MR2 we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design. 

Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.

MapReduce 2.0 (a.k.a. MRv2 or YARN):

The new architecture divides the two major functions of the JobTracker – resource management and job life-cycle management – into separate components:

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Building and Deploying MR2

A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Hadoop 0.23.

A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the applications individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).

In MR2, the JobTracker aggregated functionality is separated across two new components:

  1. Central Resource Manager (RM): Management of resources in the cluster.
  2. Application Master (AM): Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.
Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Pig and Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.

As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.

Today, we’re pleased to introduce Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.

Example

CDH3 Update 1 Released

Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution.  For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.

For those of you who are recent Cloudera users, here is a refresh on our update policy:

Migrating from Elastic MapReduce to a Cloudera’s Distribution including Apache Hadoop Cluster

This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion‘s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.


This is the first of a two part series about moving away from Amazon’s EMR service to an in-house Hadoop cluster.

When we first started using Hadoop, we went down the path of Amazon’s EMR service.  We had limited operational resources and wanted to get up and running quickly.  After a while, we starting hitting the limitations of EMR and had to migrate towards managing our own cluster.  In doing so we did not want to lose the features of EMR we found useful – mainly the ease of cluster setup.

MapIncrease

Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.

You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.

It will stop now.

Simple Moving Average, Secondary Sort, and MapReduce (Part 2)

This is the second post of a three part blog series. If you would like to read “Part 1,” please follow this link. In this post we will be reviewing a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop as to establish a common ground with the reader from which we can move forward.

A Quick Primer on Simple Moving Average in Excel

Let’s take a second to do a quick review of how we define simple moving average in an Excel spreadsheet. We’ll need to start with some simple source data, so let’s download a source csv file from github and save it locally. This file contains a synthetic 33 row sample of Yahoo NYSE stock data that we’ll use for the series of examples. Import the csv data into Excel. From there, scan to the date “3/5/2008” and move to the cell to the right of the “ad close” column. Enter the formula

=AVERAGE( [column-range] )