Intro
In this three part blog series I want to take a look at how we would do a Simple Moving Average with MapReduce and Apache Hadoop. This series is meant to show how to translate a common Excel or R function into MapReduce java code with accompanying working code and data to play with. Most analysts can take a few months of stock data and produce an excel spreadsheet that shows a moving average, but doing this in Hadoop might be a more daunting task. Although time series as a topic is relatively well understood, I wanted to take the approach of using a simple topic to show how it translated into a powerful parallel application that can calculate the simple moving average for a lot of stocks simultaneously with MapReduce and Hadoop. I also want to demonstrate the underlying mechanic of using the “secondary sort” technique with Hadoop’s MapReduce shuffle phase, which we’ll see is applicable to a lot of different application domains such as finance, sensor, and genomic data.
This article should be approachable to the beginner Hadoop programmer who has done a little bit of MapReduce in java and is looking for a slightly more challenging MapReduce application to hack on. In case you’re not very familiar with Hadoop, here’s some background information and CDH. The code in this example is hosted on github and is documented to illustrate how the various components work together to achieve the secondary sort effect. One of the goals of this article is to have this code be relatively basic and approachable by most programmers.
So let’s take a quick look at what time series data is and where it is employed in the quickly emerging world of large-scale data.
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST api and www.wordnik.com. Our corpus grows quickly—up to 8,000 words per second. Performing deep lexical analysis on data at this rate is challenging to say the least.
We had major challenges with three distinct problems:
A common question on the Apache Hadoop mailing lists is what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.
Background
When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, HBase, Pig, Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.
Availability is the proportion of time a system is functioning [1], which is commonly referred to as “uptime” (vs downtime, when the system is not functioning).
Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what’s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our CDH documentation.
We appreciate feedback! Get in touch with us on the CDH user list, twitter or IRC (#cloudera on freenode.net) and let us know how the update is working for you.
This is a guest repost contributed by Matteo Bertozzi, a Developer at Develer S.r.l.
Apache Hadoop’s SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key editing, adding or removing it. This file is append-only.
“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem this blog is for you.
Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce you job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?
Map-Reduce jobs are executed in separate JVMs on TaskTrackers and sometimes you need to use third-party libraries in the map/reduce task attempts. For example, you might want to access HBase from within your map tasks. One way to do this is to package every class used in the submittable JAR. You will have to unpack the original hbase- and repackage all the classes in your submittable Hadoop jar. Not good. Don’t do this: The version compatibility issues are going to bite you sooner or later.
Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC.

Here I demonstrate, with repeatable steps, how to fire-up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Part II
Previously we proposed how we measure the performance in Hadoop MapReduce applications in an effort to better understand the computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
We selected our SIFT-M MapReduce application, described in our presentation at Hadoop World 2010 [3], as the candidate algorithm for Node Scalability since it is embarrassingly parallel and is representative of compute-intensive applications where the bulk of work is computation and not data movement. The Terasort MapReduce benchmark is used for the data scalability tests since it has a greater dependence on the distribution of data than the SIFT algorithm. The Terasort MapReduce benchmark is distributed with the Hadoop codebase. The Yahoo implementation gained notoriety for breaking the terabyte sorting benchmark in 2009 for sorting 100 TB in 173 minutes[4].
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Part I
We were asked by one of our customers to investigate Hadoop MapReduce for solving distributed computing problems. We were particularly interested in how effectively MapReduce applications utilize computing resources. Computing efficiency is important not only for speed-up and scale-out performance but also power consumption. Consider a hypothetical High-Performance Computing (HPC) system of 10,000 nodes running 50% idle at 50 watts per idle node, and assuming 10 cents per kilowatt hour. It would cost $219,000 per year to power just the idle-time. Keeping a large HPC system busy is difficult and requires huge datasets and efficient parallel algorithms. We wanted to analyze Hadoop applications to determine the computing efficiency and gain insight to tuning and optimization of these applications. We installed CDH3 onto a number of different clusters as part of our comparative study. The CDH3 was preferred over the standard Hadoop installation for the recent patches and the support offered by Cloudera. In the first part of this two-part article, we’ll more formally define computing efficiency as it relates to evaluating Hadoop MapReduce applications and describe the performance metrics we gathered for our assessment. The second part will describe our results and conclude with suggestions for improvements and hopefully will instigate further study in Hadoop MapReduce performance analysis.
For distributed computing to be effective there must be sufficient input and parallelism to saturate the compute resources. Saturation in the context of Hadoop MapReduce refers to maintaining the maximum task rate for the duration of the application run. The maximum task rate is defined as the total number of task slots divided by the time to complete a single task. The maximum task rate is therefore the upper-bound on throughput. Our goal is to reach steady-state at peak throughput. A system in steady-state can be modeled using Little’s Law from Queuing Theory. The equation for Little’s Law defines the average throughput for any steady-state system and can be written as,
Fraud has multiple meanings and the term can be easily abused. The definition of fraud has undergone multiple changes throughout the years and is elusive as well as fraud itself. The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state/country. For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage. A more general definition may contain up to 9 elements.
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.
Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding citizens), is intentional (there is no “accidental” fraud), as well as imply a significant damage caused to the defrauded party (otherwise why bother). Fraud detection is difficult from statistical point of view for exactly these reasons: (a) the events are rare and it is difficult to build a predictive model and (b) fraud assumes a real human being behind it and incorporates elements of game theory since the fraudster is often an insider who knows how to game the system.
- Overview
- Downloads
- Learn Hadoop
- Get Support
-
Blog
- Avro (11)
- Careers (10)
- CDH (29)
- Cloudera Manager (10)
- Cloudera's Service And Configuration Manager (6)
- Community (86)
- Connector (6)
- Data Collection (13)
- Distribution (34)
- Flume (6)
- General (237)
- Guest (35)
- Hadoop (146)
- HBase (40)
- HDFS (26)
- Hive (22)
- MapReduce (37)
- Oozie (4)
- Pig (15)
- Sqoop (9)
- Testing (5)
- Training (18)
- Use Case (11)
- Whirr (1)
- ZooKeeper (10)
- Archives by Month
Hadoop was created by 