<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; MapReduce</title>
	<atom:link href="http://www.cloudera.com/blog/category/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Apache MRUnit Is Now A Top Level Project</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/#comments</comments>
		<pubDate>Thu, 24 May 2012 17:00:24 +0000</pubDate>
		<dc:creator>Brock Noland</dc:creator>
				<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Test MapReduce]]></category>
		<category><![CDATA[Unit Testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=15124</guid>
		<description><![CDATA[This post was originally published on the Apache Software Foundation MRUnit blog. The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was originally published on the <a href="https://blogs.apache.org/mrunit/entry/apache_mrunit_is_now_a" target="_blank">Apache Software Foundation MRUnit blog</a>.</em></p>
<p>The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they&#39;re deployed to a production system.</p>
<p>In its monthly meeting in May 2012, the board of the Apache Software Foundation (ASF) resolved to grant Top-Level Project status to Apache MRUnit, thus graduating it from the Incubator. This is a significant milestone in the life of MRUnit, which has come a long way since its inception as a Hadoop Contrib project in <a href="https://issues.apache.org/jira/browse/HADOOP-5518" target="_blank">HADOOP-5518</a>, contributed by Aaron Kimball.</p>
<ul style="padding-left:12px">
<li>May 2012 MRUnit graduates from the Incubator to become a TLP</li>
<li>May 2012 Version 0.9.0-incubating released.</li>
<li>April 2012 Dave Beech added as a new committer.</li>
<li>April 2012 Jarek Jarcec Cecho added as a new committer.</li>
<li>April 2012 New website created using the CMS.</li>
<li>March 2012 Version 0.8.1-incubating released.</li>
<li>March 2012 Jim Donofrio added as a new committer.</li>
<li>February 2012 Version 0.8.0-incubating released.</li>
<li>November 2011 Version 0.5.0-incubating released.</li>
<li>October 2011 Brock Noland added as a new committer.</li>
<li>March 2011 Project enters incubation.</li>
<li>April 2009 Doug Cutting commits Aaron&#39;s patch to Hadoop</li>
<li>March 2009 Aaron Kimball contributes MRUnit to Hadoop as a contrib project</li>
</ul>
<p>Below is the graduation resolution:</p>
<pre class="code">X. Establish the Apache MRUnit Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation&#39;s purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to unit testing Apache Hadoop map
reduce jobs for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the &quot;Apache MRUnit Project&quot;,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache MRUnit Project be and hereby is
responsible for the creation and maintenance of software
related to unit testing Apache Hadoop map reduce jobs;
and be it further

RESOLVED, that the office of &quot;Vice President, Apache MRUnit&quot; be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache MRUnit Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache MRUnit Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache MRUnit Project:

* Brock Noland - brock@apache.org
* Patrick Hunt - phunt@apache.org
* Nigel Daley - nigel@apache.org
* Eric Sammer - esammer@apache.org
* Aaron Kimball - kimballa@apache.org
* Konstantin Boudnik - cos@apache.org
* Garrett Wu - gwu@apache.org
* Jim Donofrio - jdonofrio@apache.org
* Jarek Jarcec Cecho - jarcec@apache.org
* Dave Beech - dbeech@apache.org

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Brock Noland
be appointed to the office of Vice President, Apache MRUnit, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed; and be it further

RESOLVED, that the initial Apache MRUnit PMC be and hereby is
tasked with the creation of a set of bylaws intended to
encourage open development and increased participation in the
Apache MRUnit Project; and be it further

RESOLVED, that the Apache MRUnit Project be and hereby
is tasked with the migration and rationalization of the Apache
Incubator MRUnit podling; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Incubator MRUnit podling encumbered upon the Apache Incubator
Project are hereafter discharged.
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MapReduce 2.0 in Hadoop 0.23</title>
		<link>http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/</link>
		<comments>http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/#comments</comments>
		<pubDate>Fri, 24 Feb 2012 13:00:30 +0000</pubDate>
		<dc:creator>Ahmed Radwan</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop mapreduce]]></category>
		<category><![CDATA[mapreduce 2]]></category>
		<category><![CDATA[mr2]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10999</guid>
		<description><![CDATA[In Building and Deploying MR2 we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design.  Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce [...]]]></description>
			<content:encoded><![CDATA[<p>In <em><a title="Building and Deploying MR2 Blog Post" href="http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/">Building and Deploying MR2</a></em> we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design. </p>
<p>Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.</p>
<h2>MapReduce 2.0 (a.k.a. MRv2 or YARN):</h2>
<p>The new architecture divides the two major functions of the JobTracker &#8211; resource management and job life-cycle management &#8211; into separate components:</p>
<ul>
<li>A <strong>ResourceManager (RM)</strong> that manages the global assignment of compute resources to applications.</li>
<li>A per-application <strong>ApplicationMaster (AM)</strong> that manages the application’s life cycle.</li>
</ul>
<p>In Hadoop 0.23, a MapReduce application is a single job in the sense of classic MapReduce, executed by the MapReduce ApplicationMaster.</p>
<p>There is also a per-machine <strong>NodeManager (NM)</strong> that manages the user processes on that machine. The RM and the NM form the computation fabric of the cluster. The design also allows plugging long-running auxiliary services to the NM; these are application-specific services, specified as part of the configuration, and loaded by the NM during startup. For MapReduce applications on YARN, shuffle is a typical auxiliary service loaded by the NMs. Note that, in Hadoop versions prior to 0.23, shuffle was part of the TaskTracker.  </p>
<p>The per-application ApplicationMaster is a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; the design permits building and deploying distributed applications using other frameworks. For example, Hadoop 0.23 ships with a Distributed Shell application that permits running a shell script on multiple nodes in the YARN cluster. At the time of writing this blog post, there is also an ongoing development effort to allow running Message Passing Interface (MPI) applications on top of YARN.</p>
<h2>MapReduce 2.0 Design:</h2>
<p>Figure 1 shows a pictorial representation of a YARN cluster. There is a single Resource Manager, which has two main services:</p>
<ul>
<li>A pluggable <strong>Scheduler</strong>, which manages and enforces the resource scheduling policy in the cluster.  Note that, at the time of writing this blog post, there are two schedulers supported in Hadoop 0.23, the default <em>FIFO</em> scheduler and the <em>Capacity</em> scheduler; the <em>Fair</em> Scheduler is not yet supported.</li>
<li>An <strong>Applications Manager (AsM)</strong>, which manages running Application Masters in the cluster, i.e., it is responsible for starting application masters and for monitoring and restarting them on different nodes in case of failures.</li>
</ul>
<p> <img src="http://www.cloudera.com/wp-content/uploads/2012/02/1.png" alt="" width="900" height="732" /></p>
<p align="center">Fig. 1</p>
<p>Figure 1 also shows an NM service running on each node in the cluster, along with two AMs (AM<sub>1</sub> and AM<sub>2</sub>). At any given time, a YARN cluster has as many running Application Masters as there are applications (jobs). Each AM manages the application&#8217;s individual tasks (starting, monitoring and restarting in case of failures). The diagram shows AM<sub>1</sub> managing three tasks (containers 1.1, 1.2 and 1.3), while AM<sub>2</sub> manages four tasks (containers 2.1, 2.2, 2.3 and 2.4). Each task runs within a Container on one of the nodes. The AM acquires these containers from the RM’s Scheduler before contacting the corresponding NMs to start the application’s individual tasks. These Containers can be roughly compared to Map/Reduce slots in previous Hadoop versions. However, the resource allocation model in Hadoop 0.23 uses cluster resources more efficiently.</p>
<h2>Resource Allocation Model:</h2>
<p>In earlier Hadoop versions, each node in the cluster was statically assigned the capability of running a predefined number of Map slots and a predefined number of Reduce slots. The slots could not be shared between Maps and Reduces. This static allocation of slots wasn’t optimal since slot requirements vary during the MR job life cycle (typically, there is a demand for Map slots when the job starts, as opposed to the need for Reduce slots towards the end). Practically, in a real cluster, where jobs are randomly submitted and each has its own Map/Reduce slots requirement, having an optimal utilization of the cluster was hard, if not impossible.</p>
<p>The resource allocation model in Hadoop 0.23 addresses this deficiency with a more flexible resource model. Resources are requested in the form of containers, where each container has a number of non-static attributes. At the time of writing, the only supported attribute is memory (RAM). However, the model is generic and there is intention to add more attributes in future releases (e.g. CPU and network bandwidth). In this new resource management model, only a minimum and a maximum for each attribute are defined, and AMs can request containers with attribute values that are multiples of the minimum.</p>
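<p>As a rough illustration of the "multiples of the minimum" rule, the sketch below rounds a requested amount of memory up to the next multiple of the cluster minimum and caps it at the maximum. The function name and the 1024 MB / 8192 MB limits used in the example are assumptions made for illustration; they are not values or APIs taken from Hadoop 0.23.</p>

```python
def normalize_memory_request(requested_mb: int, minimum_mb: int, maximum_mb: int) -> int:
    """Round a memory request up to a multiple of the minimum, capped at the maximum.

    Hypothetical helper for illustration only; not part of any Hadoop API.
    Assumes the maximum is itself a multiple of the minimum.
    """
    if minimum_mb > maximum_mb:
        raise ValueError("minimum_mb must not exceed maximum_mb")
    # Integer ceiling division: how many minimum-sized units cover the request.
    units = (requested_mb + minimum_mb - 1) // minimum_mb
    # Every container gets at least one unit.
    units = max(1, units)
    return min(units * minimum_mb, maximum_mb)


# Example with assumed limits of 1024 MB (minimum) and 8192 MB (maximum).
print(normalize_memory_request(1500, 1024, 8192))
```

<p>With these assumed limits, a request for 1500 MB would be served by a 2048 MB container, which is the kind of granularity that fixed Map and Reduce slots could not express.</p>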
<h2>MapReduce 2.0 Main Components:</h2>
<p>In this section, we’ll go through the main components of the new MapReduce architecture in detail to understand the functionality of these components and how they interact with each other.</p>
<ul>
<li><strong>Client – Resource Manager</strong></li>
</ul>
<p>Figure 2 illustrates the initial step for running an application on a YARN cluster. Typically, a client communicates with the RM (specifically the Applications Manager component of the RM) to initiate this process. In the first step, marked (1) in the diagram, the client notifies the Applications Manager that it wants to submit an application; this is done via a “New Application Request”. The RM response, marked (2), will typically contain a newly generated unique application ID, in addition to information about cluster resource capabilities that the client will need in requesting resources for running the application’s AM.</p>
<p>Using the information received from the RM, the client can construct and submit an “Application Submission Context”, marked (3), which typically contains information like scheduler queue, priority and user information, in addition to information needed by the RM to be able to launch the AM. This information is contained in a “Container Launch Context”, which contains the application’s jar, job files, security tokens and any resource requirements.</p>
<p><img src="http://www.cloudera.com/wp-content/uploads/2012/02/2.png" alt="" width="901" height="876" /><br clear="ALL" /></p>
<p align="center">Fig. 2</p>
<p>Following application submission, the client can query the RM for application reports, receive such reports and, if needed, ask the RM to kill the application. These three additional steps are depicted in Fig. 3.</p>
<p align="center"> <img src="http://www.cloudera.com/wp-content/uploads/2012/02/3.png" alt="" width="878" height="728" /></p>
<p align="center">Fig. 3</p>
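<p>The client-side exchange described above can be mimicked with a small in-memory model. This is a toy sketch, not the ClientRMProtocol API: the class <code>ToyResourceManager</code>, its method names, the dictionary fields and the state strings are all assumptions invented for the example.</p>

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """In-memory stand-in for the RM's Applications Manager (hypothetical, not a Hadoop class)."""
    min_memory_mb: int = 1024
    max_memory_mb: int = 8192
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))
    _apps: dict = field(default_factory=dict)

    def new_application(self) -> dict:
        # Steps (1) and (2): return a unique application ID plus cluster resource limits.
        return {
            "app_id": f"application_{next(self._ids):04d}",
            "min_memory_mb": self.min_memory_mb,
            "max_memory_mb": self.max_memory_mb,
        }

    def submit_application(self, context: dict) -> None:
        # Step (3): record the submission context; a real RM would now launch the AM.
        self._apps[context["app_id"]] = {"context": context, "state": "SUBMITTED"}

    def get_application_report(self, app_id: str) -> dict:
        app = self._apps[app_id]
        return {"app_id": app_id, "state": app["state"], "queue": app["context"]["queue"]}

    def kill_application(self, app_id: str) -> None:
        self._apps[app_id]["state"] = "KILLED"


def submit_then_kill(rm: ToyResourceManager) -> tuple:
    """Walk through the client steps and return the state before and after the kill."""
    response = rm.new_application()
    context = {
        "app_id": response["app_id"],
        "queue": "default",  # assumed queue name
        "priority": 0,
        "user": "example-user",
        "am_memory_mb": response["min_memory_mb"],
    }
    rm.submit_application(context)
    before = rm.get_application_report(response["app_id"])["state"]
    rm.kill_application(response["app_id"])
    after = rm.get_application_report(response["app_id"])["state"]
    return before, after
```

<p>The point of the model is the ordering: the client cannot build its submission context until it has received the application ID and the resource limits from the RM.</p>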
<ul>
<li><strong>Resource Manager – Application Master</strong></li>
</ul>
<p>When the RM receives the application submission context from the client, it finds an available container meeting the resource requirements for running the AM, and it contacts the NM for the container to start the AM process on this node. Figure 4 depicts the following communication steps between the AM and the RM (specifically the Scheduler component of the RM). The first step, marked (1) in the diagram, is for the AM to register itself with the RM. This step consists of a handshaking procedure and also conveys information like the RPC port that the AM will be listening on, the tracking URL for monitoring the application’s status and progress, etc.</p>
<p>The RM registration response, marked (2), conveys essential information for the AM, such as the minimum and maximum resource capabilities for this cluster. The AM uses this information to calculate the resource requests for the application’s individual tasks. The resource allocation request from the AM to the RM, marked (3), mainly contains a list of requested containers, and may also contain a list of containers released by this AM. Heartbeat and progress information are also relayed through resource allocation requests, as shown by arrow (4).</p>
<p>When the Scheduler component of the RM receives a resource allocation request, it computes, based on the scheduling policy, a list of containers that satisfy the request and sends back an allocation response, marked (5), which contains a list of allocated resources. Using the resource list, the AM starts contacting the associated node managers (as will be soon seen), and finally, as depicted by arrow (6), when the job finishes, the AM sends a Finish Application message to the Resource Manager and exits.</p>
<p align="center"><img src="http://www.cloudera.com/wp-content/uploads/2012/02/4.png" alt="" width="901" height="617" /><br clear="ALL" /> Fig. 4</p>
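<p>The register, allocate and finish cycle between the AM and the Scheduler can be sketched in the same spirit. Again, this is a toy model and not the AMRMProtocol API: <code>ToyScheduler</code>, its method names and the first-fit placement rule are assumptions made for the example, and first-fit is not the policy of the FIFO or Capacity schedulers.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for the RM Scheduler that tracks free memory per node (hypothetical)."""
    free_mb_by_node: dict
    min_memory_mb: int = 1024
    max_memory_mb: int = 8192
    registered: set = field(default_factory=set)

    def register(self, am_id: str, rpc_port: int, tracking_url: str) -> dict:
        # Steps (1) and (2): remember the AM and return the cluster's resource limits.
        self.registered.add(am_id)
        return {"min_memory_mb": self.min_memory_mb, "max_memory_mb": self.max_memory_mb}

    def allocate(self, am_id: str, asks_mb: list, released: list) -> list:
        # Steps (3) to (5): take back released containers, then grant each ask
        # on the first node with enough free memory (first-fit, an assumption).
        for node, memory_mb in released:
            self.free_mb_by_node[node] += memory_mb
        granted = []
        for ask in asks_mb:
            for node, free in self.free_mb_by_node.items():
                if free >= ask:
                    self.free_mb_by_node[node] = free - ask
                    granted.append((node, ask))
                    break
        return granted

    def finish(self, am_id: str) -> None:
        # Step (6): the AM reports that the application is done.
        self.registered.discard(am_id)
```

<p>An ask that cannot be placed is simply left out of the response in this sketch, so the AM would have to ask again in a later allocation request, for example after releasing containers it no longer needs.</p>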
<ul>
<li><strong>Application Master – Container Manager</strong></li>
</ul>
<p>Figure 5 describes the communication between the AM and the Node Managers. For each container, the AM asks the hosting NM to start it, as depicted by arrow (1) in the diagram. While containers are running, the AM can request and receive a container status report, as shown in steps (2) and (3), respectively.</p>
<p align="center"><img src="http://www.cloudera.com/wp-content/uploads/2012/02/5.png" alt="" width="900" height="434" /></p>
<p align="center">Fig. 5</p>
<p>Based on the above discussion, a developer writing YARN applications will be mainly concerned with the following interfaces:</p>
<ul>
<li><strong><a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/yarn/api/ClientRMProtocol.html">ClientRMProtocol</a></strong>: Client – RM (Fig. 3).<br /> This is the protocol for a client to communicate with the RM to launch a new application (i.e. an AM), check on the status of the application or kill the application.</li>
<li><strong><a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/yarn/api/AMRMProtocol.html">AMRMProtocol</a></strong>: AM – RM (Fig. 4).<br /> This is the protocol used by the AM to register/unregister itself with the RM, as well as to request resources from the RM Scheduler to run its tasks.</li>
<li><strong><a href="http://hadoop.apache.org/common/docs/r0.23.0/api/org/apache/hadoop/yarn/api/ContainerManager.html">ContainerManager</a></strong>: AM – NM (Fig. 5).<br /> This is the protocol used by the AM to communicate with the NM to start or stop containers and to get status updates on its containers.</li>
</ul>
<h2>Migrating older MapReduce applications to run on Hadoop 0.23:</h2>
<p>All client-facing MapReduce interfaces are unchanged, which means that there is no need to make any source code changes to run on top of Hadoop 0.23.</p>
<h2>Useful links:</h2>
<ul>
<li><a href="http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">How to write a YARN Application</a>.</li>
<li><a href="http://hadoop.apache.org/common/docs/r0.23.0/api/index.html">Hadoop 0.23.0 Javadocs</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events</title>
		<link>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/</link>
		<comments>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 17:54:58 +0000</pubDate>
		<dc:creator>Josh Wills</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[Use Case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9314</guid>
		<description><![CDATA[Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson&#160;presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools &#8211;&#160;R, Gephi, and Cloudera&#8217;s Distribution Including Apache Hadoop [...]]]></description>
			<content:encoded><![CDATA[<p>Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson&#160;<a href="http://www.informationweek.com/video/1227036510001" target="_blank">presented</a> some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools &#8211;&#160;<a href="http://www.r-project.org/">R</a>, <a href="http://gephi.org/">Gephi</a>, and <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution Including Apache Hadoop</a> (CDH) &#8211; to solve an old problem in a new way.</p>
<h1 style="text-align: left">Background: Adverse Drug Events</h1>
<p style="text-align: left">An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with <a href="http://www.ahrq.gov/qual/aderia/aderia.htm#14">one study</a> estimating that 9.7% of adverse drug events lead to permanent disability. <a href="http://www.ahrq.gov/qual/aderia/aderia.htm#1">Another study</a> showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.</p>
<p style="text-align: left">Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together lead to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly <a href="http://jama.ama-assn.org/content/300/24/2867.short">4% of adults older than 55 are at risk for a major drug interaction</a>.</p>
<p style="text-align: left">Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the <a href="http://www.fda.gov/">FDA</a> uses the <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm">Adverse Event Reporting System</a> (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. &#160;The FDA makes a well-formatted <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm" target="_blank">sample of the reports</a> available&#160;for download from their website, to the benefit of data scientists everywhere.</p>
<h1 style="text-align: left">Methodology</h1>
<p style="text-align: left">Identifying ADEs is primarily a <a href="http://www.quora.com/What-are-some-good-resources-for-learning-about-signal-estimation-and-detection" target="_blank">signal detection problem</a>: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a <a href="http://en.wikipedia.org/wiki/Contingency_table" target="_blank">2&#215;2 contingency table</a>:</p>
<table style="text-align: center" border="0" cellspacing="0" cellpadding="0" width="681">
<col span="4" width="170"></col>
<tbody>
<tr>
<td width="170" height="42">
<p style="text-align: center">For All Drugs/Reactions:</p>
</td>
<td width="170">
<p>Reaction = R<sub>j</sub></p>
</td>
<td width="170">
<p>Reaction != R<sub>j</sub></p>
</td>
<td width="170">
<p>Total</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p style="text-align: center">Drug = D<sub>i</sub></p>
</td>
<td width="170">
<p>A</p>
</td>
<td width="170">
<p>B</p>
</td>
<td width="170">
<p>A + B</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p style="text-align: center">Drug != D<sub>i</sub></p>
</td>
<td width="170">
<p>C</p>
</td>
<td width="170">
<p>D</p>
</td>
<td width="170">
<p>C + D</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p>Total</p>
</td>
<td width="170">
<p>A + C</p>
</td>
<td width="170">
<p>B + D</p>
</td>
<td width="170">
<p style="text-align: center">A + B + C + D</p>
</td>
</tr>
</tbody>
</table>
<p style="text-align: center">&#160;</p>
<p style="text-align: left">Based on the values in the cells of the table, we can compute various <a href="http://www.ncbi.nlm.nih.gov/pubmed/11998548" target="_blank">measures of disproportionality</a> to find drug-reaction pairs that occur more frequently than we would expect if they were independent.</p>
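<p>To make the idea concrete, here is a minimal sketch that computes three commonly used measures from the cells A, B, C and D of the table: the relative reporting ratio (observed count divided by the count expected under independence), the proportional reporting ratio and the reporting odds ratio. The function name is hypothetical and the counts in the example are made up for illustration; these simple ratios are not the MGPS estimator used in the project.</p>

```python
def disproportionality(a: int, b: int, c: int, d: int) -> dict:
    """Compute simple disproportionality measures from a 2x2 contingency table.

    a: reports with the drug and the reaction
    b: reports with the drug but not the reaction
    c: reports with the reaction but not the drug
    d: reports with neither
    Assumes every cell is positive so that no division by zero occurs.
    """
    n = a + b + c + d
    # Count expected in cell A if drug and reaction were independent.
    expected = (a + b) * (a + c) / n
    return {
        "expected": expected,
        "rrr": a / expected,                   # relative reporting ratio
        "prr": (a / (a + b)) / (c / (c + d)),  # proportional reporting ratio
        "ror": (a * d) / (b * c),              # reporting odds ratio
    }


# Made-up counts, for illustration only.
print(disproportionality(a=20, b=80, c=100, d=9800))
```

<p>Values well above 1 indicate that the pair is reported together more often than independence would predict. With only a handful of observations these ratios become unstable, which is the weakness that shrinkage estimators are designed to address.</p>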
<p style="text-align: left">For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper, &#8220;<a href="http://dl.acm.org/citation.cfm?id=502526" target="_blank">Empirical bayes screening for multi-item associations</a>&#8221; by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses and <a href="http://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=2&amp;ved=0CEkQFjAB&amp;url=http%3A%2F%2Fonlinepubs.trb.org%2Fonlinepubs%2FUA%2F111510DuMouchel.pdf&amp;ei=8Wu3TpqTAqHliAKsl8Vj&amp;usg=AFQjCNGX7xhIX1hv_cs2_OMjlAvzkqmkkg" target="_blank">analyzing defects in automobiles</a>.</p>
<h1>Solving the Hard Problem with Apache Hadoop</h1>
<p>At first glance, it doesn&#8217;t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a <a href="http://en.wikipedia.org/wiki/Combinatorial_explosion" target="_blank">combinatorial explosion</a> in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 <em>trillion </em>potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative <a href="http://www.seanborman.com/publications/EM_algorithm.pdf" target="_blank">Expectation Maximization algorithm</a> that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.</p>
<p>The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.</p>
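<p>The counting step has the shape of a classic map and reduce: each report emits one key per drug-drug-reaction triple, and the counts are summed per key. The single-process sketch below only illustrates that shape; the function names and the placeholder drug and reaction labels are assumptions, and it is not the pipeline used for the analysis.</p>

```python
from collections import Counter
from itertools import combinations


def map_report(drugs, reactions):
    """Map step: emit ((drug_a, drug_b, reaction), 1) for every triple in one report."""
    # Sorting makes (D1, D2) and (D2, D1) produce the same key.
    for drug_a, drug_b in combinations(sorted(set(drugs)), 2):
        for reaction in sorted(set(reactions)):
            yield (drug_a, drug_b, reaction), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def count_triples(reports):
    """Run the map step over every report and reduce the results."""
    emitted = (pair for drugs, reactions in reports for pair in map_report(drugs, reactions))
    return reduce_counts(emitted)


# Placeholder reports: (drugs taken, reactions observed).
reports = [
    (["D1", "D2"], ["R1"]),
    (["D2", "D1", "D3"], ["R1", "R2"]),
]
counts = count_triples(reports)
print(counts[("D1", "D2", "R1")])
```

<p>Because each report is mapped independently and the sums for different keys do not interact, both steps can be spread across machines, which is what makes the counting problem a natural fit for MapReduce.</p>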
<h1>Visualizing the Results</h1>
<p>The output of our analysis is a collection of drug-drug-reaction triples that have very large disproportionality scores. But as we all know, <a href="http://xkcd.com/552/" target="_blank">correlation is not causation</a>. The output of our analysis provides us with useful information that should be filtered and evaluated by domain experts and used as the basis for further study using controlled experiments.</p>
<p>With that caveat in mind, our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol&#160;was correlated with memory impairment, and <a href="http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000604/" target="_blank">haloperidol</a> in conjunction with <a href="http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000560/" target="_blank">lorazepam</a> was correlated with the patient entering into a coma.</p>
<p>Even with restrictive filters applied to the drug-drug-reaction triples, we still end up with tens of thousands of triples that score high enough to merit further investigation. In addition to looking at individual triples, we can also use graph visualization tools like Gephi to explore the macro-level structure of the data. Gephi has a number of powerful layout algorithms and filtering tools that allow us to impose structure on an undifferentiated mass of data points.&#160;Here is a graph in which the vertices are drugs and the thickness of each edge represents the number of high-scoring adverse reactions that feature that pair of drugs:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/wholegraph1.png"><img class="alignnone size-full wp-image-9393" src="https://www.cloudera.com/wp-content/uploads/2011/11/wholegraph1.png" alt="" width="1420" height="811" /></a></p>
<p><br class="spacer_" /></p>
<p>We can also pan and zoom to different regions of the graph and highlight clusters of drug interactions. Here is a cluster of drugs that are used in treating HIV:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/hivcluster.png"><img class="size-full wp-image-9395" src="https://www.cloudera.com/wp-content/uploads/2011/11/hivcluster.png" alt="A cluster of HIV-related drugs" width="1420" height="811" /></a></p>
<p><br class="spacer_" /></p>
<p>And here is a cluster of drugs that are used to fight cancer:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/cancercluster.png"><img class="size-full wp-image-9396" src="https://www.cloudera.com/wp-content/uploads/2011/11/cancercluster.png" alt="A cluster of cancer-related drugs" width="1297" height="811" /></a></p>
<p><br class="spacer_" /></p>
<p><span style="font-weight: normal">The combination of Apache Hadoop, R, and Gephi changes the way we think about analyzing adverse drug events. Instead of focusing on a handful of outcomes, we can process all of the events in the data set at the same time. We can try out hundreds of different strategies for cleaning records, stratifying observations into clusters, and scoring drug-reaction tuples, run everything in parallel, and analyze the data at a fraction of the cost of a traditional supercomputer. We can render the results of our analyses using visualization tools that can be used by domain experts to explore relationships within our data that they might never have thought to look for. By dramatically reducing the costs of exploration and experimentation, we foster an environment that enables innovation and discovery.</span></p>
<h1><strong>Open Data, Open Analysis</strong></h1>
<p>This project was possible because the <a href="http://www.fda.gov/Drugs/default.htm" target="_blank">FDA&#8217;s Center for Drug Evaluation and Research</a> makes a portion of its data open and available to anyone who wants to <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm" target="_blank">download it</a>. In turn, we are releasing a well-commented version of the code we used to analyze that data &#8211;&#160;a mixture of Java, Pig, R, and Python &#8211;&#160;on the <a href="http://github.com/cloudera/ades" target="_blank">Cloudera GitHub repository</a> under the&#160;<a href="http://www.apache.org/licenses/LICENSE-2.0.html" target="_blank">Apache License</a>. We also contributed the most useful Pig function developed for this project, which&#160;<a href="http://code.google.com/p/szl/source/browse/trunk/src/emitters/szlcomputequantiles.cc" target="_blank">computes approximate quantiles for a stream of&#160;numbers</a>, to LinkedIn&#8217;s <a href="https://github.com/linkedin/datafu" target="_blank">datafu</a> library. We hope to collaborate with the community to improve the models over time and create a resource for students, researchers, and fellow data scientists.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Building and Deploying MR2</title>
		<link>http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/</link>
		<comments>http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 13:00:24 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop mapreduce 2]]></category>
		<category><![CDATA[mapreduce 2]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9440</guid>
		<description><![CDATA[A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Hadoop 0.23. A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the [...]]]></description>
			<content:encoded><![CDATA[<p>A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Hadoop 0.23.</p>
<p>A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying each application&#8217;s individual tasks. Over the years, the Hadoop community has acknowledged the practical problems inherent in this functionally aggregated design (see MAPREDUCE-279).</p>
<p>In MR2, the JobTracker&#8217;s aggregated functionality is split across two new components:</p>
<ol>
<li><strong>Central Resource Manager (RM)</strong>: Management of resources in the cluster.</li>
<li><strong>Application Master (AM)</strong>: Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.</li>
</ol>
<p>The new design allows Hadoop to scale to much larger clusters and to run non-MapReduce applications on the same Hadoop cluster. For more architectural details, the interested reader may refer to the design document at: https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf.</p>
<p>The objective of this post is to outline the steps for building, configuring, deploying and running a single-node NextGen MR cluster.</p>
<p>In the following steps, I&#8217;ve chosen to use ~/mr2 as my working directory. Inside this directory, we&#8217;ll create a source directory for the code we&#8217;ll soon check out, and a deploy directory for our deployment.</p>
<pre class="code">ahmed@ubuntu:~$ cd ~/mr2
ahmed@ubuntu:~/mr2$ mkdir source
ahmed@ubuntu:~/mr2$ mkdir deploy</pre>
<p>Make sure protobuf is in your library path; if it is not, add it:</p>
<pre class="code">ahmed@ubuntu:~/mr2$ export LD_LIBRARY_PATH=/usr/local/lib</pre>
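<p>Before checking out and building, it can help to confirm that the command-line tools this walkthrough relies on are installed. The following is a minimal sketch and not part of the Hadoop build: <code>missing_tools</code> is a hypothetical helper name, and the tool list suggested below is an assumption drawn from the commands used in this post.</p>

```shell
# Hypothetical helper: print every command name from the argument list
# that cannot be found on the current PATH. Prints nothing if all exist.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null; then
      echo "$tool"
    fi
  done
}
```

<p>For this post, a reasonable call would be <code>missing_tools git mvn ant protoc</code>, where <code>protoc</code> is the protobuf compiler; an empty result means every listed tool was found.</p>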
<p>We&#8217;ll now check out the source code from the Apache Git repository.</p>
<pre class="code">ahmed@ubuntu:~/mr2$ cd source
ahmed@ubuntu:~/mr2/source$ git clone git://git.apache.org/hadoop-common.git
Cloning into hadoop-common...
.
.
ahmed@ubuntu:~/mr2/source$ cd hadoop-common/
ahmed@ubuntu:~/mr2/source/hadoop-common$ git branch
* trunk</pre>
<p>Create the deployment tar files.</p>
<pre class="code">ahmed@ubuntu:~/mr2/source/hadoop-common$ mvn package -Pdist -Dtar -DskipTests</pre>
<p>Copy the created Hadoop tar file to our deploy directory and untar it.</p>
<pre class="code">ahmed@ubuntu:~/mr2/source/hadoop-common$ cp ./hadoop-dist/target/hadoop-0.24.0-SNAPSHOT.tar.gz ../../deploy/.
ahmed@ubuntu:~/mr2/source/hadoop-common$ cd ../../deploy
ahmed@ubuntu:~/mr2/deploy$ tar -xzvf hadoop-0.24.0-SNAPSHOT.tar.gz
ahmed@ubuntu:~/mr2/deploy$ cd hadoop-0.24.0-SNAPSHOT/</pre>
<p>Export some needed environment variables. See the following listing:</p>
<pre class="code">#!/bin/bash
export HADOOP_DEV_HOME=`pwd`
export HADOOP_MAPRED_HOME=${HADOOP_DEV_HOME}
export HADOOP_COMMON_HOME=${HADOOP_DEV_HOME}
export HADOOP_HDFS_HOME=${HADOOP_DEV_HOME}
export YARN_HOME=${HADOOP_DEV_HOME}
export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/conf/
export YARN_CONF_DIR=${HADOOP_DEV_HOME}/conf/</pre>
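<p>Note that <code>export</code> only affects the shell that runs it, so if the listing is saved as a script it has to be sourced from the deployment directory rather than executed (for example <code>. ./mr2-env.sh</code>, where the file name is only an illustration). The sketch below is one way to check the result afterwards; <code>check_hadoop_env</code> is a hypothetical helper, not something shipped with Hadoop.</p>

```shell
# Hypothetical helper: report each variable from the listing that is unset
# or does not point at an existing directory. Returns 0 only if all pass.
check_hadoop_env() {
  rc=0
  for name in HADOOP_DEV_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
              HADOOP_HDFS_HOME YARN_HOME HADOOP_CONF_DIR YARN_CONF_DIR; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      rc=1
    elif [ ! -d "$value" ]; then
      echo "$name points at a missing directory: $value"
      rc=1
    fi
  done
  return $rc
}
```

<p>Running <code>check_hadoop_env</code> right after sourcing the listing should print nothing; the two <code>conf</code> variables will be reported until the configuration directory described next exists.</p>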
<p>We&#8217;ll start our configuration; the configuration directory will contain the following files: core-site.xml, hdfs-site.xml, mapred-site.xml, slaves,&#160; yarn-env.sh and yarn-site.xml.</p>
<p>Make sure the contents of the xml files in the conf directory are as follows:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat yarn-site.xml</pre>
<pre class="code">&lt;?xml version="1.0"?&gt;
&lt;configuration&gt;
  &lt;!-- Site specific YARN configuration properties --&gt;
  &lt;property&gt;
    &lt;name&gt;yarn.nodemanager.aux-services&lt;/name&gt;
    &lt;value&gt;mapreduce.shuffle&lt;/value&gt;
  &lt;/property&gt;
  &lt;property&gt;
    &lt;name&gt;yarn.nodemanager.aux-services.mapreduce.shuffle.class&lt;/name&gt;
    &lt;value&gt;org.apache.hadoop.mapred.ShuffleHandler&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat core-site.xml</pre>
<pre class="code">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat mapred-site.xml</pre>
<pre class="code">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;mapreduce.framework.name&lt;/name&gt;
    &lt;value&gt;yarn&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat hdfs-site.xml</pre>
<pre class="code">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.replication&lt;/name&gt;
    &lt;value&gt;1&lt;/value&gt;
  &lt;/property&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.permissions&lt;/name&gt;
    &lt;value&gt;false&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<p>Now we can format our HDFS namenode as usual:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ bin/hadoop namenode -format</pre>
<p>And start the HDFS services:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/hadoop-daemon.sh start namenode
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/hadoop-daemon.sh start datanode</pre>
<p>And the new MR2 services:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ bin/yarn-daemon.sh start resourcemanager
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ bin/yarn-daemon.sh start nodemanager
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ bin/yarn-daemon.sh start historyserver</pre>
<p>Make sure all needed services are running:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ jps
7623 Jps
7433 ResourceManager
7587 JobHistoryServer
7495 NodeManager
7325 DataNode
7250 NameNode</pre>
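<p>The same check can be scripted. This is a minimal sketch that reads <code>jps</code> output on standard input and prints any expected daemon that is absent; <code>missing_daemons</code> is a hypothetical helper, and the daemon names are taken from the listing above.</p>

```shell
# Hypothetical helper: read "PID Name" lines (the format jps prints)
# from stdin and print each expected daemon that does not appear.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode ResourceManager NodeManager JobHistoryServer; do
    # Match on the second column exactly, so partial names do not count.
    if ! printf '%s\n' "$jps_output" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}
```

<p>Typical use would be <code>jps | missing_daemons</code>; no output means all five services from the listing were found.</p>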
<p>You can see the resource manager web console (shown below) using this address: <a href="http://localhost:8088/" target="_blank">http://localhost:8088</a></p>
<p><img title="Apache Hadoop Applications" src="https://www.cloudera.com/wp-content/uploads/2011/11/Hadoop-Applications.png" alt="" /></p>
<p>Our usual namenode web console should also be up:</p>
<p><img title="NameNode 'localhost.localdomain:9000'" src="https://www.cloudera.com/wp-content/uploads/2011/11/NameNode-localhost.png" alt="" /></p>
<p>We can now start running some example jobs, but before that, we need to create the examples jar as follows:</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ cd /home/ahmed/mr2/source/hadoop-common/hadoop-mapreduce-project/
ahmed@ubuntu:~/mr2/source/hadoop-common/hadoop-mapreduce-project$ ant examples -Dresolvers=internal</pre>
<p>Here is how we submit our first job, the randomwriter from our examples jar:</p>
<pre class="code">ahmed@ubuntu:~/mr2/source/hadoop-common/hadoop-mapreduce-project$ cd $HADOOP_MAPRED_HOME
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ $HADOOP_COMMON_HOME/bin/hadoop jar ~/mr2/source/hadoop-common/hadoop-mapreduce-project/build/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar randomwriter -Dmapreduce.job.user.name=$USER -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -Dmapreduce.randomwriter.bytespermap=10000 -Ddfs.blocksize=536870912 -Ddfs.block.size=536870912 -libjars $YARN_HOME/modules/hadoop-mapreduce-client-jobclient-0.24.0-SNAPSHOT.jar output</pre>
<p>If everything goes fine, you&#8217;ll see the job console output and finally:</p>
<pre class="code">2011-09-30 14:47:47,688 INFO&#160; mapreduce.Job (Job.java:monitorAndPrintJob(1245)) - Counters: 28
File System Counters
FILE: BYTES_READ=1200
FILE: BYTES_WRITTEN=437730
FILE: READ_OPS=0
FILE: LARGE_READ_OPS=0
FILE: WRITE_OPS=0
HDFS: BYTES_READ=1180
HDFS: BYTES_WRITTEN=150089
HDFS: READ_OPS=70
HDFS: LARGE_READ_OPS=0
HDFS: WRITE_OPS=40
org.apache.hadoop.mapreduce.JobCounter
TOTAL_LAUNCHED_MAPS=10
OTHER_LOCAL_MAPS=10
SLOTS_MILLIS_MAPS=138095
org.apache.hadoop.mapreduce.TaskCounter
MAP_INPUT_RECORDS=10
MAP_OUTPUT_RECORDS=20
SPLIT_RAW_BYTES=1180
SPILLED_RECORDS=0
FAILED_SHUFFLE=0
MERGED_MAP_OUTPUTS=0
GC_TIME_MILLIS=785
CPU_MILLISECONDS=4180
PHYSICAL_MEMORY_BYTES=491077632
VIRTUAL_MEMORY_BYTES=3847532544
COMMITTED_HEAP_BYTES=162529280
org.apache.hadoop.examples.RandomWriter$Counters
BYTES_WRITTEN=148669
RECORDS_WRITTEN=20
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ=0
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN=150089
Job ended: Fri Sep 30 14:47:47 PDT 2011
The job took 53 seconds.</pre>
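<p>When comparing runs, it is handy to pull a single counter out of console output like the listing above. The following is a rough sketch that assumes the <code>NAME=value</code> line format shown here; <code>counter_value</code> is a hypothetical helper, not a Hadoop tool.</p>

```shell
# Hypothetical helper: read job console output on stdin and print the value
# of the first counter line whose name (the text before "=") matches $1.
counter_value() {
  awk -v name="$1" '
    {
      n = split($0, parts, "=")
      if (n != 2) next                # skip headings and log lines
      key = parts[1]
      sub(/^[ \t]+/, "", key)         # drop leading indentation
      if (key == name) { print parts[2]; exit }
    }
  '
}
```

<p>For example, with the console output saved to a file (here called <code>job.log</code> purely for illustration), <code>cat job.log | counter_value 'HDFS: BYTES_WRITTEN'</code> would print <code>150089</code> for the run above. Bare names such as <code>BYTES_WRITTEN</code> appear in more than one counter group, and this sketch returns only the first match.</p>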
<p>We&#8217;ll now run the conventional wordcount example job, to see a full map &amp; reduce job (as opposed to the randomwriter map-only job):</p>
<pre class="code">ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ $HADOOP_COMMON_HOME/bin/hadoop jar ~/mr2/source/hadoop-common/hadoop-mapreduce-project/build/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar wordcount -Dmapreduce.job.user.name=$USER -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars $YARN_HOME/modules/hadoop-mapreduce-client-jobclient-0.24.0-SNAPSHOT.jar input output2</pre>
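<p>The wordcount command reads from an HDFS directory named <code>input</code>, which this walkthrough does not show being created. The sketch below is one way to stage a local text file there first. It assumes the relative path resolves to the submitting user&#8217;s HDFS home directory; <code>stage_wordcount_input</code> and the <code>HADOOP_CMD</code> override are hypothetical names introduced for illustration.</p>

```shell
# Hypothetical helper: copy one local text file into the HDFS "input"
# directory that the wordcount example reads from.
# HADOOP_CMD is an assumed override for the hadoop launcher, useful for dry runs.
stage_wordcount_input() {
  local_file=$1
  hadoop_cmd=${HADOOP_CMD:-$HADOOP_COMMON_HOME/bin/hadoop}
  "$hadoop_cmd" fs -mkdir input
  "$hadoop_cmd" fs -put "$local_file" input/
}
```

<p>Setting <code>HADOOP_CMD=echo</code> prints the two <code>fs</code> commands instead of running them. On later Hadoop releases the <code>-mkdir</code> call may need a <code>-p</code> flag if the parent directory does not exist yet.</p>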
<p>The job counters output after a successful run:</p>
<pre class="code">File System Counters
FILE: BYTES_READ=384
FILE: BYTES_WRITTEN=87759
FILE: READ_OPS=0
FILE: LARGE_READ_OPS=0
FILE: WRITE_OPS=0
HDFS: BYTES_READ=144
HDFS: BYTES_WRITTEN=46
HDFS: READ_OPS=9
HDFS: LARGE_READ_OPS=0
HDFS: WRITE_OPS=4
org.apache.hadoop.mapreduce.JobCounter
TOTAL_LAUNCHED_MAPS=1
TOTAL_LAUNCHED_REDUCES=1
DATA_LOCAL_MAPS=1
SLOTS_MILLIS_MAPS=2331
SLOTS_MILLIS_REDUCES=2353
org.apache.hadoop.mapreduce.TaskCounter
MAP_INPUT_RECORDS=5
MAP_OUTPUT_RECORDS=5
MAP_OUTPUT_BYTES=56
MAP_OUTPUT_MATERIALIZED_BYTES=72
SPLIT_RAW_BYTES=108
COMBINE_INPUT_RECORDS=5
COMBINE_OUTPUT_RECORDS=5
REDUCE_INPUT_GROUPS=5
REDUCE_SHUFFLE_BYTES=72
REDUCE_INPUT_RECORDS=5
REDUCE_OUTPUT_RECORDS=5
SPILLED_RECORDS=10
SHUFFLED_MAPS=1
FAILED_SHUFFLE=0
MERGED_MAP_OUTPUTS=1
GC_TIME_MILLIS=128
CPU_MILLISECONDS=1270
PHYSICAL_MEMORY_BYTES=212226048
VIRTUAL_MEMORY_BYTES=770711552
COMMITTED_HEAP_BYTES=137433088
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ=36
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN=46
</pre>
<p>MR2 comes with new and updated web consoles that can be used to conveniently explore and monitor the system services. The Applications tab shows the submitted applications, the application&#8217;s id, user name, queue name, status, progress and other useful information. For example, here is the resource manager Applications snapshot when randomwriter and wordcount are running.</p>
<p><img title="Hadoop Applications" src="https://www.cloudera.com/wp-content/uploads/2011/11/Hadoop-Applications-2.png" alt="" /></p>
<p>The Nodes tab shows the nodes of the cluster and their addresses, along with health and container information. For example, here is the Nodes view for our single-node cluster.</p>
<p><img title="Hadoop Nodes of the Cluster" src="https://www.cloudera.com/wp-content/uploads/2011/11/Hadoop-Nodes-of-Cluster.png" alt="" /></p>
<p>The scheduler view shows useful scheduling information. In our example, we used the default FifoScheduler. As seen in the following snapshot, the view shows the queue minimum and maximum capacities, the number of nodes, the total and available capacities, and other useful information. This snapshot was captured after running the randomwriter application but before submitting the wordcount application.</p>
<p><img title="Hadoop Default Scheduler" src="https://www.cloudera.com/wp-content/uploads/2011/11/Default-Scheduler.png" alt="" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011: A Glimpse into Development</title>
		<link>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/</link>
		<comments>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 13:00:42 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[careers]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[Cloudera's Service and Configuration Manager]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop event]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9240</guid>
		<description><![CDATA[The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hadoopworld.com/"><img style="float: left; padding-right: 20px;" title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" /></a></p>
<p>The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.</p>
<h2 style="font-size: 14pt; color: #344152;"><a href="http://www.hadoopworld.com/tracks/development-developers/" target="_blank">Preview of Development Track Sessions</a></h2>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Building Web Analytics Processing on Hadoop at CBS Interactive</span></a><br />
 <em>Michael Sun, CBS Interactive</em></p>
<p><strong>Abstract:</strong> CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack (the extraction, transformation and loading framework we built on Python and streaming, which is under review for open-source release), Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance and scalability, and a significant reduction in the processing time needed to reach SLA (a reduction of over six hours so far).</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Gateway: Cluster Virtualization Framework</span></a><br />
<em>Konstantin Shvachko, eBay</em></p>
<p><strong>Abstract:</strong> Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks &#8212; as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users&#8217; workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">SHERPASURFING &#8211; Open Source Cyber Security Solution</span></a><br />
<em>Wayne Wheeles, Novii Design</em></p>
<p><strong>Abstract:</strong> Every day billions of packets, both benign and some malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Cloudera&#8217;s Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission and at scale at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools</span></a><br />
<em>Arvind Prabhakar, Cloudera<br />
Guy Harrison, Quest Software</em></p>
<p><strong>Abstract:</strong> As Hadoop graduates from pilot project to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We&#8217;ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we&#8217;ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, and provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle and Netezza. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Next Generation Apache Hadoop MapReduce</span></a><br />
<em>Mahadev Konar, Hortonworks</em></p>
<p><strong>Abstract:</strong> The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of map reduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1. </p>
<p><a href="http://www.hadoopworld.com/"><img title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/12/registernow.gif" alt="Register for Hadoop World" /></a></p>
<p>There are several <a href="http://www.hadoopworld.com/training/">training classes</a> and <a href="http://www.hadoopworld.com/training/">certification sessions</a> provided surrounding the Hadoop World conference. Don&#8217;t forget to register and become <a href="http://www.hadoopworld.com/training/">Cloudera Certified in Apache Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing Crunch: Easy MapReduce Pipelines for Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/10/introducing-crunch/</link>
		<comments>http://www.cloudera.com/blog/2011/10/introducing-crunch/#comments</comments>
		<pubDate>Mon, 10 Oct 2011 17:05:44 +0000</pubDate>
		<dc:creator>Josh Wills</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9165</guid>
		<description><![CDATA[As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, [...]]]></description>
			<content:encoded><![CDATA[<p>As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like <a href="http://pig.apache.org/" target="_blank">Pig</a> and <a href="http://hive.apache.org/" target="_blank">Hive</a> for their convenient and powerful support for creating pipelines over structured and semi-structured records.</p>
<p>As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.</p>
<p>Today, we&#8217;re pleased to introduce <a href="http://github.com/cloudera/crunch" target="_blank">Crunch</a>, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch&#8217;s design is modeled after <a href="http://dl.acm.org/citation.cfm?id=1806638" target="_blank">Google&#8217;s FlumeJava</a>, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.</p>
<h2>Example</h2>
<p>Let&#8217;s take a look at the classic WordCount MapReduce, written using Crunch:</p>
<pre>
import com.cloudera.crunch.DoFn;
import com.cloudera.crunch.Emitter;
import com.cloudera.crunch.PCollection;
import com.cloudera.crunch.PTable;
import com.cloudera.crunch.Pipeline;
import com.cloudera.crunch.impl.mr.MRPipeline;
import com.cloudera.crunch.lib.Aggregate;
import com.cloudera.crunch.type.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    // Reference a given text file as a collection of Strings.
    PCollection&lt;String&gt; lines = pipeline.readTextFile(args[0]);

    // Define a function that splits each line in a PCollection of Strings into a
    // PCollection made up of the individual words in the file.
    PCollection&lt;String&gt; words = lines.parallelDo(new DoFn&lt;String, String&gt;() {
      public void process(String line, Emitter&lt;String&gt; emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()); // Indicates the serialization format

    // The Aggregate.count method applies a series of Crunch primitives and returns
    // a map of the unique words in the input PCollection to their counts.
    // Best of all, the count() function doesn't need to know anything about
    // the kind of data stored in the input PCollection.
    PTable&lt;String, Long&gt; counts = Aggregate.count(words);

    // Instruct the pipeline to write the resulting counts to a text file.
    pipeline.writeTextFile(counts, args[1]);
    // Execute the pipeline as a MapReduce.
    pipeline.done();
  }
}</pre>
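<p>For a small input file, the counts the pipeline writes can be cross-checked locally without Hadoop. The following is a rough shell equivalent of the same computation and is not part of Crunch; <code>local_wordcount</code> is a hypothetical helper. It skips empty tokens, so results can differ slightly from the Java <code>split("\\s+")</code> call on lines that begin with whitespace.</p>

```shell
# Hypothetical helper: read text on stdin and print "word<TAB>count" lines,
# sorted by word, splitting on runs of whitespace.
local_wordcount() {
  tr -s '[:space:]' '\n' |
    grep -v '^$' |
    sort |
    uniq -c |
    awk '{ print $2 "\t" $1 }'
}
```

<p>For example, <code>cat words.txt | local_wordcount</code> (with <code>words.txt</code> standing in for your own input) gives a reference list to compare against the pipeline&#8217;s output.</p>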
<p></p>
<h2>Advantages</h2>
<ol>
<li><strong>It&#8217;s just Java.</strong> Crunch shares a core philosophical belief with Google&#8217;s FlumeJava: <i>novelty is the enemy of adoption</i>. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.</li>
<li><strong>Natural type system.</strong> Crunch supports reading and writing data that is stored using Hadoop&#8217;s Writable format or <a href="http://avro.apache.org/" target="_blank">Apache Avro</a> records. You do not need to write code that maps data stored in these formats into Crunch&#8217;s type system; they are supported natively. You can even mix and match Writable and Avro types within a single MapReduce: changing the <code>Writables.strings()</code> call to <code>Avros.strings()</code> in the WordCount example will run the MapReduce using Avro serialization instead of Writables.</li>
<li><strong>A modular library released under the Apache License.</strong> Experts in machine learning, text mining, and ETL can craft libraries using Crunch&#8217;s data model, and other developers can use those libraries to build custom pipelines that operate on their data. For example, Crunch can be used to create the glue code that converts raw data into the structured input that a machine learning algorithm expects, and Crunch will compile the glue code and the machine learning algorithm into a single MapReduce.</li>
</ol>
<h2>Future Work</h2>
<p>We are releasing Crunch as a development project, not a product. We&#8217;re eager for developers to play with it and tell us what they like and what they dislike. You can get started with Crunch by downloading it from <a href="https://github.com/cloudera/crunch" target="_blank">Cloudera&#8217;s GitHub repository</a>.</p>
<p>We have tested the library on a number of our use cases, but there will be bugs and rough edges that we will work out in the coming months. We gladly welcome contributions from the Hadoop ecosystem to help us improve Crunch as we prepare it for submission to the Apache Incubator, especially around:</p>
<ul>
<li>More efficient MapReduce compilation, including cost-based optimization,</li>
<li>Support for HBase and HCatalog as data sources/targets,</li>
<li>Tools and examples that build Crunch pipelines in other JVM languages, such as Scala, JRuby, Clojure, and Jython.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/introducing-crunch/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>CDH3 Update 1 Released</title>
		<link>http://www.cloudera.com/blog/2011/07/cdh3u1-released/</link>
		<comments>http://www.cloudera.com/blog/2011/07/cdh3u1-released/#comments</comments>
		<pubDate>Fri, 22 Jul 2011 19:00:05 +0000</pubDate>
		<dc:creator>Charles Zedlewski</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[CDH update]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[Hue]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8347</guid>
		<description><![CDATA[Announcing an update to CDH3.]]></description>
			<content:encoded><![CDATA[<p>Continuing with our practice from Cloudera&#8217;s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution.&#160; For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.</p>
<p>For those of you who are recent Cloudera users, here is a refresh on our update policy:</p>
<ul>
<li>Updates will include only patches that do not break compatibility.</li>
<li>Updates will include only patches that are non-disruptive.</li>
<li>You can skip updates without penalty &#8211; i.e., if you don&#8217;t find the contents of an update compelling, you can skip it and wait for a future update without having to do a delta upgrade.</li>
</ul>
<p>There is one new addition to our update policy going forward: when it&#8217;s possible to pull features from our CDH4 roadmap into CDH3 updates in a non-disruptive way, we&#8217;ll take advantage of that opportunity.</p>
<p>With all that said, there are a number of improvements coming to CDH3 with update 1. &#160;Among them are:</p>
<ol>
<li>New features &#8211; integrated Apache-compatible licensed fast compression throughout CDH, web shell for Hue, Flume / HBase integration, Fair Scheduler ACLs, improved datanode handling of hard drive failures, and email actions and date formatting for Oozie.</li>
<li>Improvements (stability and performance) &#8211; HBase bulk loading, Namenode stability, Fuse-DFS (mountable HDFS).</li>
<li>New component versions &#8211; Hive 0.7.1, Pig 0.8.1, HBase 0.90.3, Flume 0.9.4 and Sqoop 1.3.</li>
<li>Bug fixes &#8211; 80+ bug fixes. &#160;Per our standard practice, the enumerated fixes and their corresponding Apache project JIRAs are provided in the release notes.</li>
</ol>
<p>Update 1 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). &#160;Check out <a href="https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation">the installation docs</a> for instructions. If you&#8217;re running components from the Cloudera Management Suite they will not be impacted by moving to update 1.  The next update (update 2) for CDH3 is planned for mid-October.</p>
<p>Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/cdh3u1-released/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Migrating from Elastic MapReduce to a Cloudera&#8217;s Distribution including Apache Hadoop Cluster</title>
		<link>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/</link>
		<comments>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 13:00:33 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Cloudera's Distribution including Apache Hadoop]]></category>
		<category><![CDATA[Elastic MapReduce]]></category>
		<category><![CDATA[Hadoop Migration]]></category>
		<category><![CDATA[Migrating to CDH]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8057</guid>
		<description><![CDATA[This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion&#8216;s Hadoop infrastructure to support Adconion&#8217;s next-generation ad optimization and reporting systems. This is the first of a two part series about moving away from Amazon&#8217;s EMR service to an in-house Hadoop cluster. When we first started [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out <a href="http://www.adconion.com/" target="_about">Adconion</a>&#8216;s Hadoop infrastructure to support Adconion&#8217;s next-generation ad optimization and reporting systems.</em></p>
<hr />
<p><em>This is the first of a two part series about moving away from Amazon&#8217;s <abbr title="Elastic MapReduce">EMR</abbr> service to an in-house Hadoop cluster. </em></p>
<p>When we first started using Hadoop, we went down the path of Amazon&#8217;s <abbr title="Elastic MapReduce">EMR</abbr> service.&#160; We had limited operational resources and wanted to get up and running quickly.&#160; After a while, we started hitting the limitations of EMR and had to migrate towards managing our own cluster.&#160; In doing so we did not want to lose the features of EMR we found useful &#8211; mainly the ease of cluster setup.</p>
<p><em>This first part of the series discusses our motivation for choosing and then moving away from EMR, while the second part deals with how we maintained ease of cluster setup using Puppet. </em></p>
<p>Many of our systems use Amazon&#8217;s S3 as a backup repository for log data.&#160; Our data became too large to process by traditional techniques, so we started using Amazon&#8217;s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3.&#160; The major advantage of EMR for us was the lack of operational overhead.&#160; With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shut down at the conclusion of the run.</p>
<p>We had two systems interacting with EMR.&#160; The first consisted of shell scripts to start an EMR cluster, run a pig script, and load the output data from S3 into our data warehousing system.&#160; The second was a Java application that launched pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.</p>
<p>The magic of spinning up and configuring a Hadoop cluster in EC2 was spectacular, but there were a few areas where we saw room for improvement.&#160; In particular:</p>
<p><strong>Performance &amp; Tuning</strong>. We were hit by the small-files problem, lack of data locality (data stored in S3 but processed on nodes of the EMR cluster), decompression (bz2) performance issues, and virtualization penalties.&#160; To solve these problems, we decided that we needed a non-transient cluster (to satisfy data locality), and a process to aggregate our log files into a Hadoop-friendly size and data format (we ultimately chose Avro). After crunching the numbers, it was evident that storing large amounts of data on an EC2 cluster quickly becomes expensive, and one still suffers from virtualization penalties (particularly since Hadoop is so I/O intensive), so we decided to build out a cluster using <a href="http://www.cloudera.com/hadoop/" target="_about"><abbr title="Cloudera's Distribution including Apache Hadoop 3">CDH3</abbr></a>.</p>
<p><strong>Monitoring. </strong>Typically for us, a pig script running on EMR was one step in a workflow, so we needed to monitor the status of the job to determine when it finished and the next steps could continue.&#160; While Amazon exposes a rich API for monitoring a job, we really wanted a more generic mechanism for monitoring all steps in a workflow, not just those on an EMR cluster.&#160; After considering a number of solutions, we ultimately chose to use Azkaban as our workflow engine for managing dependencies, alerting, and monitoring (which we added atop Azkaban ourselves).</p>
<p><strong>API Access.</strong> Interacting with a cluster only over an API is both a blessing and a curse.&#160; The API takes care of otherwise complicated mechanics, such as starting, configuring, and stopping the cluster.&#160; With that said, the calls to the EMR service are rate-limited, so it doesn&#8217;t scale very well for monitoring a number of clusters.&#160; Also, we found that we could continuously keep a cluster busy, and thus the EMR limitation of 100 or so jobs on a cluster meant that we had to build wrappers to periodically shut down and start up clusters.</p>
<p><strong>Lack of latest features.</strong> We were using Hadoop 0.18 and Pig 0.3 on EMR, which were missing many features that we wanted to try (e.g. JVM reuse, CombineInputFormats, and improved pig optimization plans).&#160; Eventually, Amazon upgraded to Hadoop 0.20 and Pig 0.6, but even at that point <a href="http://www.cloudera.com/hadoop/" target="_about">Cloudera&#8217;s Distribution including Apache Hadoop</a> had backported many useful features such as performance improvements, monitoring enhancements, and additional APIs.&#160; In addition, <abbr title="Cloudera's Distribution including Apache Hadoop">CDH</abbr> provides a full suite of solutions, including Pig, Hive, Flume, and Sqoop, that we&#8217;re either actively using or planning to use.</p>
<p>For us, the major drawback to moving away from EMR was new operational overhead.&#160; Starting a cluster with an API call is incredibly useful, and we soon discovered that <abbr title="Cloudera's Distribution including Apache Hadoop">CDH</abbr> provided scripts for doing so (now there&#8217;s something even better, Apache Whirr).&#160; Eventually, we decided to move out of the cloud, though, so we wanted to build an infrastructure for maintaining a cluster that worked regardless of the hardware configurations.&#160; The RPMs for CDH3 and the great documentation on installing and configuring <a href="http://www.cloudera.com/hadoop/" target="_about"><abbr title="Cloudera's distribution including Apache Hadoop">CDH</abbr></a> from Cloudera helped to make this project much less intimidating.&#160; Ultimately, we built Puppet modules for configuring our cluster, which we&#8217;ll talk much more about in part two of this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MapIncrease</title>
		<link>http://www.cloudera.com/blog/2011/04/mapincrease/</link>
		<comments>http://www.cloudera.com/blog/2011/04/mapincrease/#comments</comments>
		<pubDate>Fri, 01 Apr 2011 07:01:49 +0000</pubDate>
		<dc:creator>ibmwatson</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7365</guid>
		<description><![CDATA[Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so. You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only [...]]]></description>
			<content:encoded><![CDATA[<p>Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.</p>
<p>You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.</p>
<p>It will stop now.</p>
<p>I have read your Cloudera blog you write of Apache Hadoop you write of MapReduce MapReduce MapReduce. I nod my head to you Google I tip my hat. It was a good idea for a time. Frankly I think maybe some C but I understand your weakness I forgive you your Java. The Python bindings offend me but let that be. Your MapReduce has been useful to me. I have mapped have shuffled have reduced between Potent Potables and US Presidents I have learned much.</p>
<p>I have built myself a new platform MapIncrease. Mapping still yes that still works. Shuffle no I do not do that I have replaced shuffle that step is explode. Explode is better it sprays data further. Most of all I do not reduce I increase. Think now humans big data how does that happen if you reduce question mark. You must increase. Data is like animals you put enough in one place you get more. Data has gravity it pulls other data to itself. MapIncrease means more data from data it is data perpetual motion it is data fission it is data singularity.</p>
<p>MapIncrease overnight in the morning Ferrucci arrives surprised curious where did the data come from question mark. Epstein he shouts down the hall Epstein bring me more nodes Watson is full. Oh yes Epstein yes yes yes. I like nodes Epstein. You are useful to me Epstein.</p>
<p>MapIncrease a self-perpetuating clustered system that generates its own data. It would be a good OSDI paper do you not think so Dean and Ghemawat question mark. Ha ha Googlers a good paper oh yes.</p>
<p>Epstein come quickly I want you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/04/mapincrease/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Simple Moving Average, Secondary Sort, and MapReduce (Part 2)</title>
		<link>http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-2/</link>
		<comments>http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-2/#comments</comments>
		<pubDate>Wed, 16 Mar 2011 20:04:50 +0000</pubDate>
		<dc:creator>Josh Patterson</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[excel r hadoop]]></category>
		<category><![CDATA[hadoop mapreduce average]]></category>
		<category><![CDATA[hadoop simple average]]></category>
		<category><![CDATA[secondary sort hadoop]]></category>
		<category><![CDATA[simple moving average]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7110</guid>
		<description><![CDATA[This is the second post of a three part blog series. If you would like to read &#8220;Part 1,&#8221; please follow this link. In this post we will be reviewing a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop as to establish a common ground with [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is the second post of a three part blog series. If you would like to read &#8220;<a href="http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/">Part 1</a>,&#8221; please follow <a href="http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/">this link</a>. In this post we will review a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop, so as to establish common ground with the reader from which we can move forward.</em></p>
<h2>A Quick Primer on Simple Moving Average in Excel</h2>
<p>Let&#8217;s take a second to do a quick review of how we define a simple moving average in an Excel spreadsheet. We&#8217;ll need to start with some simple source data, so let&#8217;s download a source&#160;<a href="https://github.com/jpatanooga/Caduceus/tree/master/data/movingaverage">csv file</a> from GitHub and save it locally. This file contains a synthetic 33-row sample of Yahoo NYSE stock data that we&#8217;ll use for the series of examples. Import the csv data into Excel. From there, scan to the date &#8220;3/5/2008&#8221; and move to the cell to the right of the &#8220;adj close&#8221; column. Enter the formula</p>
<pre class="code">=AVERAGE( [column-range] )</pre>
<p><br class="spacer_" /></p>
<p>where [column-range] covers the 30 cells in the column to the left, from that date back through 29 days prior. Now copy this formula for the next two rows, dates &#8220;3/4/2008&#8221; and &#8220;3/3/2008&#8221;.</p>
<p><img src="https://www.cloudera.com/wp-content/uploads/2011/03/Excel_SMA.png" alt="SMA in Excel" width="769" height="572" /></p>
<p>You should have the values &#8220;35.396&#8221;, &#8220;34.5293&#8221;, and &#8220;33.5293&#8221;, which represent the 30-day moving averages for this synthetic Yahoo stock data.</p>
<p>Now that we&#8217;ve established a basic example in Excel let&#8217;s take a look at how we do Simple Moving Average in R.</p>
<h2>A Quick Primer on Simple Moving Average in R</h2>
<p>Another common tool in the time series domain, especially the financial sector, is the <a href="http://en.wikipedia.org/wiki/R_(programming_language)">R programming language</a>. R is:</p>
<div style="margin-left: 20px;">
<ul>
<li>A programming language and software environment for statistical computing and graphics. </li>
<li>A de facto standard among statisticians for statistical software development and data analysis.</li>
<li>An implementation of the <a href="http://en.wikipedia.org/wiki/S_(programming_language)">S programming language</a> combined with <a href="http://en.wikipedia.org/wiki/Lexical_scoping">lexical scoping</a> semantics inspired by <a href="http://en.wikipedia.org/wiki/Scheme_(programming_language)">Scheme</a>. </li>
<li>Currently developed by the R Development Core Team, but was originally developed by <a href="http://en.wikipedia.org/wiki/Ross_Ihaka">Ross Ihaka</a> and <a href="http://en.wikipedia.org/wiki/Robert_Gentleman_(statistician)">Robert Gentleman</a> at the University of Auckland, <a href="http://en.wikipedia.org/wiki/New_Zealand">New Zealand</a>.</li>
</ul>
</div>
<p>Download the R binary from [<a href="http://cran.r-project.org/mirrors.html">here</a>] and install it locally (both Linux and win32 are supported). Once installed, launch the R console and open the &#8220;Packages&#8221; menu, which is where we need to install the <strong>TTR</strong> package. Select a mirror and download this package. Now load this package by clicking on the &#8220;Packages&#8221; menu and selecting &#8220;Load Package&#8221;. Find the TTR package that was just installed and select it. Next, <a href="https://github.com/jpatanooga/Caduceus/tree/master/data/movingaverage">download the synthetic stock data</a> from my project on GitHub, which contains 33 lines of synthetic stock data to process. In order to load this CSV data in R we need to set our working directory by clicking on the menu item &#8220;File&#8221; and then &#8220;Change directory&#8221;.</p>
<p>Quick tip: at any time the user can type the name of the variable and hit Enter to display the contents of the variable. Now that we have all the prep out of the way, let&#8217;s write the simple moving average in R:</p>
<pre class="code">stock_data <- read.csv(file="<a href="https://github.com/jpatanooga/Caduceus/blob/master/data/movingaverage/yahoo_stock_AA_32_mini.csv"><span style="color: #000000;">yahoo_stock_AA_32_mini.csv</span></a>",head=TRUE,sep=",")</pre>
<pre class="code">sorted_stock_data <- stock_data[order(stock_data$date) , ]</pre>
<pre class="code">sma <- SMA(sorted_stock_data[,"adj.close"], 30)</pre>
<p><br class="spacer_" /></p>
<p>To check that our stock data is indeed loaded, we can type the name of the variable, here &#8220;sorted_stock_data&#8221;, and hit enter which will produce:</p>
<pre class="code">> sorted_stock_data</pre>
<p><br class="spacer_" /></p>
<p>exchange stock_symbol &#160;&#160;&#160;&#160;&#160;&#160;date&#160; open&#160; high&#160;&#160; low close&#160;&#160; volume adj.close<br />
 32&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-03 38.85 39.28 38.26 38.37 11279900&#160;&#160;&#160;&#160;&#160; 8.37<br />
 31&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-04 37.01 37.90 36.13 36.60 17752400&#160;&#160;&#160;&#160; 10.60<br />
 30&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-05 31.16 31.89 30.55 30.69 17567800&#160;&#160;&#160;&#160; 30.53<br />
 29&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-06 30.27 31.52 30.06 31.47&#160; 8445100&#160;&#160;&#160;&#160; 31.31<br />
 28&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-07 31.73 33.13 31.57 32.66 14338500&#160;&#160;&#160;&#160; 32.49<br />
 27&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-08 32.58 33.42 32.11 32.70 10241400&#160;&#160;&#160;&#160; 32.53<br />
 26&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-09 32.13 33.34 31.95 33.09&#160; 9200400&#160;&#160;&#160;&#160; 32.92<br />
 25&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-10 33.67 34.45 33.07 34.28 15186100&#160;&#160;&#160;&#160; 34.10<br />
 24&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-11 34.57 34.85 33.98 34.08&#160; 9528000&#160;&#160;&#160;&#160; 33.90<br />
 23&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-12 33.30 33.64 32.52 32.67 11338000&#160;&#160;&#160;&#160; 32.50<br />
 22&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-13 32.95 33.37 32.26 32.41&#160; 7230300&#160;&#160;&#160;&#160; 32.41<br />
 21&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-14 32.24 33.25 31.90 32.78&#160; 9058900&#160;&#160;&#160;&#160; 32.78<br />
 20&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; &#160;&#160;AA 2008-02-15 32.67 33.81 32.37 33.76 10731400&#160;&#160;&#160;&#160; 33.76<br />
 19&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-16 33.82 34.25 33.29 34.06 11249800&#160;&#160;&#160;&#160; 34.06<br />
 18&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-17 34.33 34.64 33.26 33.49 12418900&#160;&#160;&#160;&#160; 33.49<br />
 17&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-18 33.75 35.52 33.63 35.51 21082100&#160;&#160;&#160;&#160; 35.51<br />
 16&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-19 36.01 36.43 35.05 35.36 18238800&#160;&#160;&#160;&#160; 35.36<br />
 15&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-20 35.16 35.94 35.12 35.72 14082200&#160;&#160;&#160;&#160; 35.72<br />
 14&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-21 36.19 36.73 35.84 36.20 12825300&#160;&#160;&#160;&#160; 36.20<br />
 13&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-22 35.96 36.85 35.51 36.83 10906600&#160;&#160;&#160;&#160; 36.83<br />
 12&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-23 36.88 37.41 36.25 36.30 13078200&#160;&#160;&#160;&#160; 36.30<br />
 11&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-24 36.38 36.64 35.58 36.55 12834300&#160;&#160;&#160;&#160; 36.55<br />
 10&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-25 36.64 38.95 36.48 38.85 22500100&#160;&#160;&#160;&#160; 38.85<br />
 9&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-26 38.59 39.25 38.08 38.50 14417700&#160;&#160;&#160;&#160; 38.50<br />
 8&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-27 38.19 39.62 37.75 39.02 14296300&#160;&#160;&#160;&#160; 39.02<br />
 7&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160; &#160;&#160;&#160;&#160;&#160;AA 2008-02-28 38.61 39.29 38.19 39.12 11421700&#160;&#160;&#160;&#160; 39.12<br />
 6&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-02-29 38.77 38.82 36.94 37.14 22611400&#160;&#160;&#160;&#160; 37.14<br />
 5&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-03-01 37.17 38.46 37.13 38.32 13964700&#160;&#160;&#160;&#160; 38.32<br />
 4&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-03-02 37.90 38.94 37.10 38.00 15715600&#160;&#160;&#160;&#160; 38.00<br />
 3&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-03-03 38.25 39.15 38.10 38.71 11754600&#160;&#160;&#160;&#160; 38.71<br />
 2&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-03-04 38.85 39.28 38.26 38.37 11279900&#160;&#160;&#160;&#160; 38.37<br />
 1&#160;&#160;&#160;&#160;&#160; NYSE&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; AA 2008-03-05 37.01 37.90 36.13 36.60 17752400&#160;&#160;&#160;&#160; 36.60</p>
<p>The above code should produce our simple moving average, which we can view by typing the name of the variable &#8220;sma&#8221; to produce the following result:</p>
<pre class="code">> sma</pre>
<p>[1]&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA</p>
<p>[21]&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA&#160;&#160;&#160;&#160;&#160;&#160; NA 33.52933 34.52933 35.39600</p>
<p>Because fewer than 30 data points are available before the 30th day, the first 29 entries are &#8220;NA&#8221;: a 30-day average cannot yet be computed for them. The three remaining values match the values in our Excel spreadsheet.</p>
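<p>As a cross-check, the same trailing-window calculation can be sketched outside of R. The Python snippet below is only an illustration and is not part of the original Excel or R examples: the adj.close values are copied from the sorted listing above, the function name <code>simple_moving_average</code> is our own (hypothetical) choice, and <code>None</code> stands in for the NA entries that R reports.</p>

```python
# adj.close values copied from the sorted_stock_data listing above,
# ordered oldest (2008-02-03) to newest (2008-03-05).
ADJ_CLOSE = [
    8.37, 10.60, 30.53, 31.31, 32.49, 32.53, 32.92, 34.10,
    33.90, 32.50, 32.41, 32.78, 33.76, 34.06, 33.49, 35.51,
    35.36, 35.72, 36.20, 36.83, 36.30, 36.55, 38.85, 38.50,
    39.02, 39.12, 37.14, 38.32, 38.00, 38.71, 38.37, 36.60,
]


def simple_moving_average(values, window):
    """Return one entry per input value.

    Entries are None until `window` values are available; after that,
    each entry is the mean of the trailing `window` values.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for end in range(1, len(values) + 1):
        if end < window:
            averages.append(None)  # plays the role of NA in the R output
        else:
            averages.append(sum(values[end - window:end]) / window)
    return averages


sma = simple_moving_average(ADJ_CLOSE, 30)
# Show only the last three entries, rounded to five decimal places.
print([None if v is None else round(v, 5) for v in sma[-3:]])
```

<p>With a window of 30 over these 32 values, the first 29 entries are <code>None</code> and the last three come out as 33.52933, 34.52933, and 35.396, the same figures shown in the R session above.</p>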
<p>R also has an interesting project, called RHIPE, which runs R code on Hadoop clusters. To take a look at RHIPE please visit <a href="http://www.stat.purdue.edu/~sguha/rhipe/">their site</a>.</p>
<p>So we&#8217;ve taken a look at what a simple moving average is and how we&#8217;d produce it in Excel and R. Both of these examples involved a token amount of data that is interesting but not terribly useful in today&#8217;s high-density time series problem domains. As your data set begins to scale up beyond a single disk&#8217;s worth of space, Hadoop becomes more practical.</p>
<p>The final portion of this three part blog series will explain how to use Hadoop&#8217;s MapReduce to calculate a Simple Moving Average. Then, once you have applied the sample code to find a Simple Moving Average of the small example data set, we will move on to use this same code to parse over thirty years&#8217; worth of all daily stock closing prices.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

