<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; pig</title>
	<atom:link href="http://www.cloudera.com/blog/category/pig/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events</title>
		<link>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/</link>
		<comments>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 17:54:58 +0000</pubDate>
		<dc:creator>Josh Wills</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[Use Case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9314</guid>
		<description><![CDATA[Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson&#160;presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools &#8211;&#160;R, Gephi, and Cloudera&#8217;s Distribution Including Apache Hadoop [...]]]></description>
			<content:encoded><![CDATA[<p>Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson&#160;<a href="http://www.informationweek.com/video/1227036510001" target="_blank">presented</a> some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools &#8211;&#160;<a href="http://www.r-project.org/">R</a>, <a href="http://gephi.org/">Gephi</a>, and <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution Including Apache Hadoop</a> (CDH) &#8211; to solve an old problem in a new way.</p>
<h1 style="text-align: left">Background: Adverse Drug Events</h1>
<p style="text-align: left">An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with <a href="http://www.ahrq.gov/qual/aderia/aderia.htm#14">one study</a> estimating that 9.7% of adverse drug events lead to permanent disability. <a href="http://www.ahrq.gov/qual/aderia/aderia.htm#1">Another study</a> showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.</p>
<p style="text-align: left">Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together lead to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly <a href="http://jama.ama-assn.org/content/300/24/2867.short">4% of adults older than 55 are at risk for a major drug interaction</a>.</p>
<p style="text-align: left">Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the <a href="http://www.fda.gov/">FDA</a> uses the <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm">Adverse Event Reporting System</a> (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. &#160;The FDA makes a well-formatted <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm" target="_blank">sample of the reports</a> available&#160;for download from their website, to the benefit of data scientists everywhere.</p>
<h1 style="text-align: left">Methodology</h1>
<p style="text-align: left">Identifying ADEs is primarily a <a href="http://www.quora.com/What-are-some-good-resources-for-learning-about-signal-estimation-and-detection" target="_blank">signal detection problem</a>: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a <a href="http://en.wikipedia.org/wiki/Contingency_table" target="_blank">2&#215;2 contingency table</a>:</p>
<table style="text-align: center" border="0" cellspacing="0" cellpadding="0" width="681">
<col span="4" width="170"></col>
<tbody>
<tr>
<td width="170" height="42">
<p style="text-align: center">For All Drugs/Reactions:</p>
</td>
<td width="170">
<p>Reaction = R<sub>j</sub></p>
</td>
<td width="170">
<p>Reaction != R<sub>j</sub></p>
</td>
<td width="170">
<p>Total</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p style="text-align: center">Drug = D<sub>i</sub></p>
</td>
<td width="170">
<p>A</p>
</td>
<td width="170">
<p>B</p>
</td>
<td width="170">
<p>A + B</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p style="text-align: center">Drug != D<sub>i</sub></p>
</td>
<td width="170">
<p>C</p>
</td>
<td width="170">
<p>D</p>
</td>
<td width="170">
<p>C + D</p>
</td>
</tr>
<tr>
<td width="170" height="42">
<p>Total</p>
</td>
<td width="170">
<p>A + C</p>
</td>
<td width="170">
<p>B + D</p>
</td>
<td width="170">
<p style="text-align: center">A + B + C + D</p>
</td>
</tr>
</tbody>
</table>
<p style="text-align: center">&#160;</p>
<p style="text-align: left">Based on the values in the cells of the tables, we can compute various <a href="http://www.ncbi.nlm.nih.gov/pubmed/11998548" target="_blank">measures of disproportionality</a> to find drug-reaction pairs that occur more frequently than we would expect if they were independent.</p>
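<p style="text-align: left">As a rough illustration of what such a measure looks like, here is a minimal Python 3 sketch of two widely used ones, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR), written in terms of the cells A, B, C and D from the table above. The function names and the counts are invented for this example; they are not taken from the AERS data or from our code, and the sketch does not guard against empty cells.</p>

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR = [A / (A + B)] / [C / (C + D)], using the table's cell labels."""
    return (a / (a + b)) / (c / (c + d))


def reporting_odds_ratio(a, b, c, d):
    """ROR = (A * D) / (B * C), using the table's cell labels."""
    return (a * d) / (b * c)


# Hypothetical counts, invented purely for illustration.
a, b, c, d = 20, 80, 100, 9800

prr = proportional_reporting_ratio(a, b, c, d)
ror = reporting_odds_ratio(a, b, c, d)
print("PRR:", round(prr, 2), "ROR:", round(ror, 2))
```

<p style="text-align: left">Values well above 1 suggest that the drug and the reaction are reported together more often than independence would predict. With only a handful of observations these raw ratios become very noisy.</p>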
<p style="text-align: left">For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper, &#8220;<a href="http://dl.acm.org/citation.cfm?id=502526" target="_blank">Empirical bayes screening for multi-item associations</a>&#8221; by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses and <a href="http://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=2&amp;ved=0CEkQFjAB&amp;url=http%3A%2F%2Fonlinepubs.trb.org%2Fonlinepubs%2FUA%2F111510DuMouchel.pdf&amp;ei=8Wu3TpqTAqHliAKsl8Vj&amp;usg=AFQjCNGX7xhIX1hv_cs2_OMjlAvzkqmkkg" target="_blank">analyzing defects in automobiles</a>.</p>
<h1>Solving the Hard Problem with Apache Hadoop</h1>
<p>At first glance, it doesn&#8217;t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a <a href="http://en.wikipedia.org/wiki/Combinatorial_explosion" target="_blank">combinatorial explosion</a> in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 <em>trillion</em> potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative <a href="http://www.seanborman.com/publications/EM_algorithm.pdf" target="_blank">Expectation Maximization algorithm</a> that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.</p>
<p>The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.</p>
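<p>To make the counting step concrete, here is a small single-machine Python sketch of the map and reduce logic for drug-drug-reaction triples. It is only an illustration of the idea, not the pipeline we ran: the function names and the sample reports are hypothetical, and a real job would distribute the map output across the cluster before summing.</p>

```python
from collections import Counter
from itertools import combinations


def map_report(drugs, reactions):
    """Map step: emit one (drug_a, drug_b, reaction) key per drug pair and reaction."""
    for drug_a, drug_b in combinations(sorted(set(drugs)), 2):
        for reaction in set(reactions):
            yield (drug_a, drug_b, reaction)


def count_triples(reports):
    """Reduce step: sum how often each emitted key occurs."""
    counts = Counter()
    for drugs, reactions in reports:
        counts.update(map_report(drugs, reactions))
    return counts


# Hypothetical reports: (drugs taken, reactions observed).
reports = [
    (["drug_a", "drug_b", "drug_c"], ["nausea"]),
    (["drug_b", "drug_a"], ["nausea", "headache"]),
]

triple_counts = count_triples(reports)
print(triple_counts[("drug_a", "drug_b", "nausea")])
```

<p>Sorting the drug names before pairing them means that the same pair always produces the same key, regardless of the order in which a report lists its drugs.</p>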
<h1>Visualizing the Results</h1>
<p>The output of our analysis is a collection of drug-drug-reaction triples that have very large disproportionality scores. But as we all know, <a href="http://xkcd.com/552/" target="_blank">correlation is not causation</a>. The output of our analysis provides us with useful information that should be filtered and evaluated by domain experts and used as the basis for further study using controlled experiments.</p>
<p>With that caveat in mind, our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol&#160;was correlated with memory impairment, and <a href="http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000604/" target="_blank">haloperidol</a> in conjunction with <a href="http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000560/" target="_blank">lorazepam</a> was correlated with the patient entering into a coma.</p>
<p>Even with restrictive filters applied to the drug-drug-reaction triples, we still end up with tens of thousands of triples that score high enough to merit further investigation. In addition to looking at individual triples, we can also use graph visualization tools like Gephi to explore the macro-level structure of the data. Gephi has a number of powerful layout algorithms and filtering tools that allow us to impose structure on an undifferentiated mass of data points.&#160;Here is a graph in which the vertices are drugs and the thickness of each edge represents the number of high-scoring adverse reactions that feature that pair of drugs:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/wholegraph1.png"><img class="alignnone size-full wp-image-9393" src="https://www.cloudera.com/wp-content/uploads/2011/11/wholegraph1.png" alt="" width="1420" height="811" /></a></p>
<p>We can also pan and zoom to different regions of the graph and highlight clusters of drug interactions. Here is a cluster of drugs that are used in treating HIV:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/hivcluster.png"><img class="size-full wp-image-9395" src="https://www.cloudera.com/wp-content/uploads/2011/11/hivcluster.png" alt="A cluster of HIV-related drugs" width="1420" height="811" /></a></p>
<p>And here is a cluster of drugs that are used to fight cancer:</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/cancercluster.png"><img class="size-full wp-image-9396" src="https://www.cloudera.com/wp-content/uploads/2011/11/cancercluster.png" alt="A cluster of cancer-related drugs" width="1297" height="811" /></a></p>
<p><span style="font-weight: normal">The combination of Apache Hadoop, R, and Gephi changes the way we think about analyzing adverse drug events. Instead of focusing on a handful of outcomes, we can process all of the events in the data set at the same time. We can try out hundreds of different strategies for cleaning records, stratifying observations into clusters, and scoring drug-reaction tuples, run everything in parallel, and analyze the data at a fraction of the cost of a traditional supercomputer. We can render the results of our analyses using visualization tools that can be used by domain experts to explore relationships within our data that they might never have thought to look for. By dramatically reducing the costs of exploration and experimentation, we foster an environment that enables innovation and discovery.</span></p>
<h1><strong>Open Data, Open Analysis</strong></h1>
<p>This project was possible because the <a href="http://www.fda.gov/Drugs/default.htm" target="_blank">FDA&#8217;s Center for Drug Evaluation and Research</a> makes a portion of their data open and available to anyone who wants to <a href="http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm" target="_blank">download it</a>. In turn, we are releasing a well-commented version of the code we used to analyze that data &#8211;&#160;a mixture of Java, Pig, R, and Python &#8211;&#160;on the <a href="http://github.com/cloudera/ades" target="_blank">Cloudera GitHub repository</a> under the&#160;<a href="http://www.apache.org/licenses/LICENSE-2.0.html" target="_blank">Apache License</a>. We also contributed the most useful Pig function developed for this project, which&#160;<a href="http://code.google.com/p/szl/source/browse/trunk/src/emitters/szlcomputequantiles.cc" target="_blank">computes approximate quantiles for a stream of&#160;numbers</a>, to LinkedIn&#8217;s <a href="https://github.com/linkedin/datafu" target="_blank">datafu</a> library. We hope to collaborate with the community to improve the models over time and create a resource for students, researchers, and fellow data scientists.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011: A Glimpse into Development</title>
		<link>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/</link>
		<comments>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 13:00:42 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[careers]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[Cloudera's Service and Configuration Manager]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop event]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9240</guid>
		<description><![CDATA[The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hadoopworld.com/"><img style="float: left; padding-right: 20px;" title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" /></a></p>
<p>The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.</p>
<h2 style="font-size: 14pt; color: #344152;"><a href="http://www.hadoopworld.com/tracks/development-developers/" target="_blank">Preview of Development Track Sessions</a></h2>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Building Web Analytics Processing on Hadoop at CBS Interactive</span></a><br />
 <em>Michael Sun, CBS Interactive</em></p>
<p><strong>Abstract:</strong> CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack&#8212;the Extraction, Transformation and Loading framework we built based on Python and streaming, which is under review for open-source release&#8212;Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance and scalability, and a significant reduction in the processing time needed to reach its SLA (a reduction of over six hours so far).</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Gateway: Cluster Virtualization Framework</span></a><br />
<em>Konstantin Shvachko, eBay</em></p>
<p><strong>Abstract:</strong> Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks &#8212; as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users&#8217; workplace computers through corporate firewalls; the ability to fail over to active clusters during scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">SHERPASURFING &#8211; Open Source Cyber Security Solution</span></a><br />
<em>Wayne Wheeles, Novii Design</em></p>
<p><strong>Abstract:</strong> Every day billions of packets, both benign and some malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Cloudera&#8217;s Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission at scale and at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics.</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools</span></a><br />
<em>Arvind Prabhakar, Cloudera<br />
Guy Harrison, Quest Software</em></p>
<p><strong>Abstract:</strong> As Hadoop graduates from pilot project to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We&#8217;ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we&#8217;ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc.</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Next Generation Apache Hadoop MapReduce</span></a><br />
<em>Mahadev Konar, Hortonworks</em></p>
<p><strong>Abstract:</strong> The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of MapReduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1.</p>
<p><a href="http://www.hadoopworld.com/"><img title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/12/registernow.gif" alt="Register for Hadoop World" /></a></p>
<p>There are several <a href="http://www.hadoopworld.com/training/">training classes</a> and <a href="http://www.hadoopworld.com/training/">certification sessions</a> provided surrounding the Hadoop World conference. Don&#8217;t forget to register and become <a href="http://www.hadoopworld.com/training/">Cloudera Certified in Apache Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Features in Apache Pig 0.8</title>
		<link>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/</link>
		<comments>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/#comments</comments>
		<pubDate>Tue, 21 Dec 2010 14:34:58 +0000</pubDate>
		<dc:creator>John Kreisa</dc:creator>
				<category><![CDATA[pig]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdh3b2]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5691</guid>
		<description><![CDATA[This is a guest post contributed by Dmitriy Ryaboy (@squarecog) and was originally published in his blog on December 19th. We thought&#160;the information&#160;was&#160;valuable enough&#160;that it was worth&#160;reposting to spread the word even further.&#160; The Pig 0.8 release includes a large number of bug fixes and optimizations, but at the core it is a feature release. [...]]]></description>
			<content:encoded><![CDATA[<h2 id="post-104">This is a guest post contributed by Dmitriy Ryaboy (@squarecog) and was originally published in his <a href="http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/" target="_blank">blog</a> on December 19th. We thought&#160;the information&#160;was&#160;valuable enough&#160;that it was worth&#160;reposting to spread the word even further.&#160;</h2>
<div>
<p>The Pig 0.8 release includes a large number of bug fixes and optimizations, but at the core it is a feature release. It&#8217;s been in the works for almost a full year (most of the work on 0.7 was completed by January of 2010, although it took a while to actually get the release out), and the amount of time spent on 0.8 really shows.</p>
<p>I <a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/">meant</a> to describe these in detail in a series of posts, but it seems blogging regularly is not my forte. This release is so chock-full of great new features, however, that I feel compelled to at least list them. So, behold, in no particular order, a non-exhaustive list of new features I am excited about in Pig 0.8:</p>
<ul><li><strong>Support for UDFs in scripting languages</strong></li></ul>
<p>This is exactly what it sounds like &#8212; if your favorite language has a JVM implementation, it can be used to create Pig UDFs.</p>
<p>Pig now ships with support for UDFs in Jython, but other languages can be supported by implementing a few interfaces. Details about the Pig UDFs in Python can be found here: <a href="http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs">http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs</a></p>
<p>This is the outcome of <a href="http://issues.apache.org/jira/browse/PIG-928">PIG-928</a>; it was quite a pleasure to watch this develop over time &#8212; while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project &#8212; Kishore Gopalakrishna&#8217;s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone&#8217;s feedback and pushing the patch through the last mile).</p>
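<p>For a feel of what a scripted UDF looks like, here is a minimal sketch of a Jython UDF file. The file name <code>udfs.py</code>, the function <code>normalize_color</code> and its schema are hypothetical. The sketch assumes, per the documentation linked above, that Pig makes the <code>outputSchema</code> decorator available when it loads the script; the fallback definition is only there so the file can also be run and tested outside Pig.</p>

```python
# udfs.py (hypothetical file name)
try:
    outputSchema  # assumed to be supplied by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so this file also runs outside Pig."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("color:chararray")
def normalize_color(raw):
    """Trim and lower-case an eye color string; pass nulls through."""
    if raw is None:
        return None
    return raw.strip().lower()


print(normalize_color("  Blue "))
```

<p>Following the linked documentation, a script would then load it with something like <code>REGISTER 'udfs.py' USING jython AS myfuncs;</code> and call <code>myfuncs.normalize_color(eye_color)</code> inside a <code>FOREACH ... GENERATE</code>.</p>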
<ul><li><strong>PigUnit</strong></li></ul>
<p>A contribution by Romain Rigaux, PigUnit is exactly what it sounds like &#8212; a tool that simplifies the Pig users&#8217; lives by giving them a simple way to unit test Pig scripts.</p>
<p>The documentation at <a href="http://pig.apache.org/docs/r0.8.0/pigunit.html">http://pig.apache.org/docs/r0.8.0/pigunit.html</a> and the code at <a href="http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java?view=markup">http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java?view=markup</a> speak for themselves as far as usage goes.</p>
<ul><li><strong>PigStats</strong></li></ul>
<p>Pig can now provide much better visibility into what is going on inside a Pig job than it ever did before, thanks to extensive work by Richard Ding (see <a href="http://issues.apache.org/jira/browse/PIG-1333">PIG-1333</a> and <a href="http://issues.apache.org/jira/browse/PIG-1478">PIG-1478</a>). This feature comes in three parts:</p>
<p>1. Script statistics.<br />
This is the most easily visible change. At the end of running a script, Pig will output a table with some basic statistics regarding the jobs that it ran. It looks something like this:</p>
<p>Job Stats (time in seconds):</p>
<table>
<tbody>
<tr>
<td>JobId</td>
<td>Maps</td>
<td>Reduces</td>
<td>Max<br />
Map<br />
Time</td>
<td>Min<br />
Map<br />
Time</td>
<td>Avg<br />
Map<br />
Time</td>
<td>Max<br />
Reduce<br />
Time</td>
<td>Min<br />
Reduce<br />
Time</td>
<td>Avg<br />
Reduce<br />
Time</td>
<td>Alias</td>
<td>Feature</td>
<td>Outputs</td>
</tr>
<tr>
<td>job_xxx</td>
<td>1654</td>
<td>218</td>
<td>84</td>
<td>6</td>
<td>14</td>
<td>107</td>
<td>87</td>
<td>99</td>
<td>counted_data,<br />
data,<br />
grouped_data</td>
<td>GROUP_BY,<br />
COMBINER</td>
<td>&#160;</td>
</tr>
<tr>
<td>job_xxx</td>
<td>2</td>
<td>1</td>
<td>9</td>
<td>6</td>
<td>7</td>
<td>13</td>
<td>13</td>
<td>13</td>
<td>ordered_data</td>
<td>SAMPLER</td>
<td>&#160;</td>
</tr>
<tr>
<td>job_xxx</td>
<td>2</td>
<td>1</td>
<td>26</td>
<td>18</td>
<td>22</td>
<td>31</td>
<td>31</td>
<td>31</td>
<td>ordered_data</td>
<td>ORDER_BY</td>
<td>hdfs://tmp/out,</td>
</tr>
</tbody>
</table>
<p>This is extremely useful when debugging slow jobs, as you can immediately identify which stages of your script are slow, and correlate the slow Map-Reduce jobs with the actual Pig operators and relations in your script &#8212; something that was not trivial before (folks often resorted to setting parallelism to slightly different numbers for every join and group just to figure out which job was doing what. No more of this!)</p>
<p>2. Data in Job XML</p>
<p>Pig now inserts several interesting properties into the Hadoop jobs that it generates, including the relations being generated, Pig features being used, and ids of parent Hadoop jobs. This is quite helpful when monitoring a cluster, and is also handy when examining job history using the HadoopJobHistoryLoader, now part of piggybank (use Pig to mine your job history!).</p>
<p>3. PigRunner API</p>
<p>The same information that is printed out when Pig runs the script from a command line is available if one uses the Java API to start Pig jobs. If you start a script using <code>PigRunner.run(String args[], PigProgressNotificationListener listener)</code>, you will get back a <a href="http://pig.apache.org/docs/r0.8.0/api/org/apache/pig/tools/pigstats/PigStats.html">PigStats</a> object that gives you access to the job hierarchy, the Hadoop counters from each job, and so on. You can implement the optional <a href="http://pig.apache.org/docs/r0.8.0/api/org/apache/pig/tools/pigstats/PigProgressNotificationListener.html">PigProgressNotificationListener</a> if you want to watch the job as it progresses; the listener will be notified as different component jobs start and finish.</p>
<p>Documentation of the API, new properties in the Job XML, and more, is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Pig+Statistics">http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Pig+Statistics</a></p>
<ul><li><strong>Scalar values</strong></li></ul>
<p>It&#8217;s very common to need a calculated statistic as an input to other calculations. For example, consider a data set that consists of people and their eye color; we want to calculate the fraction of the total population that has a given eye color. The script looks something like this:</p>
<pre>people = LOAD '/data/people' using PigStorage()
  AS (person_id:long, eye_color:chararray);
num_people = FOREACH (group people all)
  GENERATE COUNT(people) AS total;
eye_color_fractions = FOREACH ( GROUP people BY eye_color )
  GENERATE
    group as eye_color,
    COUNT(people) / num_people.total AS fraction;</pre>
<p>&#160;</p>
<p>Pretty straightforward, except it does not work. What&#8217;s happening in the above code is that we are referencing the relation <code>num_people</code> from inside another relation, <code>eye_color_fractions</code>, and this doesn&#8217;t really make sense if Pig does not know that <code>num_people</code> only has one row.</p>
<p>In the past you had to do something hacky like joining the two relations on a constant to replicate the total into each row, and then generate the division. Needless to say, this was not entirely satisfactory. In <a href="http://issues.apache.org/jira/browse/PIG-1434">PIG-1434</a> Aniket Mokashi tackled this, implementing an elegant solution that hides all of these details from the user &#8212; you can now simply cast a single-row relation as a scalar, and use it as desired. The above script becomes:</p>
<pre>people = LOAD '/data/people' using PigStorage()
  AS (person_id:long, eye_color:chararray);
num_people = FOREACH (group people all)
  GENERATE COUNT(people) AS total;
eye_color_fractions = FOREACH ( GROUP people BY eye_color )
  GENERATE
    group as eye_color,
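    (double) COUNT(people) / <strong>(long)</strong> num_people.total AS fraction;</pre>
<p>&#160;</p>
<p>The <code>(long)</code> cast makes the scalar conversion explicit, but Pig is now smart enough to do this implicitly as well. The <code>(double)</code> cast on the count is there because dividing one long by another is integer division in Pig, which would truncate every fraction to 0. A runtime exception is generated if the relation being used as a scalar contains more than one tuple.</p>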
<p>More documentation of this feature is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars">http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars</a></p>
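<p>To make the semantics concrete, here is a minimal Python sketch of what the script computes. It is only an illustration of the idea, not how Pig implements it: the helper names <code>as_scalar</code> and <code>eye_color_fractions</code> are hypothetical, and returning <code>None</code> for an empty relation is an assumption of this sketch.</p>

```python
from collections import Counter


def as_scalar(relation):
    """Treat a relation (a list of rows) as a single value.

    Mirrors the rule described above: more than one row is an error.
    Returning None for an empty relation is an assumption of this sketch.
    """
    if len(relation) > 1:
        raise ValueError("scalar has more than one row: %d" % len(relation))
    return relation[0] if relation else None


def eye_color_fractions(people):
    """people is a list of (person_id, eye_color) tuples, like the LOAD schema."""
    num_people = [len(people)]          # GROUP ... ALL produces a single row
    total = as_scalar(num_people)       # the single row is used as a scalar
    counts = Counter(color for _, color in people)
    # Python's / is floating-point division, so no extra cast is needed here.
    return {color: count / total for color, count in counts.items()}
```

<p>For four people with eye colors brown, blue, brown, and green, this yields 0.5 for brown and 0.25 each for blue and green.</p>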
<li><strong>Monitored UDFs</strong></li>
<p>A new annotation has been added, <code>@MonitoredUDF</code>, which makes Pig spawn a watcher thread that kills an execution that is taking too long and returns a default value instead. This comes in handy when dealing with certain operations like complex regular expressions. More documentation is available at <a href="http://pig.apache.org/docs/r0.8.0/udf.html#Monitoring+long-running+UDFs">http://pig.apache.org/docs/r0.8.0/udf.html#Monitoring+long-running+UDFs</a></p>
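<p>The general pattern behind a monitored function can be sketched in a few lines of Python. This is an illustration of the timeout-with-default idea only, not Pig&#8217;s implementation; <code>run_with_timeout</code> is a hypothetical helper, and unlike Pig&#8217;s watcher this sketch cannot kill the slow call, it simply stops waiting for it.</p>

```python
import threading


def run_with_timeout(func, args, timeout_s, default):
    """Run func(*args) in a worker thread and wait at most timeout_s seconds.

    Returns the function's result if it finishes in time, otherwise default.
    The worker is a daemon thread, so an abandoned call does not keep the
    process alive; it is not forcibly stopped.
    """
    result = {}

    def target():
        result["value"] = func(*args)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)          # the "watcher" part: bounded wait
    if worker.is_alive():
        return default              # took too long, fall back
    # If func raised, no value was stored; this sketch falls back as well.
    return result.get("value", default)
```

<p>A call such as <code>run_with_timeout(slow_regex_match, (line,), 0.1, None)</code>, where <code>slow_regex_match</code> stands in for your own function, would give up after a tenth of a second and yield <code>None</code>.</p>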
<li><strong>Automatic merge of small files</strong></li>
<p>This is a simple one, but useful &#8212; when running Pig over many small files, instead of creating a map task per file (paying the overhead of scheduling and running a task for a computation that might only take a few seconds), we can merge the inputs and create a few map tasks that are a bit more hefty.</p>
<p>Two properties control this behavior: <code>pig.maxCombinedSplitSize</code> controls the maximum size of the resulting split, and <code>pig.splitCombination</code> controls whether or not the feature is activated in the first place (it is on by default).</p>
<p>This work is documented in the ticket <a href="http://issues.apache.org/jira/browse/PIG-1518">PIG-1518</a>; there are additional details in the release notes attached to the ticket.</p>
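<p>The effect of combining inputs can be sketched with a simple greedy grouping in Python. This is not Pig&#8217;s actual algorithm (see the ticket for that); <code>combine_splits</code> and its parameters are hypothetical names, and the sketch only shows why a size cap turns many tiny inputs into a few larger ones.</p>

```python
def combine_splits(file_sizes, max_combined_size):
    """Greedily pack input sizes (in bytes) into combined splits.

    A new combined split is started whenever adding the next input would
    push the current one past max_combined_size. An input that is larger
    than the cap on its own simply gets a split to itself.
    """
    groups = []
    current = []
    current_size = 0
    for size in file_sizes:
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

<p>With a cap of 100, the inputs <code>[40, 40, 40, 200, 10]</code> become four combined splits instead of five map tasks; with thousands of tiny files the reduction is far more dramatic.</p>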
<li><strong>Generic UDFs</strong></li>
<p>I <a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/">wrote about this one</a> before &#8212; a small feature that allows you to invoke static Java methods as Pig UDFs without needing to wrap them in custom code.</p>
<p>The official documentation is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Dynamic+Invokers">http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Dynamic+Invokers</a></p>
<li><strong>Safeguards against missing PARALLEL keyword</strong></li>
<p>One of the more common mistakes people make when writing Pig scripts is forgetting to specify parallelism for operators that need it. The default behavior used to be that this means parallelism of 1, which can lead to extremely inefficient jobs. A patch by Jeff Zhang in <a href="http://issues.apache.org/jira/browse/PIG-1249">PIG-1249</a> changes this behavior to instead use a simple heuristic: if parallelism is not specified, derive the number of reducers by taking <code>MIN(max_reducers, total_input_size / bytes_per_reducer)</code>. Max number of reducers is controlled by the property <code>pig.exec.reducers.max</code> (default 999) and bytes per reducer are controlled by <code>pig.exec.reducers.bytes.per.reducer</code> (default 1GB).</p>
<p>This is a safeguard, not a panacea: it only works with file-based input, and it estimates the number of reducers based on input size, not the size of the intermediate data. If you have a highly selective filter, or you are grouping a large dataset by a low-cardinality field, it will produce a bad number. Still, it&#8217;s a nice safeguard against dramatic misconfigurations.</p>
<blockquote><p>When porting to Apache Pig 0.8, remember to audit your scripts for parallelized operators that do not specify the <code>PARALLEL</code> keyword &#8212; if the intent is to use a single reducer, make this intent explicit by specifying <code>PARALLEL 1</code>.</p>
</blockquote>
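<p>The heuristic is easy to reason about with a small Python sketch. The function name <code>estimate_reducers</code> is hypothetical; rounding up, using at least one reducer, and treating 1GB as 10<sup>9</sup> bytes are assumptions of this sketch rather than details taken from the patch.</p>

```python
def estimate_reducers(total_input_bytes,
                      max_reducers=999,
                      bytes_per_reducer=10**9):
    """Estimate reducers as MIN(max_reducers, input size / bytes per reducer).

    max_reducers corresponds to pig.exec.reducers.max and bytes_per_reducer
    to pig.exec.reducers.bytes.per.reducer. Rounding up and the floor of one
    reducer are assumptions made for this illustration.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    estimate = (total_input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, estimate))
```

<p>Under these assumptions, 50GB of input gets 50 reducers, while 5TB is capped at 999. Note that a script whose filter keeps only a sliver of that 5TB would still get 999 reducers, which is exactly the kind of bad number described above.</p>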
<li><strong>HBaseStorage</strong></li>
<p>HBaseStorage has been shored up in Pig 0.8. It can now read data stored as bytes instead of requiring all numbers to be converted to Strings, and it accepts a number of options: limit the number of rows returned, push down filters on HBase keys, etc. It can also now be used to write to HBase in addition to reading from it. Details about the options, etc, can be found in the Release Notes section of <a href="http://issues.apache.org/jira/browse/PIG-1205">PIG-1205</a>.</p>
<p>Note that at the moment this only works with the HBase 0.20.{4,5,6} releases, and does not work with 0.89+. There is a patch in <a href="http://issues.apache.org/jira/browse/PIG-1680">PIG-1680</a> that you can apply if you need 0.89 and 0.90 compatibility; it is not applied to the main codebase yet, as it is not backwards compatible.</p>
<p>We are very interested in help making this storage engine more featureful; please feel free to jump in and contribute!</p>
<li><strong>Support for custom Map-Reduce jobs in the flow</strong></li>
<p>Although we try to make these a rarity, sometimes cases come up in which a custom Map-Reduce job fits the bill better than Pig. Weaving a Map-Reduce job into the middle of a Pig workflow was awkward before &#8212; you had to use something like Oozie or Azkaban, or write your own workflow application. Pig 0.8 introduces a simple &#8220;MAPREDUCE&#8221; operator which allows you to invoke an opaque MR job in the middle of the flow, and continue with Pig:</p>
<pre>text = load 'WordcountInput.txt';
wordcount = MAPREDUCE wordcount.jar
  STORE text INTO 'inputDir'
  LOAD 'outputDir' AS (word:chararray, count: int)
  `org.myorg.WordCount inputDir outputDir`;</pre>
<p>&#160;</p>
<p>Details are available on the wiki page: <a href="http://wiki.apache.org/pig/NativeMapReduce">http://wiki.apache.org/pig/NativeMapReduce</a></p>
<p>The ticket for this one has been open for a while, since Pig 0.2 days, and it&#8217;s nice to see it finally implemented. Thumbs up to Aniket Mokashi for this one.</p>
<li><strong>Custom Partitioners</strong></li>
<p>This feature, also implemented by the amazingly productive Aniket Mokashi, is a bit of a power-user thing (and another ancient ticket, PIG-282). It allows the Pig script author to control the function used to distribute map output among reducers. By default, Pig uses a hash partitioner, but sometimes a custom algorithm is required when the script author knows something specific about the reduce key distribution. When that is the case, a user can now specify the Hadoop Partitioner to swap in instead of the default:</p>
<p><code>B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; </code></p>
<p>More specific documentation can be found in the Release Notes section of <a href="http://issues.apache.org/jira/browse/PIG-282">PIG-282</a></p>
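<p>To illustrate why knowledge of the key distribution matters, here is a small Python sketch contrasting a plain hash partitioner with a custom one that gives known heavy keys their own reducers. It is a conceptual illustration only: a real partitioner for Pig is a Java class implementing Hadoop&#8217;s Partitioner, and the names <code>hash_partition</code> and <code>make_custom_partition</code> are hypothetical.</p>

```python
import zlib


def hash_partition(key, num_reducers):
    """Default-style behavior: pick a reducer from a hash of the key."""
    # crc32 is used because it is stable across runs, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def make_custom_partition(hot_keys):
    """Build a partition function that dedicates one reducer per hot key.

    Hot keys take reducers 0..len(hot_keys)-1; every other key is hashed
    across the remaining reducers so it never lands next to a hot key.
    """
    hot = {key: index for index, key in enumerate(sorted(hot_keys))}

    def partition(key, num_reducers):
        if key in hot:
            return hot[key] % num_reducers
        remaining = num_reducers - len(hot)
        if remaining > 0:
            return len(hot) + hash_partition(key, remaining)
        # Not enough reducers to reserve any; fall back to plain hashing.
        return hash_partition(key, num_reducers)

    return partition
```

<p>With hot keys <code>uk</code> and <code>us</code> and four reducers, those two keys go to reducers 0 and 1, and all other keys are spread over reducers 2 and 3.</p>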
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadoop World: NYC &#8211; Training</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoopworld-training/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoopworld-training/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 15:00:23 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[hadoopworld]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4354</guid>
		<description><![CDATA[Hadoop Training surrounding Hadoop World: NYC.]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.</p>
<p style="text-align: justify">We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects, we have the training you are looking for.</p>
<p style="text-align: center"><img class="size-full wp-image-4403    aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></p>
<p style="text-align: justify">Included with our top-notch Hadoop training you will have full access to Hadoop World free of charge.</p>
<p style="text-align: justify">Available Training Sessions include:<span id="more-4354"></span></p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 11:</span></h2>
<h3 style="text-align: justify"><em>Introduction to Hadoop</em>:&#160;<a href="http://www.eventbrite.com/event/762326138">http://www.eventbrite.com/event/762326138</a></h3>
<p style="text-align: justify">This one-day course provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop. This session is designed for developers, analysts or system administrators that are new to Hadoop. This course provides the pre-requisite knowledge for the later classes: Developer Training, Administrator Training or Analyzing Data with Hive and Pig.</p>
<h3 style="text-align: justify"><em>Hadoop Essentials For Managers: </em><em> </em><a href="http://www.eventbrite.com/event/762237874">http://www.eventbrite.com/event/762237874</a></h3>
<p style="text-align: justify">This one-day course will give decision-makers the information they need to know about Apache Hadoop, answering questions such as:</p>
<ul style="text-align: justify">
<li>When is Hadoop appropriate?</li>
<li>What are people using Hadoop      for?</li>
<li>How does Hadoop fit into our      existing environment?</li>
<li>What do I need to know about      choosing Hadoop?</li>
</ul>
<h3 style="text-align: justify"><em>Cloudera HUE SDK Training</em>:&#160;<a href="http://www.eventbrite.com/event/764021208">http://www.eventbrite.com/event/764021208</a></h3>
<p style="text-align: justify">Cloudera Hue provides developers with back end APIs to simplify interacting with Hadoop and front end APIs to deliver rich, web based, graphical user experiences. For this training, developers should have experience building web apps using modern MVC frameworks and Ajax. Experience with Python and Django is a strong plus. In this session we spend half the day covering the following topics, and the other half of the day interactively building applications with the Cloudera Hue team.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 13 &amp; 14:</span></h2>
<h3 style="text-align: justify"><em>Developer Training &amp; Certification</em>:&#160;<a href="http://www.eventbrite.com/event/762320120">http://www.eventbrite.com/event/762320120</a></h3>
<p style="text-align: justify">In this two-day hands-on session, developers learn the MapReduce framework and how to write programs against its API. In addition to learning how to write individual MapReduce jobs, we discuss design techniques for larger workflows. This course also covers advanced skills for debugging MapReduce programs and optimizing their performance. At the end of the course, attendees have the option to take a certification exam documenting their understanding of the concepts taught during the training session.</p>
<h3 style="text-align: justify"><em>Administrator Training &amp; Certification:</em> <a href="http://www.eventbrite.com/event/762677188">http://www.eventbrite.com/event/762677188</a></h3>
<p style="text-align: justify">This two-day hands-on session covers the system administration aspects of Hadoop from installation and configuration to load balancing and tuning including diagnosing and solving problems in your deployment. At the end of the course, attendees have the option of taking a certification exam documenting their understanding of the concepts taught at the training session.</p>
<h3 style="text-align: justify"><em>Analyzing Data with Hive and Pig:</em> <a href="http://www.eventbrite.com/event/762318114">http://www.eventbrite.com/event/762318114</a></h3>
<p style="text-align: justify">Cloudera&#8217;s two-day hands-on course on Hive and Pig is designed for people who have a basic understanding of how Hadoop works and want to utilize these languages for analysis of their data. Hive makes Hadoop accessible to users who already know SQL; Pig is similar to popular scripting languages. This course teaches you how to process data by using filters, joins, user-defined functions and more.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 15:</span></h2>
<h3 style="text-align: justify"><em>HBase Training</em>:&#160;<a href="http://www.eventbrite.com/event/762317111">http://www.eventbrite.com/event/762317111</a></h3>
<p style="text-align: justify">This one-day hands-on course gives you the necessary knowledge for using HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. This class covers the HBase architecture, data model, and Java API as well as advanced topics and best practices. This course is for developers who already have a basic understanding of Hadoop (Java experience is recommended).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoopworld-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Migrating to CDH</title>
		<link>http://www.cloudera.com/blog/2010/08/migrating-to-cdh3/</link>
		<comments>http://www.cloudera.com/blog/2010/08/migrating-to-cdh3/#comments</comments>
		<pubDate>Tue, 03 Aug 2010 01:32:37 +0000</pubDate>
		<dc:creator>Eric Sammer</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[ZooKeeper]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4147</guid>
		<description><![CDATA[With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera&#8217;s Distribution for Hadoop (CDH). One of the questions we often hear is, &#8220;what does it take to migrate?&#8221;. Why Migrate? If you&#8217;re not familiar with CDH3b2, here&#8217;s what you need to know. All versions of CDH provide: RPM [...]]]></description>
			<content:encoded><![CDATA[<p>With the <a href="http://www.cloudera.com/blog/2010/06/cdhv3-and-cloudera-enterprise/">recent release of CDH3b2</a>, many users are more interested than ever to try out Cloudera&#8217;s Distribution for Hadoop (CDH). One of the questions we often hear is, &#8220;what does it take to migrate?&#8221;.</p>
<h2>Why Migrate?</h2>
<p>If you&#8217;re not familiar with CDH3b2, here&#8217;s what you need to know.</p>
<p>All versions of CDH provide:</p>
<ul>
<li>RPM and Debian packages for simple installation and management.</li>
<li>Clean integration with the host operating system. Logs are in <code>/var/log</code>, common binaries in <code>/usr/bin</code>, and configuration in <code>/etc</code>.</li>
<li>A Cloudera support-ready distribution. As Hadoop becomes a mission critical component of your production infrastructure, you&#8217;ll want the option of engaging Cloudera for support or consulting services. Running CDH makes this process simple.</li>
</ul>
<p>CDH3b2 additionally is:</p>
<ul>
<li>A complete platform with smooth integration of popular projects such as Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, and HUE. HDFS and Hadoop Map Reduce are only two parts of a larger system. CDH3b2 brings together tools and frameworks to get data in and out of HDFS, coordinate complex processing pipelines, and process and analyze your data. <a href="http://www.cloudera.com/blog/2010/07/more-on-clouderas-distribution-for-hadoop-3/">Learn more</a> about this.</li>
<li>Based on Apache Hadoop 0.20.2 with 320 patches worth of feature back ports, stability enhancements, and bug fixes.</li>
</ul>
<h2>Overview</h2>
<p>The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you&#8217;ll also want to take recent backups of any mission-critical data sets as well as the name node metadata.</p>
<p>Backing up your data is most important if you&#8217;re upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20. There were changes in the open source HDFS implementation prior to 0.20 that force this upgrade. See the section below on compatibility for more details.</p>
<p>The process I&#8217;ll outline here is as follows:</p>
<ul>
<li>CDH version selection</li>
<li>Options for installation</li>
<li>Installation process</li>
<li>Migration of configuration data</li>
<li>Testing your cluster</li>
</ul>
<h2>Selecting a Branch</h2>
<p>One of the first questions you should ask yourself is what level of stability versus new features you require from Hadoop. If you&#8217;re managing a production Hadoop cluster with jobs with SLAs, you need a rock solid, production-proven Hadoop distribution. This is Cloudera&#8217;s stable or production branch. At the time of this writing, this is CDH2 based on Hadoop 0.20.1+169.89. In certain cases, features may be of greater priority, in which case, CDH3 0.20.2+320 is appropriate.</p>
<p>It&#8217;s important to note that both CDH2 and CDH3 pass all functional and unit tests at Cloudera. The real difference between them is that CDH2 has been in the field longer. We generally promote a release to stable when we&#8217;ve seen it running production workloads for a substantial period of time, and when the rate of issues opened against the distro in our support group tails off. We have customers running in production today on both CDH2 and CDH3.</p>
<h2>On Compatibility</h2>
<p>Before we dive into the installation process I&#8217;ll highlight some points on compatibility. When upgrading to CDH from an older version or another distribution of Hadoop, it&#8217;s possible that HDFS data needs to be taken through an upgrade process. This is relatively simple, but as with any upgrade of critical data, it is absolutely necessary to back up your data.</p>
<p>Currently, it is not necessary to perform an HDFS upgrade if you&#8217;re upgrading to CDH3 from CDH2 or Apache Hadoop versions 0.20.0 or later. In fact, any distribution of Hadoop based on Apache 0.20.0 is likely to be a clean transition without an update to HDFS required, but you should always check with the distributor.</p>
<p>During RPC operations, all Hadoop daemons will check to ensure they are speaking to the same exact version as themselves. This means that you cannot, at present, perform a rolling upgrade of CDH. There has been some discussion about relaxing this requirement so compatible versions of Hadoop can communicate, but this has not yet been implemented.</p>
<h2>Installation Options</h2>
<p>CDH is available in three forms: RPMs, debs, and tarball distributions. The preferred method of installation is usually the RPM or deb packages as they automate a lot of the work required to get CDH up and running quickly. Tarballs of CDH are useful for users on systems that do not use yum/rpm or apt/dpkg, or where you do not have root access to the host operating system.</p>
<h2>Installing CDH</h2>
<p>When installing CDH from RPMs or Debian packages you will definitely want to take advantage of Cloudera&#8217;s yum or apt repository support. If you&#8217;re on a system that does not use rpm or deb format packages, you can still use Cloudera&#8217;s binary tarball packages.</p>
<p>You should follow the normal process for installing CDH on your systems. The CDH packages should be installed on all nodes in the cluster. The rpm and deb packages of CDH will automatically create a hadoop user and group as well as SYSV init scripts as part of the install process. The CDH tarballs do not contain the init scripts and obviously do not create the hadoop user and group.</p>
<p>Detailed <a href="https://docs.cloudera.com/display/DOC/CDH3+Installation+Guide">installation instructions</a> for all formats of CDH are available.</p>
<p>After the packages are installed, you&#8217;ll want to make sure you set the proper daemons to start on the proper machines upon boot. There is a separate init script for each Hadoop daemon so only what is necessary is started.</p>
<p>Redhat example:<br />
<code>% chkconfig --level 3 hadoop-0.20-namenode on</code></p>
<p>Debian example:<br />
<code>% update-rc.d hadoop-0.20-namenode start 80 3 .</code></p>
<p>Make sure you specify the correct run level. While run level 3 is common for multiuser Linux servers, this may not be the case in your installation. You can use the runlevel command to find the currently active run level.</p>
<p>For now, do not start any of the Hadoop daemons.</p>
<h2>Migrating Your Configuration</h2>
<p>If you&#8217;re coming from an older version of CDH, your configuration should already be set up with alternatives. If not, now is a good time to bring your configuration layout in line with CDH by moving your conf directory to <code>/etc/hadoop-0.20/conf.mycluster</code>. You should also configure alternatives to know about your new configuration. The <a href="https://docs.cloudera.com/display/DOC/CDH3+Installation">CDH documentation</a> covers this in detail. For now, register your new configuration with alternatives and set it to be the preferred configuration.</p>
<p><code><br />
% alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.mycluster 100<br />
% alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.mycluster<br />
</code></p>
<p>Users who are on systems that don&#8217;t have alternatives or who are installing CDH from tarballs should simply update the configuration files in <code>$HADOOP_HOME/conf</code>. Normally, <code>$HADOOP_HOME</code> is <code>/usr/local/hadoop-$VERSION</code> or <code>/opt/hadoop-$VERSION</code> but you can put it wherever it makes sense. This includes running CDH from your home directory if you don&#8217;t have root access.</p>
<h2>Testing CDH</h2>
<p>Now that CDH is installed and you&#8217;ve migrated your cluster configuration it&#8217;s time to fire up a few nodes and make sure everything is working as expected. Rather than bring up all the daemons at once, let&#8217;s focus on the name node first.</p>
<p>Start by logging on to the name node machine. You may want to manually rotate the log file just to minimize the noise during testing. You can do this by simply moving today&#8217;s log file to a different name.</p>
<p><code>% mv /var/log/hadoop/hadoop-hadoop-namenode-nn.mycompany.com.log \<br />
/var/log/hadoop/hadoop-hadoop-namenode-nn.mycompany.com.log.old</code></p>
<p>Next, start the CDH name node daemon using the provided init script. If an HDFS upgrade is required, you can use the <code>upgrade</code> argument in place of <code>start</code> below. This will be your last chance to grab a backup of the name node&#8217;s metadata prior to starting the daemon.</p>
<p><code>% /etc/init.d/hadoop-0.20-namenode start</code></p>
<p>Note that the CDH init scripts require you to be root whereas the Apache Hadoop start-all.sh / stop-all.sh scripts should <em>not</em> be run as root.</p>
<p>It&#8217;s a good idea to check the contents of the name node log file now to ensure it has come up cleanly. You should see a warning about the name node being in safe mode due to missing blocks. This is OK because we haven&#8217;t brought up any data nodes yet. If something doesn&#8217;t look right, jump ahead to the getting help section before proceeding.</p>
<p>Before you start any of your data nodes, you&#8217;ll want to place the name node in safe mode manually. This will prevent the name node from &#8220;panicking&#8221; and trying to repair missing block replicas as data nodes begin to register themselves. You&#8217;ll need to run this command as the hadoop user.</p>
<p><code>% hadoop dfsadmin -safemode enter</code></p>
<p>Next start one of the data nodes and watch its logs as you did for the name node.</p>
<p><code>% /etc/init.d/hadoop-0.20-datanode start</code></p>
<p>If everything is set up correctly, you should see the data node start up, register with the name node, and start its periodic block scanner thread. You should also check the name node logs to confirm you see the data node registration message there as well. Once you&#8217;ve confirmed that things look good, you should move on to starting additional data nodes, checking them in batches as you go.</p>
<p>After all data nodes are up and running, you can use the Hadoop fsck tool to confirm that the file system is healthy.</p>
<p><code>% hadoop fsck /</code></p>
<p>Your cluster should still be in safe mode. If the file system is healthy, you can go ahead and take it out of safe mode.</p>
<p><code>% hadoop dfsadmin -safemode leave</code></p>
<p>Follow this with a quick test of HDFS by copying a file into the file system.</p>
<p><code>% date > now.txt<br />
% hadoop fs -put now.txt /now.txt<br />
% hadoop fs -cat /now.txt<br />
% hadoop fs -rm /now.txt<br />
% rm now.txt</code></p>
<p>Congratulations! You now have HDFS running on CDH.</p>
<p>If you had to upgrade the HDFS data &#8211; that is, you started the init script with the <code>upgrade</code> option &#8211; you should do some more extensive testing of your data. Once you&#8217;ve confirmed everything is working as expected, finalize the HDFS upgrade.</p>
<p><code>% hadoop namenode -finalize</code></p>
<p>Starting and testing the map reduce daemons follows a similar procedure but is a bit simpler. Start the job tracker daemon on the proper machine and monitor the logs as you did with the name node. Once you&#8217;ve confirmed the job tracker is running, proceed with starting the task tracker daemons in groups, checking the job tracker UI as you go. You should see the map and reduce task capacity increasing with each node you start. Don&#8217;t panic if the job tracker doesn&#8217;t see the nodes immediately; it can take a few seconds.</p>
<p>Don&#8217;t forget to start the secondary name node daemon as well. It&#8217;s usually a good idea to wait an hour or so and check the modification time on the files in the configured fs.checkpoint.dir. You should see that the files have been updated within the last hour. You can also check the secondary name node logs; you&#8217;ll see an indication things are working there as well in the form of some log messages about performing the checkpoint.</p>
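<p>The modification-time check on the checkpoint directory can be automated with a short Python sketch. This is a convenience illustration, not part of CDH: <code>stale_checkpoint_files</code> is a hypothetical helper, the one-hour threshold simply follows the rule of thumb above, and you must pass in whatever directory your <code>fs.checkpoint.dir</code> is configured to.</p>

```python
import os
import time


def stale_checkpoint_files(checkpoint_dir, max_age_s=3600, now=None):
    """Return files under checkpoint_dir not modified in the last max_age_s seconds.

    An empty list suggests the secondary name node has checkpointed recently.
    The now parameter exists only to make the function easy to test.
    """
    now = time.time() if now is None else now
    stale = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age_s:
                stale.append(path)
    return sorted(stale)
```

<p>Run it against the configured checkpoint directory about an hour after starting the daemon; any paths it returns are worth cross-checking against the secondary name node logs.</p>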
<h2>Documentation and References</h2>
<p>In addition to the community articles and blog posts on Hadoop, Cloudera provides CDH-specific documentation at <a href="http://docs.cloudera.com">docs.cloudera.com</a>. Here you can find information on CDH including all of its components like Hadoop, Hive, Flume, Sqoop, HUE, and others.</p>
<h2>How to Get Help</h2>
<p>There are a number of ways to get help if you run into trouble during your migration or if you just have questions.</p>
<ul>
<li><a href="http://docs.cloudera.com">Cloudera Documentation</a></li>
<li><a href="http://groups.google.com/a/cloudera.org/groups/dir">Cloudera mailing lists</a></li>
<li><a href="http://www.cloudera.com/resources/?media=Video">Cloudera videos</a></li>
<li>IRC users can join #cloudera on <a href="http://freenode.net">freenode</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/migrating-to-cdh3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig</title>
		<link>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/</link>
		<comments>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 13:52:22 +0000</pubDate>
		<dc:creator>John Kreisa</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[administration]]></category>
		<category><![CDATA[developer]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4177</guid>
		<description><![CDATA[Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your [...]]]></description>
			<content:encoded><![CDATA[<p>Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact <a href="mailto:sales@cloudera.com">sales@cloudera.com</a> for more information regarding on-site training, or visit <a href="http://www.cloudera.com/hadoop-training">www.cloudera.com/hadoop-training</a> to view our public course schedule.</p>
<p>Cloudera&#8217;s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.</p>
<p>Our Hive and Pig course is designed for developers who are skilled with SQL or scripting languages, but who are not Java experts. Hive and Pig are two approaches which allow non-Java programmers to access and manipulate massive amounts of data while abstracting away the complexities of MapReduce. Hive offers an SQL-like interface, while Pig&#8217;s scripting language, named PigLatin, is very easy for developers to learn. This course covers both technologies, and includes multiple hands-on exercises to reinforce key concepts.</p>
<p>Cloudera&#8217;s Hadoop for System Administrators course has recently been expanded from one day to two, and covers the important issues for System Administrators charged with looking after Hadoop clusters. Topics include planning and deploying the cluster, managing MapReduce jobs, scheduling jobs using the Fair Scheduler, cluster monitoring and troubleshooting, populating HDFS from existing relational database management systems with Sqoop, and using Flume to import logs and other files into HDFS.</p>
<p>Our most popular course, Hadoop for Developers, is a three-day offering which covers everything from an introduction to HDFS and MapReduce right through to advanced MapReduce APIs and algorithms. Students learn to build MapReduce jobs through a combination of instructor-led training and hands-on exercises; the course includes an exam offering students the chance to earn Cloudera Certified Hadoop Developer credentials.</p>
<p>A complete list of events including upcoming training is available here: <a href="http://www.cloudera.com/company/events/">http://www.cloudera.com/company/events/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What&#8217;s New in CDH3b2: Oozie</title>
		<link>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/</link>
		<comments>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 15:00:23 +0000</pubDate>
		<dc:creator>Arvind Prabhakar</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=3831</guid>
		<description><![CDATA[Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. &#160;In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. &#160;In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.</p>
<p>To deal with this challenge, the <a href="http://developer.yahoo.com/hadoop/">Yahoo! engineering team</a> created <a href="http://yahoo.github.com/oozie/">Oozie &#8211; the Hadoop workflow engine</a>. We are pleased to provide Oozie with Cloudera&#8217;s distribution for Hadoop starting with the beta-2 release.</p>
<h2>Why create a new workflow system?</h2>
<p>You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available. &#160;While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long running jobs and interfacing with the distributed file system require extensive work to port general workflow systems to the Hadoop environment. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly light-weight, requires minimal configuration, and scales linearly &#8211; thus offering a sustainable approach to building workflows in the Hadoop environment.</p>
<p>Still not convinced about Oozie? Consider these numbers for a moment. According to the <a href="http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010">Oozie presentation during Hadoop Summit</a> in June, there are more than 4,800 workflow applications deployed within Yahoo! at the moment, with the largest workflow containing 2,000 actions. The Yahoo! infrastructure team executed roughly 55,000 workflow jobs in May 2010 alone, with some workflows running for many hours.</p>
<h2>A Simple Use-Case</h2>
<p>Consider the example of web log analysis. For a typical operation that deals with a few gigabytes of log data every day, the steps involved in analyzing it can be many. First, the files have to be moved into a certain location. Next, the files are used to create new tables or partitions in Hive which are then queried to see if certain criteria have been met. For instance, if the number of accesses for a particular resource exceeds a certain&#160;threshold, some notifications must be generated. Regardless of the outcome of this analysis, certain other queries need to be run in order to populate other tables that record rolled-up information.</p>
<p>While these steps are not difficult to execute, they are repetitive and time consuming. Ideally, such steps should be automated in a manner that notifications are raised to operators when something interesting is discovered by the system or if there is a failure of some sort. That is exactly what Oozie does.</p>
<p>Using Oozie, all the steps outlined in this example can be modeled as a workflow which can be executed with a single command. Once the workflow takes off, you can sit back and relax while Oozie runs through each step of the flow.</p>
<h2>Oozie Highlights</h2>
<p>An Oozie workflow bundles the workflow definition, any libraries necessary for the execution of workflow actions, and properties that are necessary to resolve parameterized values in the workflow. Together, this bundle is referred to as an Oozie application or, informally, a workflow. These are deployed to the Oozie server using a command line utility. Once deployed, the workflows can be started and manipulated as necessary using the same utility. The web console for the Oozie server can be used to monitor the progress of various workflow jobs being managed by the server.</p>
<h3><em>Scalability</em></h3>
<p>Oozie is a server-based web application that uses a transactional store to manage workflow metadata and execution states. It relies on HTTP-based notifications and a polling mechanism to monitor the progress of workflows and to manage its runtime state. The Oozie server itself does not do any particular work other than this state management. All of the work is delegated to worker nodes within the cluster on which the workflow executes.&#160;This allows Oozie to scale horizontally by adding more Oozie servers pointing to the same workflow metadata store.</p>
<h3><em>Resilience</em></h3>
<p>When a workflow execution encounters a transient error condition, Oozie automatically attempts to execute the action again. In some situations, when the error requires user intervention, Oozie can suspend the workflow&#160;indefinitely&#160;allowing the administrator to step in, take corrective action, and resume the workflow. For long running workflows that fail, Oozie provides a mechanism by which the workflow can be restarted from the point of failure to avoid redoing the steps that may have already completed earlier.</p>
<h3><em>Simple and&#160;Intuitive</em></h3>
<p>Workflows in Oozie are expressed in a simple XML representation that is inspired by the process definition language JPDL. However, compared to the overall JPDL schema, the Oozie schema is much simpler and more intuitive. The key concepts in an Oozie workflow are <em>action</em> and <em>control-flow</em> nodes. Action nodes perform the workflow tasks, such as moving files, running Map/Reduce jobs, and running Hive queries. Control-flow nodes govern the progress of the workflow from action to action, enabling things like error handling, conditional execution and branching logic.</p>
<p>Together, the action and control-flow nodes are arranged in a directed acyclic graph (DAG), which represents the overall workflow. This DAG is executed by the Oozie server in a controlled-dependency&#160;manner &#8211; implying that a node shall be executed if and only if all of the nodes that it depends upon have been executed successfully. This is very similar to how one would manually implement the workflow &#8211; by initiating actions and starting follow-up actions when the previous ones are complete with the expected outcome.</p>
<p>The bottom line is that if you know the steps you need to take for managing data in your Hadoop&#160;environment, you can easily express them as a workflow and hand it off to Oozie for execution.</p>
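<p>As a rough illustration of how action and control-flow nodes fit together, here is a minimal sketch of a workflow definition, assuming the element names from the Oozie workflow schema (<code>uri:oozie:workflow:0.1</code>). The workflow name, the node names, the <code>rollup.pig</code> script, and the <code>incomingDir</code>, <code>stagingDir</code>, <code>jobTracker</code>, and <code>nameNode</code> parameters are all hypothetical; their values would be supplied as job properties when the workflow is submitted.</p>
<pre><code>&lt;workflow-app name="log-rollup" xmlns="uri:oozie:workflow:0.1"&gt;
    &lt;!-- Control-flow node: where execution begins --&gt;
    &lt;start to="stage-logs"/&gt;

    &lt;!-- Action node: move the day's logs into a staging directory --&gt;
    &lt;action name="stage-logs"&gt;
        &lt;fs&gt;
            &lt;move source="${incomingDir}" target="${stagingDir}"/&gt;
        &lt;/fs&gt;
        &lt;ok to="rollup"/&gt;
        &lt;error to="fail"/&gt;
    &lt;/action&gt;

    &lt;!-- Action node: run a (hypothetical) Pig script over the staged logs --&gt;
    &lt;action name="rollup"&gt;
        &lt;pig&gt;
            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
            &lt;script&gt;rollup.pig&lt;/script&gt;
            &lt;param&gt;INPUT=${stagingDir}&lt;/param&gt;
        &lt;/pig&gt;
        &lt;ok to="end"/&gt;
        &lt;error to="fail"/&gt;
    &lt;/action&gt;

    &lt;!-- Control-flow nodes: terminate with an error, or finish normally --&gt;
    &lt;kill name="fail"&gt;
        &lt;message&gt;Workflow failed at node [${wf:lastErrorNode()}]&lt;/message&gt;
    &lt;/kill&gt;
    &lt;end name="end"/&gt;
&lt;/workflow-app&gt;
</code></pre>
<p>Each action names the node to run next on success (<code>ok</code>) and on failure (<code>error</code>); these transitions are the edges of the workflow graph.</p>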
<h3><em>Rich Set of Features</em></h3>
<p>The core feature set of Oozie is designed to take care of the most commonly-exercised functionality for the Hadoop platform. The key objective behind these features is to ensure that anything done manually can be implemented as a workflow task to the last detail. The following list, while far from exhaustive, covers some of Oozie&#8217;s many features.</p>
<ol>
<li><strong>1. Parameterization</strong>: With parameterization support, workflows can be written once and executed many times with different parameter bindings. This allows reuse of workflows in a manner that promotes ease of maintenance and management.</li>
<li><strong>2. Fine-Grain and Coarse-Grain Notification Support</strong>: Workflows in Oozie can be configured to notify external systems at varying degrees of granularity. Notifications can be raised when a workflow changes its overall state, or when an individual action within it changes state. These notifications are implemented as HTTP GET requests which can pass extra information to the receiver, such as the job identifier. Using this mechanism, external systems can be integrated at various stages of the workflow as necessary.</li>
<li><strong>3. User Propagation</strong>: The user and group information associated with the workflow job is propagated by Oozie to the underlying action execution and cannot be overwritten. &#160;This allows Oozie to work together with Hadoop security to ensure that actions are authorized to access and manipulate data where applicable.</li>
<li><strong>4. Java Client API</strong>: A programmatic client API is provided by Oozie that can be called from external systems to better integrate with the Oozie system. This API provides the same functionality as the command line client utility.</li>
<li><strong>5. Web Services API</strong>: Oozie also provides a rich REST/JSON API for web-services integration. Clients that prefer to access Oozie via this API can directly access and manipulate workflows running on Oozie server using this interface.</li>
<li><strong>6. Built in actions</strong>: The default actions &#160;provided by Oozie cover a vast majority of the use-cases for workflows including:
<ul>
<li><em>Map/Reduce action</em>: Allows you to model Map/Reduce jobs.</li>
<li><em>Streaming Map/Reduce action</em>: Allows you to specify executable mappers and reducers that can be plugged in using the streaming support.</li>
<li><em>Pig action</em>: Allows you to run custom Pig scripts and tasks.</li>
<li><em>FS action</em>: Allows you to manipulate the Hadoop file system as necessary.</li>
<li><em>SSH action</em>: Allows you to securely execute commands over SSH connections.</li>
<li><em>Sub-Workflow action</em>: Allows a workflow to be executed within another workflow.</li>
<li><em>Java action</em>: Allows you to plug in any Java program that has a main method for direct execution.</li>
<li><em>Hive action</em>: Allows you to run Hive commands from within the workflow. This action is contributed by Cloudera.</li>
<li><em>Sqoop action</em>: Allows you to run Sqoop commands from within the workflow. This action is contributed by Cloudera.</li>
</ul>
</li>
<li><strong>7. Built in control-flow nodes</strong>: The control-flow nodes provided by Oozie allow you to create&#160;sophisticated&#160;workflow graphs. Together with built-in support for JSP Expression Language functions, the control-flow nodes can be used for creating conditional executions where necessary. These include the following:
<ul>
<li><em>Start node</em>: Indicates the starting point of a workflow.</li>
<li><em>End node</em>: Indicates a completion point of a workflow.</li>
<li><em>Kill node</em>: Allows a workflow action outcome to cause workflow termination where necessary.</li>
<li><em>Decision node</em>: Allows a branching construct similar to traditional switch-case statement using JSP Expression Language expressions as predicates.</li>
<li><em>Fork and Join node</em>: Fork node allows the splitting of a workflow execution path into multiple concurrent paths of execution.&#160;Join nodes allow the branches created by fork node to merge back into a single execution path. Fork and join nodes must be used in pairs and allow efficient parallel execution of tasks that do not have a direct-controlled dependency order.</li>
</ul>
</li>
<li><strong>8. Custom Extensions</strong>: Oozie provides an extension mechanism that allows the implementation of custom actions. This mechanism should be used when the extension cannot be modeled as a regular Java action.</li>
</ol>
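<p>To make the control-flow nodes listed above more concrete, the following sketch combines a decision node with a fork/join pair, again assuming the element names from the Oozie workflow schema (<code>uri:oozie:workflow:0.1</code>) and assuming that the <code>fs:exists</code> EL function described in the Oozie workflow specification is available in your version. The node names, the two Pig scripts, and the <code>inputDir</code>, <code>jobTracker</code>, and <code>nameNode</code> parameters are hypothetical.</p>
<pre><code>&lt;workflow-app name="parallel-rollups" xmlns="uri:oozie:workflow:0.1"&gt;
    &lt;start to="check-input"/&gt;

    &lt;!-- Decision node: run the rollups only if the input directory exists --&gt;
    &lt;decision name="check-input"&gt;
        &lt;switch&gt;
            &lt;case to="start-rollups"&gt;${fs:exists(inputDir)}&lt;/case&gt;
            &lt;default to="end"/&gt;
        &lt;/switch&gt;
    &lt;/decision&gt;

    &lt;!-- Fork node: start two independent rollups concurrently --&gt;
    &lt;fork name="start-rollups"&gt;
        &lt;path start="rollup-by-page"/&gt;
        &lt;path start="rollup-by-user"/&gt;
    &lt;/fork&gt;

    &lt;action name="rollup-by-page"&gt;
        &lt;pig&gt;
            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
            &lt;script&gt;by_page.pig&lt;/script&gt;
            &lt;param&gt;INPUT=${inputDir}&lt;/param&gt;
        &lt;/pig&gt;
        &lt;ok to="rollups-done"/&gt;
        &lt;error to="fail"/&gt;
    &lt;/action&gt;

    &lt;action name="rollup-by-user"&gt;
        &lt;pig&gt;
            &lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;
            &lt;name-node&gt;${nameNode}&lt;/name-node&gt;
            &lt;script&gt;by_user.pig&lt;/script&gt;
            &lt;param&gt;INPUT=${inputDir}&lt;/param&gt;
        &lt;/pig&gt;
        &lt;ok to="rollups-done"/&gt;
        &lt;error to="fail"/&gt;
    &lt;/action&gt;

    &lt;!-- Join node: wait for both branches before continuing --&gt;
    &lt;join name="rollups-done" to="end"/&gt;

    &lt;kill name="fail"&gt;
        &lt;message&gt;Rollup failed at node [${wf:lastErrorNode()}]&lt;/message&gt;
    &lt;/kill&gt;
    &lt;end name="end"/&gt;
&lt;/workflow-app&gt;
</code></pre>
<p>Because every <code>${...}</code> value is resolved when the job is submitted, the same definition can be reused with different parameter bindings, for example a different <code>inputDir</code> for each day&#8217;s logs.</p>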
<h2>Get up and running!</h2>
<p>Now that you have a good feel for what Oozie is and how simple it is, it is time for you to get your instance of Oozie installed and configured. Follow our <a href="https://docs.cloudera.com/display/DOC/Oozie+Installation">Quick Start Guide</a> to get up and running with Oozie in a matter of minutes. You can reach out to the Oozie user group by sending a mail to <a href="mailto:oozie-users@yahoogroups.com">oozie-users@yahoogroups.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What&#8217;s New in CDH3b2: Pig</title>
		<link>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-beta-2-pig/</link>
		<comments>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-beta-2-pig/#comments</comments>
		<pubDate>Wed, 14 Jul 2010 19:41:54 +0000</pubDate>
		<dc:creator>Carl Steinbach</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=3964</guid>
		<description><![CDATA[CDH3 beta 2 includes Pig 0.7.0, the latest and greatest version of the popular dataflow programming environment for Hadoop. In this post I&#8217;ll review some of the bigger changes that went into Pig 0.7.0, describe the motivations behind these changes, and explain how they affect users. Readers in search of a canonical list of changes [...]]]></description>
			<content:encoded><![CDATA[<p>CDH3 beta 2 includes <a href="http://hadoop.apache.org/pig/">Pig 0.7.0</a>, the latest and greatest version of the popular dataflow programming environment for Hadoop. In this post I&#8217;ll review some of the bigger changes that went into Pig 0.7.0, describe the motivations behind these changes, and explain how they affect users. Readers in search of a canonical list of changes in this new version of Pig should consult the <a href="http://archive.apache.org/dist/hadoop/pig/pig-0.7.0/RELEASE_NOTES.txt">Pig 0.7.0 Release Notes</a> as well as the list of&#160;<a href="http://wiki.apache.org/pig/Pig070IncompatibleChanges">backward incompatible changes</a>.</p>
<h2>Load-Store Redesign</h2>
<p>The biggest change to appear in Pig 0.7.0 is the <a href="http://wiki.apache.org/pig/LoadStoreRedesignProposal">complete redesign of the LoadFunc and StoreFunc interfaces</a>. The Load-Store interfaces were first introduced in version 0.1.0 and have remained largely unchanged up to this point. Pig uses a concrete instance of the LoadFunc interface to read Pig records from the underlying storage layer, and similarly uses an instance of the StoreFunc interface when it needs to write a record. Pig provides different LoadFunc and StoreFunc implementations in order to support different storage formats, and since this is a public interface users may provide their own implementations as well.</p>
<p>The primary motivation for redesigning these interfaces is to bring them into closer alignment with Hadoop&#8217;s InputFormat and OutputFormat interfaces, with the goal of making it much easier to write new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes. At the same time the new interfaces were also made a lot more powerful by providing direct access to configurations as well as the ability to selectively read individual columns.</p>
<p>In the short span of time since these new interfaces appeared, the Pig community has responded by writing a variety of custom Loaders including ones for&#160;<a href="http://issues.apache.org/jira/browse/CASSANDRA-910">Cassandra</a>, <a href="http://groups.google.com/group/project-voldemort/browse_thread/thread/1ef4cf1c3f647458/a09ba5e354feb791">Voldemort</a>, and <a href="http://issues.apache.org/jira/browse/PIG-1117">Hive&#8217;s RCFile columnar storage format</a>. It is important to note that these new plugins were written without any direct involvement from the Pig core team, which is a significant validation of the work that went into the redesign effort. A list of third-party Pig Loaders is maintained on the <a href="http://wiki.apache.org/pig/PigInteroperability">Pig Interoperability</a> page. Users who are interested in writing their own LoadFuncs or StoreFuncs should first read the updated <a href="http://wiki.apache.org/pig/Pig070LoadStoreHowTo">Load-Store HowTo</a>.</p>
<p>If you are upgrading from an earlier version of Pig you need to be aware that the new Load/Store interfaces are not backward compatible with the old interfaces. Users who have written custom LoadFuncs or StoreFuncs that work with an earlier version will need to upgrade these functions to use the new interfaces. For more details about this process please consult the <a href="http://wiki.apache.org/pig/LoadStoreMigrationGuide">Load-Store Migration Guide</a> on the Pig wiki.</p>
<h2>Use the Distributed Cache to Improve Performance</h2>
<p>Pig 0.7.0 includes a set of important performance enhancements that aim to make queries run faster by leveraging Hadoop&#8217;s <a href="http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/filecache/DistributedCache.html">Distributed Cache</a>. The key observation that motivated these changes is that Pig query plans often involve directing a large number of tasks to read the same sample of data. One can observe this access pattern in the Fragment-Replicate Join, SkewedJoin, and GroupBy operators. Earlier versions of Pig read this data directly from the underlying distributed file system, an approach that is not only inefficient but also has the potential to cause a cluster-wide failure if a large number of concurrent Map tasks swamp the NameNode with read requests. <a href="http://issues.apache.org/jira/browse/PIG-872">PIG-872</a> and <a href="http://issues.apache.org/jira/browse/PIG-1218">PIG-1218</a> remedy this problem by loading the common data into the Distributed Cache. This allows tasks to perform a local disk read instead of having to wait while the data is retrieved from HDFS, and also allows tasks that run on the same node to share the same data.</p>
<h2>Use Hadoop&#8217;s Local Mode for Pig Local Mode</h2>
<p>One of the things that has made Pig especially easy for new users to pick up is its support for a local mode that does not require a Hadoop installation. Unfortunately, maintaining this feature has turned into a major headache for the Pig developers as it requires a large body of custom code and execution paths that are not shared with the rest of the system. A direct consequence of this is that many of the new features that have been added to Pig do not work in local mode, and this has caused a lot of confusion within the Pig user community. Based on these factors the Pig developers decided that it made sense to <a href="http://issues.apache.org/jira/browse/PIG-1053">replace Pig&#8217;s custom local mode implementation with one that depends on Hadoop&#8217;s local mode</a>. This change benefits Pig users since they can now test a script in local mode and be confident that it will run correctly in distributed mode, or vice-versa. However, users should be aware that there is one unfortunate side-effect of this change: Pig now runs roughly an order of magnitude slower in local mode.</p>
<h2>Making Pig 0.7.0 Even Better</h2>
<p>Pig 0.7.0 was released in mid-May, and since that time several important patches have appeared for bugs that were found in the original release. These patches include a <a href="http://issues.apache.org/jira/browse/PIG-1428">fix that allows UDFs to access counters</a>, as well as another fix that <a href="http://issues.apache.org/jira/browse/PIG-1299">adds a counter to track the number of output rows in each output file</a>. I think you&#8217;ll be glad to hear that we have included these patches as well as others in the version of Pig 0.7.0 that is included in CDH3 beta 2.</p>
<h2>For More Information</h2>
<p>We hope you&#8217;ll give CDH and Pig a spin. The <a href="https://docs.cloudera.com/display/DOC/Hadoop+(CDH3)+Quick+Start+Guide">CDH Quick Start Guide</a> is the best place to begin, followed by the Pig installation instructions in the <a href="https://docs.cloudera.com/display/DOC/Pig+Installation">Pig Installation Guide</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-beta-2-pig/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CDH3 Beta 1 Now Available</title>
		<link>http://www.cloudera.com/blog/2010/03/cdh3-beta1-now-available/</link>
		<comments>http://www.cloudera.com/blog/2010/03/cdh3-beta1-now-available/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 15:00:25 +0000</pubDate>
		<dc:creator>Eli Collins</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=2895</guid>
		<description><![CDATA[It&#8217;s official &#8211; Cloudera&#8217;s Distribution for Hadoop Version 2, which we often shorthand as CDH2, has been released. CDH2 is the product we recommend to our current production customers. It&#8217;s a stable version that has undergone a long cycle of time in the field with a variety of customers, in addition to Cloudera&#8217;s internal QA [...]]]></description>
			<content:encoded><![CDATA[<div id="_mcePaste">
<p>It&#8217;s official &#8211; Cloudera&#8217;s Distribution for Hadoop Version 2, which we often shorthand as <a href="http://www.cloudera.com/blog/2010/03/cdh2-is-released">CDH2</a>, has been released. CDH2 is the product we recommend to our current production customers. It&#8217;s a stable version that has undergone a long cycle of time in the field with a variety of customers, in addition to Cloudera&#8217;s internal QA process.</p>
</div>
<div>
<p>And with the CDH2 release, the Cloudera engineering team is excited to start the feedback and development process for the next version of Cloudera&#8217;s Distribution for Hadoop &#8211; Version 3. CDH3 includes a <a href="http://pig.apache.org/" target="_about">Pig</a> package with additional bug fixes and performance improvements, and the <a href="http://hadoop.apache.org/hive">Hive</a> package is now based on the latest Apache release. One of the most notable aspects of CDH3 <a href="http://archive.cloudera.com/docs/cdh.html">beta 1</a> is what has not changed: CDH3 remains based on the Apache 0.20 release. However, we have already bundled many new improvements and bug fixes in CDH3. The <a href="http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.releasenotes.html">release notes</a> cover these changes in detail.</p>
</div>
<div>
<p>What the release notes don&#8217;t share, though, is what we plan on putting into upcoming CDH3 releases. Cloudera is working hard with the rest of the Apache community to deliver additional features in CDH3 including the following noteworthy items:</p>
</div>
<ul>
<li>A new <a href="http://pig.apache.org">Pig</a> package based on the latest Apache release.</li>
<li><a href="http://hbase.apache.org">HBase</a> and <a href="http://zookeeper.apache.org">ZooKeeper</a>, previously only supported&#160;as part of our contrib repository, will become first class packages in CDH3.</li>
<li><a href="http://www.slideshare.net/cloudera/hw09-security-and-api-compatibility">The security work Yahoo! is contributing</a>, which should significantly impact Hadoop adoption.</li>
</ul>
<p>As you&#8217;d expect,&#160; we&#160;will continue to test and integrate many other improvements, bug fixes and features throughout the release. Please check out <a href="http://archive.cloudera.com/docs/cdh.html">the beta</a> and <a href="http://getsatisfaction.com/cloudera/products/cloudera_cloudera_s_distribution_for_hadoop">tell us what you think</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/03/cdh3-beta1-now-available/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>CDH2 is released</title>
		<link>http://www.cloudera.com/blog/2010/03/cdh2-is-released/</link>
		<comments>http://www.cloudera.com/blog/2010/03/cdh2-is-released/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 14:59:22 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=2889</guid>
		<description><![CDATA[We&#8217;re proud to announce that Cloudera&#8217;s Distribution for Hadoop Version 2 (CDH2) is officially released. We&#8217;ve come a long way to get to a production quality release. At the beginning of September we announced the first beta of CDH2. After 6 months of additional testing we announced a release candidate. The release candidate spent over [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re proud to announce that <a href="http://archive.cloudera.com/docs/cdh.html">Cloudera&#8217;s Distribution for Hadoop Version 2</a> (CDH2) is officially released.</p>
<p>We&#8217;ve come a long way to get to a production quality release. At the beginning of September we announced <a href="http://www.cloudera.com/blog/2009/09/cdh2-clouderas-distribution-for-hadoop-2/">the first beta</a> of CDH2. After 6 months of additional testing we <a href="../blog/2010/02/cdh2-testing-heading-towards-stable/">announced a release candidate</a>. The release candidate spent over a month hardening in Cloudera&#8217;s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use &#8211; we are pleased to recommend it to all our production users.</p>
<p>CDH2 is based on Apache Hadoop 0.20 &#8211; a release that has been available for almost a year. During this time, the Apache Hadoop community has produced hundreds of bug fixes, improvements and features. Cloudera is proud to have contributed many of these and&#160;incorporated them into CDH2. &#160;For more information, please review the following resources:</p>
<ul>
<li><a href="http://archive.cloudera.com/cdh/2/hadoop-0.20.1+169.68.releasenotes.html">The release notes</a> for CDH2. All bug fixes and improvements are covered in detail.</li>
<li>For new features you&#8217;ll want to check out CDH3, <a href="../blog/2010/03/cdh3-beta1-now-available/">which is now in beta</a>.</li>
<li>For how to get started, please have a look at <a href="http://archive.cloudera.com/docs/_choosing_a_version.html">CDH documentation</a>, which includes a helpful bit on determining which version is right for you.</li>
</ul>
<p>Hadoop is a community effort. We&#8217;d like to thank everyone who contributes to Hadoop, especially the substantial contribution made by the big team at Yahoo! and all the other users who have contributed to this release. We appreciate the feedback on <a href="http://getsatisfaction.com/cloudera/products/cloudera_cloudera_s_distribution_for_hadoop">Get Satisfaction</a>, <a href="http://twitter.com/cloudera">twitter</a> and <a href="http://webchat.freenode.net/?channels=cloudera">IRC</a> (#cloudera on freenode.net). Keep it coming, and thanks for using Cloudera&#8217;s Distribution for Hadoop!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/03/cdh2-is-released/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

