<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; Blog</title>
	<atom:link href="http://www.cloudera.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Fri, 03 Sep 2010 14:00:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Tracing with Avro</title>
		<link>http://www.cloudera.com/blog/2010/09/tracing-with-avro/</link>
		<comments>http://www.cloudera.com/blog/2010/09/tracing-with-avro/#comments</comments>
		<pubDate>Fri, 03 Sep 2010 14:00:30 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4639</guid>
		<description><![CDATA[Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.
  
 In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.</strong></em></p>
<p>In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC functionality.</p>
<p>It is common knowledge that tracing in distributed systems can be difficult. In user-facing web services, a front-end function may recursively trigger several function calls to mid and back-tier services. In offline processing, data-center storage layers may distribute data across several hosts, querying one or many of them when a client requests a file. In either case, the inter-dependency of components makes it difficult to pinpoint the source of a slowdown or hang-up when they inevitably occur.</p>
<div>
<p>AvroTrace is designed as a first responder for diagnosing problems in distributed systems that use Avro for RPC transport. It has two components: a real-time monitoring dashboard and an offline trace analyzer. Both run as low-overhead Avro plugins that store and propagate tracing metadata among RPC clients and servers. The monitoring dashboard is accessible via a web interface on any Avro server, delivering a “snapshot” of the most recent RPC activity. The offline analysis tool offers a basic interface for collecting, aggregating, and analyzing this data to identify problem spots. It is largely based on <a href="http://research.google.com/pubs/pub36356.html">Google’s Dapper</a> tracing infrastructure, which is itself inspired by <a href="http://www.x-trace.net/wiki/doku.php">X-Trace</a> and other academic tracing research.</p>
<p>Below is an example trace analysis of a recursive RPC call pattern. In the example application, one remote call, getFile(), triggers two other RPCs, getFileContents() and getFileMeta(). Avro’s tracing has detected this particular pattern and offers a dashboard view summarizing average timing and payload data. It also shows detailed graphs for one of the specific nodes in this pattern, getFileContents(), presenting a visual history of timing (top) and payload (bottom) analytics.</p>
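<p>The way tracing metadata travels with each call can be sketched in a few lines of Java. This is a minimal illustration of Dapper-style propagation, not AvroTrace's actual code: the <code>Span</code> class, its field names, and the <code>root()</code> and <code>child()</code> helpers are all hypothetical. Each call triggered by getFile() inherits the caller's trace ID and records the caller's span ID as its parent, which is what lets an analyzer rebuild the call pattern afterwards.</p>

```java
import java.util.ArrayList;
import java.util.List;

public class TraceSketch {
    /** Tracing metadata carried with one RPC; all names are hypothetical. */
    static final class Span {
        final long traceId;       // shared by every call in one request tree
        final long spanId;        // unique to this call
        final long parentSpanId;  // 0 marks the root call
        final String message;     // RPC name, e.g. "getFile"

        Span(long traceId, long spanId, long parentSpanId, String message) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.message = message;
        }
    }

    // Simple counter standing in for whatever ID scheme a real tracer uses.
    private static long nextId = 1;

    /** Starts a new trace at the first RPC of a request. */
    static Span root(String message) {
        long id = nextId++;
        return new Span(id, id, 0, message);
    }

    /** Propagates the caller's trace ID to an RPC that the caller triggers. */
    static Span child(Span parent, String message) {
        return new Span(parent.traceId, nextId++, parent.spanId, message);
    }

    public static void main(String[] args) {
        Span getFile = root("getFile");
        List<Span> spans = new ArrayList<>();
        spans.add(getFile);
        spans.add(child(getFile, "getFileContents"));
        spans.add(child(getFile, "getFileMeta"));
        for (Span s : spans) {
            System.out.println(s.message + " trace=" + s.traceId
                    + " span=" + s.spanId + " parent=" + s.parentSpanId);
        }
    }
}
```

<p>Grouping the recorded spans by trace ID and following the parent links recovers the getFile() pattern described above.</p>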
<p>Turnkey tracing is just one of many reasons to use Avro.  I recently became a committer on the Avro project and I look forward to supporting and improving trace functionality in the coming months!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png"><img class="aligncenter size-full wp-image-4657" src="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png" alt="" width="700" /></a></p>
<h5 style="text-align: center"><em>*Click on any of the graphs or stats for a larger version</em></h5>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/tracing-with-avro/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Infochimps&#8217; President, Philip Kromer, Interviewed Regarding Hadoop and Hadoop World</title>
		<link>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/</link>
		<comments>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 14:00:06 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4632</guid>
		<description><![CDATA[Excitement is building as Hadoop World nears and we are sitting down with some of our presenters to ask them a few questions regarding their presentations and how they are using Hadoop within their organization. Here we speak with Philip Kromer, President of Infochimps, who  answers  questions regarding his presentation, how Hadoop is used in [...]]]></description>
			<content:encoded><![CDATA[<p>Excitement is building as Hadoop World nears, and we are sitting down with some of our presenters to ask them a few questions about their presentations and how they are using Hadoop within their organizations. Here we speak with Philip Kromer, President of <a href="http://infochimps.org/">Infochimps</a>, who answers questions about his presentation, how Hadoop is used in his business, and what he aims to get out of <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World</a>. Philip’s presentation at Hadoop World is about the development of a data marketplace and commoditization, and Infochimps’ chimpanzee-style approach to data processing. Attend <a href="http://hadoopworld2010.eventbrite.com/">Hadoop World</a> on October 12<sup>th</sup> in New York to hear more from Philip and to talk with him.</p>
<h2><strong>What can attendees expect to learn about Hadoop from your presentation at Hadoop World?</strong></h2>
<p>We&#8217;re now able to quantify aspects of human behavior never before accessible. Twitter, the News stream, the Smart Grid, are exquisite lab instruments for measuring &#8216;Conversation&#8217;, &#8216;Interest&#8217;, &#8216;Activity&#8217;. What&#8217;s more, with enough data, machine-learning algorithms and big data tools let us expose insight using only the *structure*, not the content, of the data. The massive quantity and connectivity required demand industrial-strength tools such as Hadoop.</p>
<p>We do *all* our data processing in high level tools (chiefly Pig and Wukong) &#8212; &#8220;black boxes with flexible glue&#8221;. We use &#8216;programmer fun&#8217; + &#8216;programmer time&#8217; as our primary development  metrics. Together, writing simple loosely coupled scripts lets us run the fast experiment-driven design cycles that a lean startup demands. It has also let us grow our own talent and recruit outside CS (physicists, in particular, dream in map reduce). I think this approach should have strong appeal to small- and medium-sized businesses, or anyone looking for low barrier-to-adoption of Hadoop.</p>
<h2><strong>Do you have Hadoop in production use today? </strong></h2>
<p>We have Hadoop in heavy production use for ad-hoc analysis and for automated processes digesting terabytes of data.</p>
<h2><strong>Can you describe some use cases for Hadoop in your business?</strong></h2>
<p>We have scraped data from around the web, principally Social Networks. We use Hadoop for processing it on its own and to mash it up with other open &amp; commercial datasets.</p>
<p>Examples:</p>
<ul>
<li>We have a collection of 3 billion tweets (Twitter messages) from 60+ million users that we tokenize into 16B+ usages of 65M terms &#8212; more than a terabyte of data on its own. Using Pig and Wukong we can identify whom to follow, understand how events and news stories resonate, and even find dates.</li>
<li>MLB has released a dataset describing the trajectory and full game state for every pitch of every game for the past several seasons. Smashing this against the hourly weather data produces a laboratory with the potential to describe the physics of a knuckleball or performance by pitcher&#8217;s age vs. game-time temperature.</li>
</ul>
<h2><strong>How do you support Hadoop?</strong></h2>
<p>Operationally,  we use the Amazon cloud and a collection of Chef recipes (that we&#8217;ve open-sourced). These let us spin up, use, and spin down clusters of one to hundreds of machines, using either local (persistent) HDFS or just push/pull from Amazon S3.</p>
<p>We have also been supporting Hadoop by giving back to the Hadoop open-source community.</p>
<ul>
<li>Wukong (our Ruby-language toolkit for Hadoop), which we believe is the easiest and most fun way to write map-reduce programs.</li>
<li>At Hadoop World we&#8217;ll be announcing Chimpmark, a target benchmark for implementers and users of big data tools. It&#8217;s a collection of large-scale datasets, accompanying challenges, and reference implementations that let you profile, tune, and more deeply understand your Hadoop system.</li>
<li>ClusterChef, the cluster management toolkit I described above.</li>
</ul>
<h2><strong>How has Hadoop improved your business?</strong></h2>
<p>Most of the stuff we use Hadoop for would be otherwise impossible.</p>
<h2><strong>What are you hoping to get out of your time at Hadoop World?</strong></h2>
<ul>
<li>Learn ideas.</li>
<li>Popularize and receive feedback on the development of a data marketplace.</li>
<li>Hear where the world of Big Data is going.</li>
</ul>
<p style="text-align: center">At <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World</a> you can hear more from Philip Kromer as well as any of the thirty-five other presenters! <a href="http://hadoopworld2010.eventbrite.com/">Click here to register right away!</a><br />
<a href="http://hadoopworld2010.eventbrite.com/"><img class="size-full wp-image-4403  aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Register for Hadoop Training in New York and Get into Hadoop World for Free!</title>
		<link>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/</link>
		<comments>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 14:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4618</guid>
		<description><![CDATA[That’s right, sign up for any of the training courses surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.
If you are a manager trying to [...]]]></description>
			<content:encoded><![CDATA[<p>That’s right, sign up for any of the <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training courses</a> surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.</p>
<p>If you are a manager trying to decide whether Hadoop is an appropriate technology for your organization, <a href="http://www.eventbrite.com/event/762237874">Hadoop Essentials for Managers</a> will answer your questions. We will show you when using Hadoop is appropriate, what Hadoop is being used for in a range of industries, how Hadoop fits into your existing environment and what you need to know in order to deploy it within your organization.</p>
<p>Why not turn your Hadoop World trip into a multi-day Hadoop learning extravaganza by attending one of our two-day sessions? Both the <a href="http://www.eventbrite.com/event/762320120">developer</a> and <a href="http://www.eventbrite.com/event/762677188">administrator</a> training courses culminate in an exam which, when passed, confers Cloudera Certified Hadoop Developer or Administrator status.</p>
<p>For developers who already understand Hadoop and are ready to use Hive and Pig for data analysis, there is a <a href="http://www.eventbrite.com/event/762318114">two-day class</a> that teaches you how to process data using filters, joins, user-defined functions and more.</p>
<p>For those looking to deploy HBase, consider our one-day HBase <a href="http://www.eventbrite.com/event/762317111">training session</a>. Learn how to use HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. This class covers HBase architecture, data modeling, and the Java API as well as some advanced topics and best practices.</p>
<p>If you’re a developer who is completely new to Hadoop, we have put together a <a href="http://www.eventbrite.com/event/762326138">course</a> that will provide you with a solid foundation in large scale data processing using MapReduce and Hadoop. This course is purposely offered the day before Hadoop World, so that while in attendance you will be able to better grasp the topics at the conference with your fresh Hadoop knowledge. Once you have taken this course and are comfortable with Hadoop, feel free to also enroll in a <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training</a> course followed by Certification to document your new-found Hadoop knowledge.</p>
<p>For developers who wish to simplify interacting with Hadoop, <a href="http://www.eventbrite.com/event/764021208">Cloudera HUE</a> provides back- and front-end APIs to deliver a rich, web-based, graphical user experience. This <a href="http://www.eventbrite.com/event/764021208">class</a> covers using the HUE APIs to develop your own rich, graphical applications built on top of the HUE platform.</p>
<p>Once again, you will receive free entry to <a href="http://www.eventbrite.com/event/764021208">Hadoop World</a> if you are registered in any of the <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training sessions</a> surrounding the event! Don’t miss out on this opportunity to broaden your knowledge, and we hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2010: Speaker Highlights</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/#comments</comments>
		<pubDate>Mon, 30 Aug 2010 15:00:58 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopworld]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4436</guid>
		<description><![CDATA[Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises.  Below are [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises.  Below are three examples of leading enterprises that will present how Hadoop has impacted their businesses.</p>
<p style="text-align: center"><a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/"><img class="size-full wp-image-4403 aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></a></p>
<p style="text-align: justify"><strong><a href="http://www.ge.com">GE, Product Manager, Linden Hillenbrand</a></strong>, will be talking about how Hadoop has improved GE’s Marketing &amp; Communications functions. One capability GE has implemented is assessing the external perception of GE (positive, neutral, or negative) through various marketing campaigns.</p>
<p style="text-align: justify"><strong><a href="http://www.bofa.com">Managing Director, Big Data &amp; Analytics at Bank of America, Abhishek Mehta</a></strong>, will present “The Business of Big Data.” This presentation will discuss how an organization with established and legacy infrastructure, technology and business processes can adopt Hadoop technologies and processes to find groundbreaking solutions to known problems.</p>
<p style="text-align: justify"><strong><a href="http://www.ebay.com">eBay Engineering Director of Analytical Platform Development, Anil Madan</a></strong>, is presenting “Hadoop at eBay.” One of eBay’s largest assets is the large amount of user data it has collected. By sourcing huge volumes of this data into the HDFS cluster and running clickstream and transactional data analysis, eBay gets a better understanding of user behavior as well as search quality.</p>
<p style="text-align: justify">Hadoop World is a great way to learn how Hadoop is being used to power today’s modern enterprises. These presentations will help you understand how Hadoop improves your data storage and processing environment and directly impacts your business.</p>
<p style="text-align: justify">Don’t miss out! <a href="http://hadoopworld2010.eventbrite.com/">Register now</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What’s New in Apache Hadoop 0.21</title>
		<link>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/</link>
		<comments>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 23:53:29 +0000</pubDate>
		<dc:creator>Tom White</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4519</guid>
		<description><![CDATA[Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it&#8217;s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org/common/docs/r0.21.0/">Apache Hadoop 0.21.0</a> was released on August 23, 2010. The last major release was 0.20.0 in April 2009, so it&#8217;s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (<a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310240&amp;fixfor=12313563">Common</a>, <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310942&amp;fixfor=12314046">HDFS</a>, <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310941&amp;fixfor=12314045">MapReduce</a>), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn&#8217;t suitable for production use.</p>
<p>With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can&#8217;t hope to cover everything, so please consult the release notes (<a href="http://hadoop.apache.org/common/docs/r0.21.0/releasenotes.html">Common</a>, <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/releasenotes.html">HDFS</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/releasenotes.html">MapReduce</a>) and the change logs (<a href="http://hadoop.apache.org/common/docs/r0.21.0/changes.html">Common</a>, <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/changes.html">HDFS</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/changes.html">MapReduce</a>) for the full details. Also, please let us know in the comments about any features, improvements, or bug fixes that you are excited about.</p>
<p>You can download Hadoop 0.21.0 from an <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/">Apache Mirror</a>. Thanks to everyone who contributed to this release!</p>
<p><span id="more-4519"></span></p>
<h2>Project Split</h2>
<p>Organizationally, a significant chunk of work has arisen from the project split, which transformed a single Hadoop project (called Core) into three constituents: <a href="http://hadoop.apache.org/common">Common</a>, <a href="http://hadoop.apache.org/hdfs">HDFS</a>, and <a href="http://hadoop.apache.org/mapreduce">MapReduce</a>. HDFS and MapReduce both have dependencies on Common, but (other than for running tests) MapReduce has no dependency on HDFS. This separation emphasizes the fact that MapReduce can run on alternative distributed file systems (although HDFS is still the best choice for sheer throughput and scalability), and it has made following development easier since there are now separate lists for each subproject. There is still a single release tarball, although it is laid out a little differently from previous releases: it has a subdirectory containing the source files for each subproject.</p>
<p>From a user&#8217;s point of view little has changed as a result of the split. The configuration files are divided into <em>core-site.xml</em>, <em>hdfs-site.xml</em>, and <em>mapred-site.xml</em> (this was supported in 0.20 too), and the control scripts are now broken into three (<a href="https://issues.apache.org/jira/browse/HADOOP-4868">HADOOP-4868</a>): in addition to the <em>bin/hadoop</em> script, there is a <em>bin/hdfs</em> script and a <em>bin/mapreduce</em> script for running HDFS and MapReduce daemons and commands, respectively. The <em>bin/hadoop</em> script still works as before, but issues a deprecation warning. Finally, you will need to set the <code>HADOOP_HOME</code> environment variable to have the scripts work smoothly.</p>
<h2>Common</h2>
<p>The 0.21.0 release is technically a minor release (traditionally Hadoop 0.x releases have been major, and have been allowed to <a href="http://wiki.apache.org/hadoop/Roadmap">break compatibility</a> with the previous 0.x-1 release) so it is API compatible with 0.20.2. To make the intended stability and audience of a particular API in Hadoop clear to users, all Java members with public visibility have been marked with <strong>classification annotations</strong> to say whether they are <code>Public</code>, or <code>Private</code> (there is also <code>LimitedPrivate</code> which signifies another, named, project may use it), and whether they are <code>Stable</code>, <code>Evolving</code>, or <code>Unstable</code> (<a href="https://issues.apache.org/jira/browse/HADOOP-5073">HADOOP-5073</a>). Only elements marked as <code>Public</code> appear in the user Javadoc (<a href="http://hadoop.apache.org/common/docs/r0.21.0/api/index.html">Common</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/index.html">MapReduce</a>; note that HDFS is all marked as private since it is accessed through the <code>FileSystem</code> interface in Common). The classification interface is described in detail in <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/towards_enterpriseclass_compat.html">Towards Enterprise-Class Compatibility for Apache Hadoop</a> by Sanjay Radia.</p>
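<p>To make the idea concrete, here is a minimal sketch of how audience and stability markers can drive a decision such as "does this type appear in the user Javadoc?". The annotations are declared locally as stand-ins so the example compiles without Hadoop on the classpath; they are assumptions for illustration and are not Hadoop's own annotation types. The two sample classes and the <code>appearsInUserJavadoc()</code> helper are likewise hypothetical.</p>

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class ClassificationSketch {
    // Local stand-ins for audience markers (assumed names, not Hadoop's types).
    @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Retention(RetentionPolicy.RUNTIME) @interface Private {}

    // Local stand-ins for stability markers.
    @Retention(RetentionPolicy.RUNTIME) @interface Stable {}
    @Retention(RetentionPolicy.RUNTIME) @interface Unstable {}

    /** An API that users may rely on across compatible releases. */
    @Public @Stable
    static class UserFacingApi {}

    /** An implementation detail that may change without notice. */
    @Private @Unstable
    static class InternalHelper {}

    /** Mirrors the rule in the text: only Public elements reach the user docs. */
    static boolean appearsInUserJavadoc(Class<?> type) {
        return type.isAnnotationPresent(Public.class);
    }

    public static void main(String[] args) {
        System.out.println("UserFacingApi documented: "
                + appearsInUserJavadoc(UserFacingApi.class));
        System.out.println("InternalHelper documented: "
                + appearsInUserJavadoc(InternalHelper.class));
    }
}
```

<p>Runtime retention is used here only so the check can be done with reflection; a documentation tool could equally read the markers at build time.</p>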
<p>This release has seen some significant improvements to <strong>testing</strong>. The <strong>Large-Scale Automated Test Framework</strong>, known as Herriot (<a href="https://issues.apache.org/jira/browse/HADOOP-6332">HADOOP-6332</a>), allows developers to <a href="http://wiki.apache.org/hadoop/HowToUseSystemTestFramework">write tests</a> that run against a real (possibly large) cluster. While there are only a dozen or so tests at the moment, the intention is that more tests will be written over time so that regression tests can be shared and run against new Hadoop release candidates, thereby making Hadoop upgrades more predictable for users.</p>
<p>Hadoop 0.21 also introduces a <strong>fault injection framework</strong>, which uses AOP to <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/faultinject_framework.html">inject faults</a> into a part of the system that is running under test (e.g. a datanode), and asserts that the system reacts to the fault in the expected manner. Complementing fault injection is mock object testing, which tests code &#8220;in the small&#8221;, at the class-level rather than the system-level. Hadoop has a growing number of <strong>Mockito-based tests</strong> for this purpose (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1050">MAPREDUCE-1050</a>). </p>
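<p>The shape of a fault injection test can be sketched without any framework. The example below swaps in a plain wrapper object instead of AOP, so it shows the pattern (inject a fault, then assert that the caller reacts as expected) and not how Hadoop's framework weaves faults into running code. <code>BlockWriter</code>, <code>FaultyWriter</code>, and <code>writeWithRetry()</code> are hypothetical names invented for this illustration.</p>

```java
import java.io.IOException;

public class FaultInjectionSketch {
    /** Stand-in for a component under test, such as a datanode write path. */
    interface BlockWriter {
        void write(byte[] data) throws IOException;
    }

    /** Healthy implementation that only counts successful writes. */
    static final class CountingWriter implements BlockWriter {
        int writes = 0;

        @Override
        public void write(byte[] data) {
            writes++;
        }
    }

    /** Wrapper that injects a fixed number of failures before delegating. */
    static final class FaultyWriter implements BlockWriter {
        private final BlockWriter delegate;
        private int faultsLeft;

        FaultyWriter(BlockWriter delegate, int faults) {
            this.delegate = delegate;
            this.faultsLeft = faults;
        }

        @Override
        public void write(byte[] data) throws IOException {
            if (faultsLeft > 0) {
                faultsLeft--;
                throw new IOException("injected fault");
            }
            delegate.write(data);
        }
    }

    /** Client logic being verified: retry up to maxAttempts, return attempts used. */
    static int writeWithRetry(BlockWriter writer, byte[] data, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                writer.write(data);
                return attempt;
            } catch (IOException e) {
                last = e; // remember the fault and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        CountingWriter healthy = new CountingWriter();
        int attempts = writeWithRetry(new FaultyWriter(healthy, 2), new byte[] {1}, 3);
        System.out.println("attempts=" + attempts + " writes=" + healthy.writes);
    }
}
```

<p>A mock-object test works at the same class-level scale, but replaces the collaborator entirely instead of wrapping a real one.</p>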
<p>Among the many other improvements and new features, a couple of small ones stand out: the ability to <strong>retrieve metrics and configuration</strong> from Hadoop daemons by accessing the URLs <em>/metrics</em> and <em>/conf</em> in a browser (<a href="https://issues.apache.org/jira/browse/HADOOP-5469">HADOOP-5469</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6408">HADOOP-6408</a>).</p>
<h2>HDFS</h2>
<p>Support for <strong>appends</strong> in HDFS has had a rocky history. The feature was introduced in the 0.19.0 release, and then disabled in 0.19.1 due to <a href="https://issues.apache.org/jira/browse/HADOOP-5224">stability issues</a>. The good news is that the append call is back in 0.21.0 with a brand new implementation (<a href="https://issues.apache.org/jira/browse/HDFS-265">HDFS-265</a>), and may be accessed via <code>FileSystem</code>&#8217;s <code>append()</code> method. Closely related&mdash;and more interesting for many applications, such as HBase&mdash;is the <code>Syncable</code> interface that <code>FSDataOutputStream</code> now implements, which brings sync semantics to HDFS (<a href="https://issues.apache.org/jira/browse/HADOOP-6313">HADOOP-6313</a>).</p>
<p>Hadoop 0.21 has a <strong>new filesystem API</strong>, called <code>FileContext</code>, which makes it easier for applications to work with multiple filesystems (<a href="https://issues.apache.org/jira/browse/HADOOP-4952">HADOOP-4952</a>). The API is not in widespread use yet (e.g. it is not integrated with MapReduce), but it has some features that the old <code>FileSystem</code> interface doesn&#8217;t, notably support for <strong>symbolic links</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-6421">HADOOP-6421</a>, <a href="https://issues.apache.org/jira/browse/HDFS-245">HDFS-245</a>).</p>
<p>The <strong>secondary namenode has been deprecated</strong> in 0.21. Instead you should consider running a <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_user_guide.html#Checkpoint+Node">checkpoint node</a> (which essentially acts like a secondary namenode) or a <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_user_guide.html#Backup+Node">backup node</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-4539">HADOOP-4539</a>). By using a backup node you no longer need an NFS-mount for namenode metadata, since it accepts a stream of filesystem edits from the namenode, which it writes to disk.</p>
<p>New in 0.21 is the <strong>offline image viewer</strong> (oiv) for HDFS image files (<a href="https://issues.apache.org/jira/browse/HADOOP-5467">HADOOP-5467</a>). This <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_imageviewer.html">tool</a> allows admins to analyze HDFS metadata without impacting the namenode (it also works with older versions of HDFS). There is also a <strong>block forensics tool</strong> for finding corrupt and missing blocks from the HDFS logs (<a href="https://issues.apache.org/jira/browse/HDFS-567">HDFS-567</a>).</p>
<p>Modularization continues in the platform with the introduction of <strong>pluggable block placement</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-385">HDFS-385</a>), an expert-level interface for developers who want to try out new placement algorithms for HDFS. </p>
<p>Other notable new features include:</p>
<ul>
<li>Support for efficient <strong>file concatenation in HDFS</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-222">HDFS-222</a>)</li>
<li><strong>Distributed RAID filesystem</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-503">HDFS-503</a>) &#8211; an erasure coding filesystem running on HDFS, designed for archival storage since the replication factor is reduced from 3 to 2, while keeping the likelihood of data loss about the same. (Note that the RAID code is a MapReduce contrib module since it has a dependency on MapReduce for generating parity blocks.)</li>
</ul>
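<p>The parity idea behind the RAID filesystem can be illustrated with the simplest erasure code, XOR. The sketch below is an assumption-laden toy and not the HDFS-503 implementation: the stripe of three equal-sized blocks, the class name, and both method names are made up for the example. It computes one parity block for a stripe and then rebuilds a lost block from the survivors plus the parity.</p>

```java
import java.util.Arrays;

public class XorParitySketch {
    /** XORs all blocks in a stripe together; blocks must have equal length. */
    static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= block[i];
            }
        }
        return p;
    }

    /** Rebuilds the block at index 'missing' from the other blocks and the parity. */
    static byte[] recover(byte[][] blocks, int missing, byte[] parity) {
        byte[] out = parity.clone();
        for (int j = 0; j < blocks.length; j++) {
            if (j == missing) {
                continue; // this is the block we no longer have
            }
            for (int i = 0; i < out.length; i++) {
                out[i] ^= blocks[j][i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] stripe = {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9},
        };
        byte[] p = parity(stripe);
        byte[] rebuilt = recover(stripe, 1, p);
        System.out.println("parity  = " + Arrays.toString(p));
        System.out.println("rebuilt = " + Arrays.toString(rebuilt));
    }
}
```

<p>A single XOR parity block can repair only one lost block per stripe, which is why a scheme like this is paired with reduced, not eliminated, replication.</p>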
<h2>MapReduce</h2>
<p>The biggest user-facing change in MapReduce is the status of the <strong>new API</strong>, sometimes called &#8220;context objects&#8221;. The new API is now more broadly supported since the MapReduce libraries (in <code>org.apache.hadoop.mapreduce.lib</code>) have been ported to use it (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-334">MAPREDUCE-334</a>). The examples all use the new API too (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-271">MAPREDUCE-271</a>). Nevertheless, to give users more time to migrate to the new API, the old API has been un-deprecated in this release (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1735">MAPREDUCE-1735</a>), which means that existing programs will compile without deprecation warnings.</p>
<p>The <code>LocalJobRunner</code> (for trying out MapReduce programs on small local datasets) has been enhanced to make it more like running MapReduce on a cluster. It now supports the <strong>distributed cache</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-476">MAPREDUCE-476</a>), and can <strong>run mappers in parallel</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1367">MAPREDUCE-1367</a>).</p>
<p>Distcp has seen a number of small improvements too, such as <strong>preserving file modification times</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-5620">HADOOP-5620</a>), <strong>input file globbing</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-5472">HADOOP-5472</a>), and <strong>preserving the source path</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-642">MAPREDUCE-642</a>).</p>
<p>Continuing the testing theme, this release is the first to feature <strong>MRUnit</strong>, a contrib module that helps users write unit tests for their MapReduce jobs (<a href="https://issues.apache.org/jira/browse/HADOOP-5518">HADOOP-5518</a>).</p>
<p>Other new contrib modules include <strong>Rumen</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>) and <strong>Mumak</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-728">MAPREDUCE-728</a>), tools for modelling MapReduce. The two are designed to work together: Rumen extracts job data from historical logs, which Mumak then uses to simulate MapReduce applications and clusters on a cluster. <a href="http://developer.yahoo.net/blogs/hadoop/2010/04/gridmix3_emulating_production.html">Gridmix3</a> is also designed to work with Rumen traces. The <strong>job history log analyzer</strong> is another tool that gives information about MapReduce cluster utilization (<a href="https://issues.apache.org/jira/browse/HDFS-459">HDFS-459</a>).</p>
<p>On the job scheduling front there have been updates to the Fair Scheduler, including <strong>global scheduling</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-548">MAPREDUCE-548</a>), <strong>preemption</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-551">MAPREDUCE-551</a>), and support for <strong>FIFO pools</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-706">MAPREDUCE-706</a>). Similarly, the Capacity Scheduler now supports <strong>hierarchical queues</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-824">MAPREDUCE-824</a>), and admin-defined <strong>hard limits</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-532">MAPREDUCE-532</a>). There is also a brand new scheduler, the Dynamic Priority Scheduler, which <a href="https://issues.apache.org/jira/browse/HADOOP-4768?focusedCommentId=12763348&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12763348">dynamically changes queue shares using a pricing model</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-4768">HADOOP-4768</a>).</p>
<p><strong>Smarter speculative execution</strong> has been added to all schedulers using a more robust algorithm, called <a href="http://www.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia_html/">Longest Approximate Time to End (LATE)</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-2141">HADOOP-2141</a>).</p>
<p>Finally, a couple of smaller changes:</p>
<ul>
<li><strong>Streaming combiners</strong> are now supported, so that the <code>-combiner</code> option may specify any streaming script or executable, not just a Java class. (<a href="https://issues.apache.org/jira/browse/HADOOP-4842">HADOOP-4842</a>)</li>
<li>On the successful completion of a job, the MapReduce runtime creates a <strong><em>_SUCCESS</em> file</strong> in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-947">MAPREDUCE-947</a>)</li>
</ul>
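<p>As a rough illustration of how a downstream application might use the marker, the Python sketch below checks an output directory for a <code>_SUCCESS</code> file before trusting the part files. It uses a local directory as a stand-in for HDFS, and the names <code>job_output_is_complete</code> and <code>demo</code> are hypothetical, not part of any Hadoop API.</p>

```python
import tempfile
from pathlib import Path


def job_output_is_complete(output_dir):
    """Return True if the directory holds the _SUCCESS marker file.

    Assumption: output_dir is a local path standing in for an HDFS
    output directory; a real check would go through an HDFS client.
    """
    return (Path(output_dir) / "_SUCCESS").is_file()


def demo():
    """Simulate a job output directory before and after the marker appears."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "job-output"
        out.mkdir()
        (out / "part-r-00000").write_text("key\tvalue\n")
        before = job_output_is_complete(out)  # marker not written yet
        (out / "_SUCCESS").touch()            # stands in for the runtime's marker
        after = job_output_is_complete(out)
        return before, after
```

<p>A consumer would typically poll such a check and only read the part files once it returns true.</p>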
<h2>What&#8217;s Not In</h2>
<p>Finally, it bears mentioning what didn&#8217;t make it into 0.21.0. The biggest omission is the new Kerberos authentication work from Yahoo! While a majority of the patches are included, security is turned off by default, and is unlikely to work if enabled (certainly there is no guarantee that it will provide any level of security, since it is incomplete). A full working security implementation will be available in 0.22, and also the next version of <a href="http://www.cloudera.com/hadoop/">CDH</a>.</p>
<p>Also, Sqoop, which was initially developed as a Hadoop contrib module, is not in 0.21.0, since it was moved out to become a standalone open source project <a href="http://wiki.github.com/cloudera/sqoop/">hosted on github</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Hadoop for Fraud Detection and Prevention</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/#comments</comments>
		<pubDate>Wed, 25 Aug 2010 05:27:20 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[fraud]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4478</guid>
		<description><![CDATA[Learn about fraud and how to prevent it with Hadoop]]></description>
			<content:encoded><![CDATA[<p>Fraud has multiple meanings and the term can be easily abused.  The definition of fraud has changed many times over the years and is as elusive as fraud itself.  The modern legal definition of fraud usually contains a few elements that have to be proven in court, and it depends on the state or country.  For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.  A more general definition may contain up to <a href="http://en.wikipedia.org/wiki/Fraud#Elements_of_fraud">9 elements</a>.</p>
<p>
From the statistical or technical perspective, fraud is a rare event that has a significant financial impact on the organization.</p>
<p>
Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding), that it is intentional (there is no “accidental” fraud), and that it causes significant damage to the defrauded party (otherwise why bother).  Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, so it is difficult to build a predictive model, and (b) fraud has a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.</p>
<h3>Fraud and Rare Events</h3>
<p>By definition, fraud is an unexpected or rare event with significant financial or other damage.  Fraud assumes that the fraudster has some prior information about how the current system works, including previous successful and unsuccessful fraud cases and possibly the fraud detection mechanisms.  This breaks the standard statistical modeling assumption that observations are independent and identically distributed (i.i.d.), which makes building a reliable statistical model difficult.  Often the fraudster works in the same industry that the fraud detection is supposed to protect, is intimately familiar with the fraud detection methods, and is actively trying to avoid detection by masquerading.</p>
<p>
The rare-event detection problem also applies to online advertising and marketing, particularly when predicting “long tail” events, and to terrorism detection.</p>
<p>
One common example of fraud is associated with the <a href="http://en.wikipedia.org/wiki/Taleb_distribution" target="_blank">Taleb distribution</a>, where a high probability of a small gain overshadows a small probability of a large loss that more than outweighs the gains.  Relatively long periods of slightly better than moderate gains are interrupted by a rare event of large losses.  It is easy to defraud investors by presenting the results of a partial analysis that excludes the “rare events”.</p>
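<p>To make the arithmetic concrete, here is a small Python sketch that compares the expected return of a complete analysis with a partial one that leaves out the rare event. The probabilities and payoffs are invented for illustration only and do not come from any real investment.</p>

```python
def expected_return(outcomes):
    """Expected value of a list of (probability, payoff) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


# Hypothetical numbers: a 99% chance of gaining 1 unit and
# a 1% chance of losing 200 units in any given period.
full_analysis = [(0.99, 1.0), (0.01, -200.0)]

# The same strategy as it looks when the rare event is left out.
partial_analysis = [(1.0, 1.0)]

full_value = expected_return(full_analysis)
partial_value = expected_return(partial_analysis)
```

<p>With these made-up numbers the partial analysis shows a steady gain of one unit per period, while the complete analysis has a negative expected return of about -1.01 units.</p>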
<h3>Fraud Prevention</h3>
<p>Since fraud is so hard to prove in court, most organizations and individuals try to prevent fraud from happening by blanket measures.  These include limiting the amount of damage the fraudster can inflict on the organization as well as early detection of fraud patterns.  For example, credit card companies can cut credit card limits across the board in anticipation of a few fraud cases.  Advertisers can block advertising campaigns with a low number of qualifying events.  And anti-terrorism agencies can prevent people with bottles of pure water from boarding planes.  These actions are often at odds with the company's efforts to attract more customers and result in general dissatisfaction.  To the rescue come new technologies like Hadoop, influence diagrams and Bayesian networks.  The latter two are computationally expensive (inference in them is NP-hard in general) but are more accurate and predictive.</p>
<h3>Why Hadoop?</h3>
<p>Hadoop is a distributed system for processing large amounts of data.  At the recent Hadoop Summit 2010, Yahoo, Facebook, and other companies announced that they currently process a few TBs of data per day and that the volumes are <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_omalley.html" target="_blank">growing at exponential rates</a>.  Hadoop can be vital for solving the fraud detection problem because:</p>
<ol>
<li>Sampling does not work for rare events since the chance of missing a positive fraud case leads to significant deterioration of model quality.</li>
<li>Hadoop can solve much harder problems by leveraging multiple cores across thousands of machines and searching through much larger problem domains.</li>
<li>Hadoop can be combined with other tools to manage moderate to low response latency requirements.</li>
</ol>
<p>
Let’s go through these reasons one by one.  Sampling is a common technique for modeling rare events.  One of the problems with sampling is that we cannot afford to throw away rare positive cases.  Even in a stratified or proportional sampling scheme one has to retain all positive cases, since the model accuracy heavily depends on them (one can usually discard some negative cases though).  As a result, the system still has to go through the whole dataset to separate the positive cases from the negative ones.</p>
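<p>The sampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not a method from any particular library: the function name <code>downsample_negatives</code>, the record layout and the optional inverse-probability weight are all assumptions made for the example.</p>

```python
import random


def downsample_negatives(records, negative_rate, seed=0):
    """Keep every positive (fraud) record and a random fraction of negatives.

    records: iterable of (features, is_fraud) pairs (hypothetical layout).
    Returns (features, is_fraud, weight) tuples; the weight is the inverse
    sampling probability, so kept negatives can stand in for dropped ones.
    """
    if not (negative_rate > 0.0 and 1.0 >= negative_rate):
        raise ValueError("negative_rate must be in (0, 1]")
    rng = random.Random(seed)
    sample = []
    for features, is_fraud in records:  # note: still a full pass over the data
        if is_fraud:
            sample.append((features, True, 1.0))
        elif negative_rate > rng.random():
            sample.append((features, False, 1.0 / negative_rate))
    return sample


# Toy dataset: 1,000 records, 5 of which are marked as fraud.
records = [({"id": i}, i % 200 == 0) for i in range(1000)]
sample = downsample_negatives(records, negative_rate=0.1)
```

<p>Note that the loop still visits every record, which is the point made above: even with sampling, the whole dataset has to be scanned once.</p>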
<p>
Hadoop is known for its number-crunching power.  Nothing can compare with the throughput of thousands of machines, each of which has multiple cores.  As was reported recently at the Hadoop Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8 to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the same time.  This allows either (a) looking through larger periods of time to incorporate events across a larger time frame or (b) taking more sources of information into account.  It is quite common among social network companies to comb through Twitter and blog posts in search of relevant data.</p>
<p>
Finally, one of the fraud prevention problems is latency.  The agencies want to react to an event as soon as possible, often within a few minutes of the event.  Yahoo recently reported that it can adjust its behavioral model in response to a user click event within 5-7 minutes across several hundred million customers and billions of events per day.  Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds so that they can be analyzed using MapReduce.</p>
<p>
Often fraud detection is akin to “finding a needle in a haystack”.  One has to go through mountains of relevant and seemingly irrelevant information, build dependency models, evaluate the impact and thwart the fraudster's actions.  Hadoop helps with finding patterns by processing mountains of information on thousands of cores in a relatively short amount of time.</p>
<h3>Where to look next?</h3>
<p>Techniques for fraud detection are industry-specific as a rule and are often kept confidential, since they obviously represent valuable information for potential fraudsters.  Moreover, fraud detection techniques are usually a moving target, since fraudsters quickly adjust to new fraud detection mechanisms.</p>
<p>
One of the most publicized technical frauds is click fraud in online advertising.  Advertisers are often charged on a per-click basis (so-called PPC campaigns), so the traffic provider, such as a search web site, has a clear incentive to inflate the number of clicks.  (Advertisers can also be charged on a per-conversion basis, which we will cover shortly, but a different type of fraud emerges there, where the advertiser tries to conceal the conversions.)  Additionally, an advertiser's competitor may be incentivized to inflate the number to skew the original advertiser's margin.  This can be achieved by a human or software agent that generates extra traffic and clicks on the competitor's site.  Fraud management companies like <a href="http://www.fraudwall.com/" target="_blank">Anchor Intelligence</a> and <a href="http://www.clickforensics.com/" target="_blank">Click Forensics</a> estimate that approximately 20% to 30% of all clicks are fraudulent.  How do we know that a click is fraudulent?</p>
<p>
Decline in the number of conversions: first and most important, if the return on your ad is normally positive (that is, you are making a profit on it) and all of a sudden it dives into negative numbers, this could be a sign of click fraud in action.  Click fraud causes extra clicks on your ad with no actual purchases, and your conversion rate will fall accordingly.</p>
<p>
An abnormal number of clicks from the same IP address or a pattern in the access times: although this is the most obvious and easily identified form of click fraud, it is amazing how many fraudsters still use this method, particularly for quick attacks.  They may choose to strike over a long weekend when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted.  Some of these repeated clicks might be unintentional, for example when a user tries to reload a page.</p>
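<p>A first pass at spotting this pattern is a simple count of clicks per IP address. The Python sketch below is illustrative only: the function name <code>suspicious_ips</code>, the log layout and the threshold are assumptions, and a real system would also have to allow for innocent repeats such as page reloads.</p>

```python
from collections import Counter


def suspicious_ips(clicks, max_clicks_per_ip):
    """Return {ip: count} for addresses that clicked more than the threshold.

    clicks: iterable of (ip_address, timestamp) pairs (hypothetical layout).
    max_clicks_per_ip: threshold chosen by the analyst; there is no
    universally correct value.
    """
    counts = Counter(ip for ip, _ in clicks)
    return {ip: n for ip, n in counts.items() if n > max_clicks_per_ip}


# Made-up click log using documentation-only IP ranges.
click_log = [
    ("192.0.2.7", "2010-08-21T01:00:00"),
    ("192.0.2.7", "2010-08-21T01:00:05"),
    ("198.51.100.3", "2010-08-21T01:02:00"),
    ("192.0.2.7", "2010-08-21T01:00:10"),
    ("203.0.113.9", "2010-08-21T01:03:30"),
    ("192.0.2.7", "2010-08-21T01:00:15"),
]
flagged = suspicious_ips(click_log, max_clicks_per_ip=3)
```

<p>On a cluster, the same count maps naturally onto MapReduce: the mapper emits the IP address as the key and the reducer sums the clicks per key.</p>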
<p>
A large “abandonment rate”, or number of visitors who leave your site quickly: another indication of click fraud can be a pattern of visitors clicking on your ad, spending the minimum amount of time on your site required by your PPC search engine to establish it as a valid click (usually 30 seconds or more), and then leaving without having left the landing page at all.</p>
<p>
A large number of impressions without follow-through clicks on your ad: if you notice that there are a lot more impressions (views) without corresponding clicks, this could indicate impression fraud.  Artificial inflation of your ad impressions may cause your clickthrough rates to drop below the Google minimum, and your ad will be disabled.  Until you realize this, your competitors have free rein to use your keywords, sometimes at bargain prices.  In addition, your relevancy ratings for search engines may drop as they record numerous impressions but no interest shown via visits to other parts of your website, which could lead to a shutdown of your campaign.</p>
<p>
Abnormally high clicks and impressions on affiliate websites: although affiliates are sometimes involved in conducting click fraud schemes, they can also be victims of click fraud themselves.  If one of their competitors uses this same method of excessive clicks and impressions on an affiliate’s site, the PPC search engine will soon notice an abnormally high payment to a certain affiliate and perhaps go as far as canceling that affiliate’s account, even though he or she was not engaging in any form of click fraud.</p>
<p>
A large number of clicks coming from countries outside of your normal market area — using IP geo-location services, you can identify which country an IP address is probably coming from.</p>
<p>
In the case of performance-based advertising, the advertiser himself is interested in concealing some of the traffic, not inflating it.  Since most performance-based measurement is based on beacons or pixels placed on the advertiser's conversion page, the advertiser has an incentive to (temporarily) block the traffic from the beacon or to remove it from their web site completely.</p>
<p>
Fraud is prevalent in the telecom industry.  One of the leading commercially available fraud detection products is the <a href="http://h20208.www2.hp.com/cms/solutions/ci-b/cv/frm.jsp" target="_blank">HP FMS system</a>, on which the author had the pleasure of working personally.  The types of telecom fraud include:</p>
<p>
Subscription fraud — involves the acquisition of telecommunications services using stolen or false credentials and/or identity with no intention of paying. With subscription fraud, not only do service providers lose revenue, but also individual consumers are vulnerable to having their identity stolen and credit rating tarnished.</p>
<p>
Technical/network fraud — occurs when someone uses equipment or technology to gain access to a service without paying. Fraudulent calls are typically billed to the legitimate owner of the line or service.  Wireless examples include cloning of cell phones or subscriber identity module (SIM) cards. Fixed line examples include clip on or line tapping, private branch exchange (PBX) hacking and calling card fraud. Prepaid services also have a large exposure to fraud with terminal tampering via magnetic strips or SIM chips, or recharging with stolen credit card numbers.</p>
<p>
Insider fraud — occurs when individuals inside the operator provide fraudulent access to networks or otherwise thwart the ability of the operator to be paid for services used.</p>
<p>
Handset abuse — is what takes place when stolen or lost handsets are used to consume telecommunications services that are in turn paid for by the service provider.  This is an expensive liability for carriers who absorb the costs.</p>
<p>
Social engineering — is an effective fraud technique in which people unwittingly help perpetrators by providing sensitive data, illicit access or simply forwarding their calls without ever knowing they have done anything wrong.</p>
<p>
All these patterns can be detected with special MapReduce pattern detection techniques.  Flume offers low-latency stream processing capabilities.</p>
<p>
Needless to say, fraudsters also explore the potential market and invent new ways to commit fraud.  One example is <a href="http://www.clickmonkeys.com/about" target="_blank">Click Monkeys</a>, which deploys a vessel with animals off the coast of California to generate seemingly random traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hadoop Administrator Training Comes to London</title>
		<link>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/</link>
		<comments>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 15:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4417</guid>
		<description><![CDATA[Cloudera’s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Cloudera’s <a href="http://www.eventbrite.com/directory?q=cloudera&amp;loc=london&amp;page=1">Hadoop Training and Certification</a> for <a href="http://www.eventbrite.com/event/762684209">System Administrators</a> has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.</p>
<p style="text-align: justify">
<p style="text-align: justify">Enter the code &#8220;london_10pct&#8221; when <a href="http://www.eventbrite.com/event/762684209">registering</a> and receive a 10% discount!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/what-is-hadoop/"><img class="size-medium wp-image-4448 aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hadoop+elephant_rgb-300x107.png" alt="" width="370" height="130" /></a></p>
<p style="text-align: justify">Hadoop is a rapidly growing field. Prove your expertise by attaining certification from the world’s foremost Hadoop training and consulting company.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Improving Hotel Search: Hadoop @ Orbitz Worldwide</title>
		<link>http://www.cloudera.com/blog/2010/08/improving-hotel-search-hadoop-orbitz-worldwide/</link>
		<comments>http://www.cloudera.com/blog/2010/08/improving-hotel-search-hadoop-orbitz-worldwide/#comments</comments>
		<pubDate>Mon, 23 Aug 2010 14:57:01 +0000</pubDate>
		<dc:creator>John Kreisa</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4435</guid>
		<description><![CDATA[This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide . You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers [...]]]></description>
			<content:encoded><![CDATA[<p><strong><em>This post was contributed by Jonathan Seidman from <a title="Orbitz" href="http://www.orbitz.com/" target="_blank">Orbitz</a>. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide.</em> <em>You can hear more from Jonathan at <a title="Hadoop World" href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/" target="_blank">Hadoop World</a> October 12th in NYC.</em></strong></p>
<p>Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.</p>
<p>Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of unstructured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running <a title="CDH" href="http://www.cloudera.com/hadoop/" target="_blank">Cloudera’s Distribution for Hadoop</a> (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short period of time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.</p>
<p>One of the applications that Hadoop facilitates is an effort to improve the hotel search results. Currently, when a user performs a hotel search on the Orbitz site the ranking of the search results returned (at least for larger markets) is influenced by a set of parameters manually tuned by an administrator. This leads to the question: can we use automation to optimize the ranking of hotels in order to increase bookings? In other words, can we identify consumer preferences in order to determine the best performing hotels to display to users, thus leading to more bookings? Further, for markets that are too small to be manually managed, can we implement a method to automatically rank hotel search results?</p>
<p>To answer these questions, we turned to machine learning techniques, specifically using a trained classifier to determine a ranking of hotels that more closely follows consumer preferences. Performing this analysis requires having data on consumer interactions when shopping for hotels. Fortunately, we have a rich source of this session data in web analytics logs that are collected as users browse the sites. Unfortunately, although parts of this data are loaded into the data warehouse, it turned out that the specific fields we require are not loaded because of space restrictions. Our only alternative was to turn to the raw logs to extract the required fields. Just to further complicate things, the available archive of these logs only went back several days, which was not nearly enough data to perform the required analysis.</p>
<p>Hadoop of course provided a solution to the storage problem by providing a repository where we could download and archive logs. The next step was to extract the data we needed from the raw logs. We began with a set of shell and Perl scripts that were run manually to serially process logs on the local file system. This process worked fine for a while, but as the size of the data grew it was obvious that this process wouldn’t scale. Once again Hadoop provided a solution. Since we were already storing the logs in HDFS, by moving the most time-consuming portions of the data extraction into MapReduce, we were able to dramatically decrease processing time.  A test run against a small subset of data showed a greater than fourfold improvement for the MapReduce processing vs. the scripts. Now that we’ve accumulated several terabytes of data the performance disparity would be even more dramatic, assuming we even had access to a storage system large enough to hold all of the data for manual processing.</p>
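<p>To give a feel for what moving field extraction into MapReduce can look like, here is a minimal Hadoop Streaming style mapper sketched in Python. It is not the code used at Orbitz: the tab-separated log layout, the field names and the functions <code>map_line</code> and <code>run_mapper</code> are all hypothetical.</p>

```python
# Hypothetical raw log layout (tab-separated). The real Orbitz log format
# is not described in this post.
FIELD_NAMES = ("timestamp", "session_id", "hotel_id", "action")


def map_line(line):
    """Extract (hotel_id, action) from one raw log line, or None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELD_NAMES):
        return None
    record = dict(zip(FIELD_NAMES, parts))
    return record["hotel_id"], record["action"]


def run_mapper(lines, out):
    """Write one tab-separated key/value pair per valid input line.

    lines: any iterable of text lines; out: any object with a write() method.
    """
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            out.write("%s\t%s\n" % pair)
```

<p>Under Hadoop Streaming, a script like this would call <code>run_mapper(sys.stdin, sys.stdout)</code>, so that each map task reads its split of the raw logs from standard input and emits only the required fields.</p>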
<p>After the data is extracted through MapReduce, we load the resulting records into a set of Hive tables. Hive allows us to perform ad hoc querying and further analysis of this data, such as:</p>
<ul>
<li>Obtaining useful metrics, many of which were unavailable with our existing data stores.</li>
<li>Creating data exports for further analysis with R scripts, allowing us to derive more complex statistics and visualizations of our data.</li>
<li>Aggregating data for import into our data warehouse for creation of new data cubes, providing analysts access to data unavailable in existing data cubes.</li>
</ul>
<p> In addition to assisting with hotel rank optimization, a few examples of other ways Hadoop is being applied at Orbitz Worldwide are:</p>
<ul>
<li>Measuring page download performance: using web analytics logs as input, a set of MapReduce scripts is used to derive detailed client-side performance metrics which allow us to track trends in page download times.</li>
<li>Searching production logs: an effort is underway to utilize Hadoop to store and process our large volume of production logs, allowing developers and analysts to perform tasks such as troubleshooting production issues.</li>
<li>Data aggregation for the data warehouse: further exploration is being done to expand the use of Hadoop and Hive as a means to aggregate previously unavailable data for import into our data warehouse, making it available for access by our existing data analysis tools.</li>
<li>Cache analysis: extraction and aggregation of data to provide input to analyses intended to improve the performance of data caches utilized by our web sites.</li>
</ul>
<p> Again, these are just a few examples of how Hadoop is being utilized at Orbitz Worldwide, and we’re still just scratching the surface. Each week seems to bring a new team with a big data challenge to be solved by Hadoop, a trend which I expect to continue as more teams discover the possibilities that Hadoop provides to store and process data.</p>
<p>I’d like to thank my co-workers who have all made significant contributions to the work discussed here, including Rob Lancaster, Ramesh Venkataramaiah, Wai Gen Yee, Steve Hoffman, Matt Haddock and Andrew Yates.  Also a big thanks to Vice President of Technology Roger Liew, who was an early and enthusiastic champion of Hadoop.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/improving-hotel-search-hadoop-orbitz-worldwide/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hadoop World: NYC &#8211; Training</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoopworld-training/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoopworld-training/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 15:00:23 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[HBase]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[hadoopworld]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4354</guid>
		<description><![CDATA[Hadoop Training surrounding Hadoop World: NYC.]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.</p>
<p style="text-align: justify">We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects we have the training you are looking for.</p>
<p style="text-align: center"><img class="size-full wp-image-4403    aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></p>
<p style="text-align: justify">Our top-notch Hadoop training includes full access to Hadoop World, free of charge.</p>
<p style="text-align: justify">Available Training Sessions include:<span id="more-4354"></span></p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 11:</span></h2>
<h3 style="text-align: justify"><em>Introduction to Hadoop</em>: <a href="http://www.eventbrite.com/event/762326138">http://www.eventbrite.com/event/762326138</a></h3>
<p style="text-align: justify">This one-day course provides a solid foundation for those seeking to understand large-scale data processing with MapReduce and Hadoop. This session is designed for developers, analysts or system administrators who are new to Hadoop. This course provides the prerequisite knowledge for the later classes: Developer Training, Administrator Training or Analyzing Data with Hive and Pig.</p>
<h3 style="text-align: justify"><em>Hadoop Essentials For Managers</em>: <a href="http://www.eventbrite.com/event/762237874">http://www.eventbrite.com/event/762237874</a></h3>
<p style="text-align: justify">This one-day course will give decision-makers the information they need to know about Apache Hadoop, answering questions such as:</p>
<ul style="text-align: justify">
<li>When is Hadoop appropriate?</li>
<li>What are people using Hadoop for?</li>
<li>How does Hadoop fit into our existing environment?</li>
<li>What do I need to know about choosing Hadoop?</li>
</ul>
<h3 style="text-align: justify"><em>Cloudera HUE SDK Training</em>: <a href="http://www.eventbrite.com/event/764021208">http://www.eventbrite.com/event/764021208</a></h3>
<p style="text-align: justify">Cloudera Hue provides developers with back-end APIs to simplify interacting with Hadoop and front-end APIs to deliver rich, web-based, graphical user experiences. For this training, developers should have experience building web apps using modern MVC frameworks and Ajax. Experience with Python and Django is a strong plus. In this session we spend half the day covering the following topics, and the other half of the day interactively building applications with the Cloudera Hue team.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 13 &amp; 14:</span></h2>
<h3 style="text-align: justify"><em>Developer Training &amp; Certification</em>: <a href="http://www.eventbrite.com/event/762320120">http://www.eventbrite.com/event/762320120</a></h3>
<p style="text-align: justify">In this two-day hands-on session, developers learn the MapReduce framework and how to write programs against its API. In addition to learning how to write individual MapReduce jobs, we discuss design techniques for larger workflows. This course also covers advanced skills for debugging MapReduce programs and optimizing their performance. At the end of the course, attendees have the option to take a certification exam documenting their understanding of the concepts taught during the training session.</p>
<h3 style="text-align: justify"><em>Administrator Training &amp; Certification:</em> <a href="http://www.eventbrite.com/event/762677188">http://www.eventbrite.com/event/762677188</a></h3>
<p style="text-align: justify">This two-day hands-on session covers the system administration aspects of Hadoop, from installation and configuration to load balancing and tuning, including diagnosing and solving problems in your deployment. At the end of the course, attendees have the option of taking a certification exam documenting their understanding of the concepts taught at the training session.</p>
<h3 style="text-align: justify"><em>Analyzing Data with Hive and Pig:</em> <a href="http://www.eventbrite.com/event/762318114">http://www.eventbrite.com/event/762318114</a></h3>
<p style="text-align: justify">Cloudera’s two-day hands-on course on Hive and Pig is designed for people who have a basic understanding of how Hadoop works and want to use these languages to analyze their data. Hive makes Hadoop accessible to users who already know SQL; Pig is similar to popular scripting languages. This course teaches you how to process data by using filters, joins, user-defined functions and more.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 15:</span></h2>
<h3 style="text-align: justify"><em>HBase Training</em>: <a href="http://www.eventbrite.com/event/762317111">http://www.eventbrite.com/event/762317111</a></h3>
<p style="text-align: justify">This one-day hands-on course gives you the necessary knowledge for using HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. This class covers the HBase architecture, data model, and Java API as well as advanced topics and best practices. This course is for developers who already have a basic understanding of Hadoop (Java experience is recommended).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoopworld-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop/HBase Capacity Planning</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/#comments</comments>
		<pubDate>Tue, 17 Aug 2010 18:43:11 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[sizing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4324</guid>
		<description><![CDATA[Hadoop  and HBase are gaining popularity due to their flexibility and  tremendous work that has been done to simplify their installation and  use.  This blog is to provide guidance in sizing your first Hadoop/HBase  cluster.  First, there are significant differences in Hadoop and HBase  usage.  Hadoop MapReduce is primarily an [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop  and HBase are gaining popularity due to their flexibility and  tremendous work that has been done to simplify their installation and  use.  This blog is to provide guidance in sizing your first Hadoop/HBase  cluster.  First, there are significant differences in Hadoop and HBase  usage.  Hadoop MapReduce is primarily an analytic tool to run analytic and data  extraction queries over <em>all of your data</em>, or at least a significant portion of them (data is a plural of datum).  HBase is much better for real-time <em>read/write/modify access to tabular data</em>.  Both applications are  designed for high concurrency and large data sizes.  For a general  discussions about Hadoop/HBase architecture and differences please refer  to Cloudera, Inc. [<a href="https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise" target="_blank">https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise</a><em>, </em><a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase" target="_blank">http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase</a>], or Lars George blogs [<a href="http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html" target="_blank">http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html</a>].  We expect a new edition of the Tom White&#8217;s Hadoop book [<a href="http://www.hadoopbook.com" target="_blank">http://www.hadoopbook.com</a>] and a new HBase book in the near future as well.</p>
<p><div>Hadoop core is a file system, called HDFS, and the  actual MapReduce implementation that can be used to compute on top of  the HDFS.  Since we are talking about data, the first crucial parameter  is how much disk space we need on all of the Hadoop nodes to store all  of your data and what compression algorithm you are going to use to  store the data.  For the MapReduce components an important consideration  is how much computational power you need to process the data and  whether the jobs you are going to run on the cluster is CPU or I/O  intensive.  An example of a CPU intensive job is image processing while  an I/O intensive job is a simple data loading or aggregation.  Finally,  HBase is mainly memory driven and we need to consider the data access  pattern in your application and how much memory you need so that the  HBase nodes do not swap the data too often to the disk.  Most of the  written data end up in memstores before they finally end up on disk, so  you should plan for more memory in write-intensive workloads like web  crawling.  A good application for HBase is a low latency key-based  retrieval and storage of semi-structured data like web crawls or  dimensional data for joining with a DW fact table, particularly if the  data need update time tracking and can be easily grouped into column  families.</div>
<p><div>General Cloudera hardware recommendations are given <a href="http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations" target="_blank">here</a>.  This blog will focus on more detailed capacity planning issues.</div>
<p><h3>Network</h3>
<div>While network latency, throughput and bandwidth are very often overlooked when starting to work with Hadoop, the network is bound to become a limiting factor as your cluster grows.  Each node in a Hadoop cluster needs to be able to communicate with every other node with low latency and high throughput, at the very least to fetch the relevant data.  Besides, if nodes are not able to communicate with the master node, the master node will automatically consider them dead and delist them, which increases the load on the remaining nodes.  Hadoop will work with an off-the-shelf TCP/IP network.</div>
<p><div>Network  load depends on the nature of analytical computations in the cluster.   One simple application that requires a lot of communication between  nodes is <a href="http://sortbenchmark.org" target="_blank">sorting</a>.  In fact, <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html" target="_blank">TeraSort</a> is a good test to detect network issues in the cluster.</div>
<p><div>A typical configuration is to organize the nodes into racks, each with a 1GE Top Of Rack (TOR) switch.  The racks are typically interconnected by one or more low-latency, high-throughput dedicated Layer-2 10GE core switches.  Many customers are happy with ~40 node clusters that fit into one rack with a typical 48-port switch.  Even if all of your nodes fit into one rack, if you plan to scale beyond one rack Cloudera recommends starting with at least two racks to enforce proper practices and network topology scripting.</div>
<p><div>Network problems can manifest themselves indirectly.  A good practical test is to run a network-intensive application like terasort, which sorts 10M 100-byte records (the specific parameters can be adjusted to your cluster size), on your cluster.  On a 100-node cluster with quad dual-core CPU hardware the running time should be roughly within 10 minutes (one of our customers sorted 1TB in 6 minutes on a 76-node cluster; the numbers are likely to go down with new 12-core CPU machines).  If you see &#8220;Bad connect ack with firstBadLink&#8221;, &#8220;Bad connect ack&#8221;, &#8220;No route to host&#8221;, or &#8220;Could not obtain block&#8221; IO exceptions under heavy loads, chances are these are due to a bad network.  Even one slow network card on one of the nodes can slow total job execution by as much as a factor of 3-4, since job completion is limited by the slowest task.  These problems can also appear &#8216;intermittent&#8217; under heavy loads, but usually go away with proper network configuration and tuning.</div>
<p><div>Network connection to outside systems is important for loading data into HDFS and for interoperability.  Some companies prefer to have a dedicated high-bandwidth network for loading the data (as opposed to just using a VLAN).</div>
<p><h3>Memory</h3>
<div>HBase is a very memory-hungry application.  Each node in an HBase installation, called a RegionServer (RS), keeps a number of regions, or chunks of your data, in memory (if caching is enabled).  Ideally, the whole table would be kept in memory, but this is not possible with a TB dataset.  Typically, a single RS can handle a few hundred regions of 1 or 2GB each (these are configurable parameters).  The number of HBase nodes and memory requirements should be planned accordingly.  From our experience, the memory requirement is at least 4GB/RS for any decent load, but it depends significantly on your application load and access pattern.</div>
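<div>The region arithmetic above can be turned into a rough node count.  The following is only a sketch under stated assumptions: the function name, the 200 regions per server (one reading of &#8220;a few hundred&#8221;) and the 10TB example table are illustrative choices, not Cloudera recommendations.</div>

```python
import math


def region_servers_needed(table_size_gb, region_size_gb=1.0, regions_per_server=200):
    """Estimate (regions, servers) for a table of the given size.

    regions_per_server=200 is an assumed value standing in for
    "a few hundred regions"; tune it to your own measurements.
    """
    regions = math.ceil(table_size_gb / region_size_gb)
    servers = max(math.ceil(regions / regions_per_server), 1)
    return regions, servers


# Hypothetical example: a 10 TB table split into 2 GB regions.
regions, servers = region_servers_needed(10 * 1024, region_size_gb=2.0)
print(regions, servers)  # prints: 5120 26
```

<div>The result is a lower bound on node count from region capacity alone; request load and the access pattern may call for more nodes.</div>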
<p><div>For Hadoop MapReduce, you want to allocate somewhere between 1GB and 2GB of memory per task, on top of the memory allocated for HBase.  On large clusters, you should also plan for a slight overhead in both the task memory and the number of simultaneously open tasktracker connections, controlled by <em>tasktracker.http.threads</em> and <em>mapred.reduce.parallel.copies</em>, since more node-to-node connections must be served as the cluster grows.</div>
<p><div>Both Hadoop and HBase memory problems manifest as slowness of the whole system, since neither system was designed to rely on swapping.  It is recommended to discourage swapping on HBase nodes (set <em>vm.swappiness</em> to 0 or 5 in <em>/etc/sysctl.conf</em>) and to enable GC logging (add &#8220;<em>-Xloggc:/var/log/hbase/gc-hbase.log -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime</em>&#8221; to the JVM opts) to look for large GC pauses in the log.  GC pauses longer than 60 seconds can cause an RS to go offline (even worse problems can occur if you run ZooKeeper (ZK) on the same node and it becomes unresponsive), but even pauses of around 1 second usually lead to noticeable responsiveness problems.  For the HBase daemons, RS and ZK, Cloudera also recommends switching to the CMS GC (add &#8220;<em>-XX:+UseConcMarkSweepGC -XX:-CMSIncrementalMode</em>&#8221; to the JVM opts).  There is also work under way to develop <a href="http://www.managedruntime.org" target="_blank">pauseless JVMs</a>.</div>
<p><div>If a Hadoop node runs an HBase RS daemon together with a Hadoop TaskTracker (TT) daemon, Cloudera recommends reducing the maximum number of map/reduce tasks with the <em>mapred.tasktracker.{map,reduce}.tasks.maximum</em> parameters.  You can start with 1-2 map/reduce tasks per tasktracker and slowly increase the number until you see a degradation in HBase performance.</div>
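<div>As a rough upper bound for that tuning, the per-node memory figures in this section (at least 4GB for the RS, 1-2GB per task) can be combined into a simple budget.  This is a minimal sketch: the function name and the 2GB reserve for the OS and other daemons are assumptions made for illustration, not figures from this post.</div>

```python
def max_task_slots(node_ram_gb, hbase_heap_gb=4.0, per_task_gb=2.0, os_reserve_gb=2.0):
    """Upper bound on concurrent map/reduce tasks that fit in node memory.

    os_reserve_gb is an assumed allowance for the OS and other daemons.
    """
    free_gb = node_ram_gb - hbase_heap_gb - os_reserve_gb
    return max(int(free_gb // per_task_gb), 0)


# Hypothetical 24 GB node co-hosting an HBase RegionServer.
print(max_task_slots(24))                   # prints: 9
print(max_task_slots(24, per_task_gb=1.0))  # prints: 18
```

<div>The memory bound only caps the search; the starting point of 1-2 tasks and the gradual increase described above still apply.</div>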
<p><div>Often network and memory problems manifest themselves first in ZK [<a href="http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A15" target="_blank">http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A15</a>].  ZK is a distributed lock system and is often called a &#8220;canary&#8221; of HBase.</div>
<p><div>Use a tool such as <em>vmstat</em> or Ganglia to monitor memory status on the RS nodes.  Some JVM GC information can be gathered through the metrics interface, which is served by Jetty at <em>&lt;hadoop/hbase-web-ui&gt;/metrics</em>, for example <em><a href="http://node:50060/metrics" target="_blank">http://node:50060/metrics</a></em>, if this is properly configured in <em>hadoop-metrics.properties</em>.</div>
<p><div>Keep in mind that even if the system does not throw OOM exceptions, OS and disk I/O performance may be compromised when the system is low on available memory: the JVM is under GC pressure, and less memory is available to the OS for buffering I/O (&#8220;memory cached&#8221;) to speed up other operations.</div>
<p><h3>Disk</h3>
<div>First, Hadoop requires at least two locations for storing its files: <em>mapred.local.dir</em>, where MapReduce stores intermediate files, and <em>dfs.data.dir</em>, where HDFS stores its data (there are other locations as well, like <em>hadoop.tmp.dir</em>, where Hadoop and its components store their temporary data).  Both locations can span multiple partitions.  While the two locations can be placed on physically different partitions, Cloudera recommends configuring them across the same set of partitions to maximize disk-level parallelism (this might not be an issue if the number of disks is much larger than the number of cores).</div>
<p><div>The sizing guide for HDFS is very simple: each file has a default replication factor of 3, and you need to leave approximately 25% of the disk space for intermediate shuffle files.  So you need about 4 times the raw size of the data you will store in HDFS (3 replicas divided by the 75% of disk space that remains usable).  However, files are rarely stored uncompressed and, depending on the file content and the compression algorithm, we have seen compression ratios of up to 10-20 for text files stored in HDFS.  So the actual raw disk space required is only about 30-50% of the original uncompressed size.  Compression also helps in moving the data between different systems, e.g. Teradata and Hadoop.</div>
<p><div>HBase stores the regions in HFiles.  During a major compaction, however, the data for a given region may temporarily double in size.  In addition to HFile storage, there is a small overhead due to write-ahead logs (WALs), which ideally should be a small portion of the total data size.  Cloudera recommends a 30-50% overhead in terms of free space for HFiles.</div>
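<div>The HDFS rule of thumb above is easy to check with a few lines of arithmetic.  The sketch below uses the figures from the text (replication factor 3, 25% of disk kept for shuffle files); the function name and the 100TB example are illustrative assumptions.</div>

```python
def raw_disk_needed_tb(data_tb, replication=3, shuffle_fraction=0.25, compression_ratio=1.0):
    """Raw disk (TB) needed to hold data_tb of uncompressed input in HDFS.

    shuffle_fraction is the share of disk kept free for intermediate files;
    compression_ratio is uncompressed size divided by compressed size.
    """
    stored_tb = data_tb / compression_ratio
    return stored_tb * replication / (1.0 - shuffle_fraction)


# Hypothetical 100 TB of uncompressed data.
print(raw_disk_needed_tb(100))                          # prints: 400.0
print(raw_disk_needed_tb(100, compression_ratio=10.0))  # prints: 40.0
```

<div>At a compression ratio of 10, the 100TB example needs 40TB of raw disk, which is 40% of the uncompressed size.</div>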
<p><div>While  you can run Hadoop MapReduce with only 5-10% of the disk space left, the  performance will be compromised due to fragmentation.  Disk performance  can be up to 77% slower due to fragmentation and other issues compared  to the &#8220;empty disk&#8221; [<a href="http://www.eecs.harvard.edu/vino/fs-perf/papers/keith_a_smith_thesis.pdf" target="_blank">http://www.eecs.harvard.edu/vino/fs-perf/papers/keith_a_smith_thesis.pdf</a>].  With a disk more than 80% full you also run the risk of running out of disk space on an individual mount.</div>
<p><h3>CPU</h3>
<div>Cloudera recommends a total of 8 or 12 cores per node, and typically the number of cores would be equal to or slightly larger than the number of spindles.  Ideally, the total number of mappers and reducers should equal the total number of hyperthreads &#8211; 2 (the 2 are reserved for daemons and OS processing), with the ratio of mappers to reducers slightly skewed towards mappers, as the reducers tend to spend more time waiting for the mappers.  The importance of CPU power increases with CPU-intensive jobs and when using more compute-intensive compression like BZip2.</div>
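<div>The slot rule above can be sketched as a small calculation.  The text only says the split should be slightly skewed towards mappers, so the 60% map share used here is an assumption, as are the function name and the 16-hyperthread example.</div>

```python
def task_slots(hyperthreads, map_percent=60, reserved=2):
    """Split (hyperthreads - reserved) task slots between mappers and reducers.

    map_percent=60 is an assumed value for "slightly skewed towards mappers".
    Integer arithmetic is used so the split rounds up in favor of mappers.
    """
    total = max(hyperthreads - reserved, 0)
    maps = (total * map_percent + 99) // 100  # ceiling of total * map_percent / 100
    reduces = total - maps
    return maps, reduces


# Hypothetical node with 8 cores and hyperthreading (16 hyperthreads).
print(task_slots(16))  # prints: (9, 5)
```

<div>The two numbers would be candidate values for the map and reduce variants of <em>mapred.tasktracker.{map,reduce}.tasks.maximum</em>, to be lowered on nodes that also run an HBase RS.</div>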
<p><div>A typical configuration may be found <a href="http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations" target="_blank">here</a>.</div>
<p><h3>Summary</h3>
<table border="1" cellspacing="0" cellpadding="2" width="80%">
<tbody>
<tr>
<td valign="top"></td>
<td valign="top">Network</td>
<td valign="top">Memory</td>
<td valign="top">Disk</td>
<td valign="top">CPU</td>
<td valign="top"># of nodes</td>
</tr>
<tr>
<td valign="top">HDFS</td>
<td valign="top">1GE TOR, 10GE core</td>
<td valign="top"></td>
<td valign="top">8-10 spindles/node</td>
<td valign="top"></td>
<td valign="top">enough nodes to fit the data</td>
</tr>
<tr>
<td valign="top">Hadoop MapReduce</td>
<td valign="top">1GE TOR, 10GE core</td>
<td valign="top">1-2 GB/task</td>
<td valign="top"># of spindles = # of cores</td>
<td valign="top">8-12 cores/node, # of tasks = # of hyperthreads &#8211; 2</td>
<td valign="top"></td>
</tr>
<tr>
<td valign="top">HBase</td>
<td valign="top">1GE TOR, 10GE core</td>
<td valign="top">at least 4GB/node</td>
<td valign="top"></td>
<td valign="top">8-12 cores/node, reduce # of tasks if running with Hadoop DN/TT</td>
<td valign="top">enough nodes to fit all regions and serve requests</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
