<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera</title>
	<atom:link href="http://www.cloudera.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Wed, 08 Sep 2010 14:00:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Flume community update: September 2010</title>
		<link>http://www.cloudera.com/blog/2010/09/flume-community-update-september-2010/</link>
		<comments>http://www.cloudera.com/blog/2010/09/flume-community-update-september-2010/#comments</comments>
		<pubDate>Wed, 08 Sep 2010 14:00:19 +0000</pubDate>
		<dc:creator>Jonathan Hsieh</dc:creator>
				<category><![CDATA[Flume]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4698</guid>
		<description><![CDATA[The past month has been exciting and productive for the community using and developing Cloudera&#8217;s Flume!  This young system is a core part of Cloudera&#8217;s Distribution for Hadoop (CDH) that is responsible for streaming data ingest.  There has been a great influx of interest and many contributions, and in this post we will provide a quick summary [...]]]></description>
			<content:encoded><![CDATA[<p>The past month has been exciting and productive for the community using and developing Cloudera&#8217;s Flume!  This young system is a core part of Cloudera&#8217;s Distribution for Hadoop (CDH) that is responsible for streaming data ingest.  There has been a great influx of interest and many contributions, and in this post we will provide a quick summary of this month&#8217;s new developments. First, we&#8217;re happy to announce the availability of <strong>Flume v0.9.1</strong> and we will describe some of its updates.  Second, we&#8217;ll talk about some of the exciting <strong>new integration features</strong> coming down the pipeline.  Finally we will briefly mention some <strong>community growth</strong> statistics, as well as some recent and upcoming talks about Flume.</p>
<p><span style="font-size: 13.3333px;"><strong>Flume v0.9.1</strong></span></p>
<p>Flume v0.9.1 is now available in both tarball and packaged forms. This version resolves 63 issues and contains several key improvements and bug fixes.  Much of this release is focused on improving the stability of Flume&#8217;s internals to help users quickly get Flume up and running and to help developers build extensions to Flume.</p>
<p>You can download the new release as an update through your Red Hat RPM or Debian DEB based package manager. Or, you can download it in tarball form from <a href="http://archive.cloudera.com/cdh/3/">Cloudera&#8217;s archive</a>, or as always from <a href="http://github.com/cloudera/flume">Cloudera&#8217;s github repository</a>.</p>
<p><span style="font-size: 13.3333px;">The key functional highlights include:</span></p>
<ul>
<li><span style="font-size: 13.3333px;">Support for <a href="https://issues.cloudera.org/browse/FLUME-29">gzip compressed output files</a>.</span></li>
<li><span style="font-size: 13.3333px;">New and improved sources: <a href="https://issues.cloudera.org/browse/FLUME-36">scribe</a>, <a href="https://issues.cloudera.org/browse/FLUME-54">syslog</a>, <a href="https://issues.cloudera.org/browse/FLUME-26">tailDir</a> (tail all files in a directory)</span></li>
<li><span style="font-size: 13.3333px;">Significant robustness improvements when using the <a href="https://issues.cloudera.org/browse/FLUME-153">disk fail-over and end-to-end reliability modes</a>.</span></li>
<li><span style="font-size: 13.3333px;">Significant robustness improvements when <a href="https://issues.cloudera.org/browse/FLUME-53">reconfiguring, commissioning, and decommissioning logical nodes</a>.</span></li>
</ul>
<p>To improve the documentation and enhance debugging support, we have added:</p>
<ul>
<li><span style="font-size: 13.3333px;">A <a href="http://archive.cloudera.com/cdh/3/flume/UserGuide.html#_extending_via_sink_source_decorator_plugins">new section of the manual</a> that explains how to build your own flume source, sink, and decorator </span><span style="font-size: 13.3333px;">plugins by example.</span></li>
<li><span style="font-size: 13.3333px;">An <a href="https://issues.cloudera.org/browse/FLUME-12">&#8216;ant eclipse&#8217; option</a> to automatically build project files for </span><span style="font-size: 13.3333px;">developing in the Eclipse IDE.</span></li>
<li><span style="font-size: 13.3333px;">Improved error messages in logs, exposed Flume internals such as <a href="https://issues.cloudera.org/browse/FLUME-84">current configuration properties</a>, and <a href="https://issues.cloudera.org/browse/FLUME-83">source/sink catalogs</a> </span><span style="font-size: 13.3333px;">to ease operator and developer debugging and verification.</span></li>
</ul>
<p>For more details, read the <a href="https://issues.cloudera.org/secure/ReleaseNote.jspa?projectId=10010&amp;version=10013">full release notes</a>.</p>
<p><strong>Up and coming Flume features</strong></p>
<p>One of Flume&#8217;s key design principles is extensibility.  We are happy people are taking advantage of this to integrate Flume with other systems.  Some new features currently being developed will enable the next release of Flume to have greater integration with CDH&#8217;s core components as well as other systems in the Hadoop ecosystem.</p>
<p>Here are some of the new major contributions near completion or actively in the works:</p>
<ul>
<li><span style="font-size: 13.3333px;"><a href="https://issues.cloudera.org/browse/FLUME-74">Flume + Hive integration plugin</a>.  Mozilla&#8217;s Anurag Phadke has been </span><span style="font-size: 13.3333px;">working with Cloudera&#8217;s Carl Steinbach to automatically import data ingested by Flume </span><span style="font-size: 13.3333px;">into Hive warehouses.</span></li>
<li><span style="font-size: 13.3333px;"><a href="https://issues.cloudera.org/browse/FLUME-6">Flume + HBase integration plugin</a>.  Several guests at the recent <a href="http://www.cloudera.com/blog/2010/07/notes-from-the-hackathon-at-cloudera/">Cloudera Hackathon</a> improved upon our </span><span style="font-size: 13.3333px;">initial Flume/HBase connector </span><span style="font-size: 13.3333px;">and posted <a href="https://issues.cloudera.org/browse/FLUME-126">it</a> so the community could continue</span><span style="font-size: 13.3333px;"> improving it.  Since then, a more generic design </span><span style="font-size: 13.3333px;">was proposed and Cloudera&#8217;s new intern, Dani Rayan, has volunteered to </span><span style="font-size: 13.3333px;">implement it.</span></li>
<li><span style="font-size: 13.3333px;"><a href="https://issues.cloudera.org/browse/FLUME-20">Flume + Cassandra integration plugin</a>. Tyler Hobbs contributed a first version of this plugin.  It is blocked by some Thrift compatibility and dependency issues.</span></li>
<li><span style="font-size: 13.3333px;"><a href="https://issues.cloudera.org/browse/FLUME-13">Secured data transport via TLS</a>.  Kim Vogt and Ben Standefer from SimpleGeo, with some feedback from David Zuelke of Bitextender, have been working on adding TLS-based wire encryption to the RPC sources and sinks to provide secure data center communications.</span></li>
<li><span style="font-size: 13.3333px;"> <a href="https://issues.cloudera.org/browse/FLUME-197">Flume + Kerberized HDFS integration</a>.  Flume takes its first steps to support the newer versions of HDFS that </span><span style="font-size: 13.3333px;">require Kerberos authentication in order to read from and write to HDFS.</span></li>
<li><span style="font-size: 13.3333px;"><a href="https://issues.cloudera.org/browse/FLUME-132">Generic compression codec support for output files</a>. This enables users to choose from all of the codecs Hadoop supports: gzip, bzip2, and deflate.  It should also enable the LZO codec with a little extra work.</span></li>
<li><span style="font-size: 13.3333px;">Documentation improvements galore.  Currently in the works are a </span><span style="font-size: 13.3333px;">semantics specification for sources and sinks, </span><span style="font-size: 13.3333px;">and step-by-step instructions for connecting </span><span style="font-size: 13.3333px;">Flume to common sources such as  Apache web servers, syslog, and existing scribe loggers</span><span style="font-size: 13.3333px;">. </span></li>
</ul>
<p><span style="font-size: 13.3333px;"><strong>Community </strong></span></p>
<p><span style="font-size: 13.3333px;">We are really grateful to the folks who have been exploring and talking about the project!  The guests who tried out Flume at Cloudera&#8217;s Hackathon day (Dustin Sallings of NorthScale and Ron Bodkin, among others) gave us valuable feedback.  In the past month, Cloudera&#8217;s Henry Robinson presented &#8220;<a href="http://www.slideshare.net/cloudera/inside-flume">Inside Flume</a>&#8221; at Hadoop Day in Seattle.  It is also great to see that some folks are slated to present at <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World 2010</a> about integrating and using Flume.  Otis Gospodnetic of Sematext will be talking about analytics with Flume and HBase.  Also, Anurag Phadke from Mozilla will be presenting a talk about his experiences integrating Flume-collected data automatically into Hive. He recently posted some details in his <a href="http://blog.mozilla.com/data/2010/08/15/collecting-and-analyzing-log-data-via-flume-and-hive/">blog</a>.</span></p>
<p><span style="font-size: 13.3333px;"><span style="font-size: 13.3333px;">It is great to see the community growing and w</span>e love hearing from all of you as well! <span style="font-size: 13.3333px;">It has been two months since Flume was open sourced, and our main github repository now has 136 watchers and 24 forks.  Our <a href="https://groups.google.com/a/cloudera.org/group/flume-user/topics">user mailing list</a> has 102 members and our <a href="https://groups.google.com/a/cloudera.org/group/flume-dev/topics">developers mailing list</a> has 41 members. Please join us! </span><span style="font-size: 13.3333px;">If you are using Flume and want to keep up with where it is going, join the mailing lists and follow us on Twitter at <a href="http://twitter.com/cloudera">@cloudera</a> and <a href="http://twitter.com/#search?q=%23Flume">#flume</a>.  If you need help, just send questions to the mailing lists or chat with us directly in IRC on channel #flume at irc.freenode.net.  To meet the Flume Team and contributors in person, you should join us in New York City at Hadoop World on October 12th!</span></span></p>
<div><span style="font-size: 13.3333px;">It has been a lot of fun so far, and we&#8217;re really looking forward to the following months! </span></div>
<div>
<div><span style="font-size: 13.3333px;">Thanks from everyone on the Cloudera Team.</span></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/flume-community-update-september-2010/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Purdue University’s Saptarshi Guha Interviewed Regarding Hadoop, R and Hadoop World</title>
		<link>http://www.cloudera.com/blog/2010/09/purdue-university%e2%80%99s-saptarshi-guha-interviewed-regarding-hadoop-r-and-hadoop-world/</link>
		<comments>http://www.cloudera.com/blog/2010/09/purdue-university%e2%80%99s-saptarshi-guha-interviewed-regarding-hadoop-r-and-hadoop-world/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 14:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4664</guid>
		<description><![CDATA[In anticipation of Hadoop World 2010 in New York – October 12th, we continue our Q&#38;A series with Hadoop World presenters to provide a taste of what attendees can expect. We’re excited about the 36 presentations that are planned (see agenda) including talks from eBay, Twitter, GE, Bank of America, Facebook, Digg, HP and more. [...]]]></description>
			<content:encoded><![CDATA[<p>In anticipation of <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World 2010</a> in New York – October 12th, we continue our Q&amp;A series with Hadoop World presenters to provide a taste of what attendees can expect. We’re excited about the 36 presentations that are planned (see <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/agenda/">agenda</a>) including talks from eBay, Twitter, GE, Bank of America, Facebook, Digg, HP and more. Tim O’Reilly, founder of <a href="http://oreilly.com/">O’Reilly Media</a> is keynoting, which should be inspiring as well as thought provoking. Everyone who <a href="http://hadoopworld2010.eventbrite.com/">registers</a> for Hadoop World will receive a free copy of the second edition of Tom White’s <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979">Hadoop: The Definitive Guide</a>.</p>
<p>Hadoop World 2010 presenter Saptarshi Guha works in the Department of Statistics at Purdue University. His presentation for Hadoop World is titled “Using R and Hadoop to Analyze VoIP Network Data for QoS.” Guha has been developing with Hadoop and R for over a year.</p>
<h2><strong>Q: What can attendees expect to learn about Hadoop from your presentation at Hadoop World?</strong></h2>
<p>The quality of VoIP calls is susceptible to the queuing effects introduced by the network gateways. The jitter between two consecutive packets is the deviation of the real inter-arrival time from the theoretical inter-arrival time.</p>
<p>We use the R environment for the statistical analysis of data to show jitter follows desired properties and is negligible, which demonstrates that the measured traffic is close to the offered traffic.  Data sets used to study the departure from offered load can be massive and require detailed study of several complex data structures. Using an environment that integrates R and Hadoop, we hope to demonstrate the effectiveness of R and Hadoop for the comprehensive statistical analyses of massive data sets.</p>
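<p>As a rough illustration of the jitter definition above, here is a minimal Python sketch (Python stands in for R here). The function name, the sample timestamps, and the 20 ms offered interval are all assumptions made for the example, not values from the Purdue study.</p>
<pre><code>def jitter(arrival_times_ms, offered_interval_ms):
    """Return per-packet jitter: real inter-arrival gap minus the theoretical gap."""
    # Pair each arrival with the next one to get the real inter-arrival gaps.
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return [gap - offered_interval_ms for gap in gaps]

# Hypothetical arrival timestamps (ms) for packets offered every 20 ms.
arrivals = [0, 20, 41, 60, 82]
print(jitter(arrivals, 20))
</code></pre>
<p>With these made-up timestamps the gaps are 20, 21, 19 and 22 ms, so the jitter values are 0, 1, -1 and 2 ms.</p>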
<h2><strong>Q: Describe use cases for Hadoop at Purdue.</strong></h2>
<p>Our team works with large amounts of network traffic data collected for VoIP and network security projects. Our language of analysis is almost exclusively R and we need a way to store the 190 gigabytes of VoIP related data, create data structures for analysis and compute across these. The R and Hadoop combination allows us to do all of this in a manner that scales with the size of the data and returns results within acceptable time frames. Despite not having HBase installed, we use Hadoop map files and R to query data structures from a database of 14 million objects spanning 21GB within seconds.</p>
<h2><strong>Q: What benefits do you see from Hadoop?</strong></h2>
<p>The biggest wins are the reduction in computing time, the ease of programming in the R and Hadoop environment, and the Hadoop Distributed Filesystem. We have stopped worrying about disk space and freely store as many databases of objects as required. It must be mentioned that Hadoop DFS and MapReduce are both very easy to set up and return very impressive results. For our approach to analysis, the Hadoop MapReduce paradigm fits very well. We partition the data into many subsets (usually by the levels of categorical variables), compute across these and recombine the results. We also visualize a subset of these and recombine the results into multi-panel, multi-page displays, which are viewed across large 30&#8243; monitors.</p>
<h2><strong>Q: What did you use before Hadoop?</strong></h2>
<p>Some of the things we have done were impossible without Hadoop. Before this we used a tree hierarchy of directories of flat files containing R objects and index files to locate objects within these flat files. Distributing computation across our cluster was a laborious, manual and very project-specific affair, but now, using the R and Hadoop system, we have sufficiently abstracted the workflow to span a multitude of data sets.</p>
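<p>The partition, compute, and recombine pattern described above can be sketched in a few lines of Python. This is only an illustration: the team works in R, and the field names (gateway, jitter_ms), the sample records, and the helper functions divide, analyze, and recombine are assumptions made up for this example.</p>
<pre><code>from collections import defaultdict
from statistics import mean

def divide(records, key):
    """Partition records into subsets by the level of a categorical variable."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[key]].append(record)
    return subsets

def analyze(subset):
    """Per-subset computation; here just the mean of a numeric field."""
    return mean(r["jitter_ms"] for r in subset)

def recombine(results):
    """Collect the per-subset results into one sorted summary."""
    return sorted(results.items())

# Hypothetical records; "gateway" plays the role of the categorical variable.
records = [
    {"gateway": "A", "jitter_ms": 1.0},
    {"gateway": "B", "jitter_ms": 4.0},
    {"gateway": "A", "jitter_ms": 3.0},
]
per_subset = {level: analyze(rows) for level, rows in divide(records, "gateway").items()}
summary = recombine(per_subset)
print(summary)
</code></pre>
<p>On a cluster, the divide step corresponds roughly to the map and shuffle phases and the analyze step to the reduce phase; here everything runs in one process to keep the sketch short.</p>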
<h2><strong>Q: How has Hadoop improved your work at Purdue?</strong></h2>
<p>Hadoop has certainly improved our workflow, allowing the researchers to think about studying the data rather than how to distribute code and data, how to maintain a cluster, or how to tackle tedious but vital things such as computer failure. Because the time to compute is substantially less, the researchers have the flexibility to implement their ideas and interactively analyze the data. We hope to increase our cluster size and bring more people into the fold.</p>
<h2><strong>Q: What are you hoping to get out of your time at Hadoop World?</strong></h2>
<p>To demonstrate that it is indeed possible to comprehensively analyze gigabytes of data with a level of detail that was only possible with small data sets and to learn of new Hadoop related technologies that might benefit our workflow.</p>
<p>Hear more from Guha at <a href="http://hadoopworld2010.eventbrite.com/">Hadoop World in New York!</a> For more information regarding the conference itself, click <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/purdue-university%e2%80%99s-saptarshi-guha-interviewed-regarding-hadoop-r-and-hadoop-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Look Back at August Posts</title>
		<link>http://www.cloudera.com/blog/2010/09/summary-of-august-posts/</link>
		<comments>http://www.cloudera.com/blog/2010/09/summary-of-august-posts/#comments</comments>
		<pubDate>Mon, 06 Sep 2010 19:45:09 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4680</guid>
		<description><![CDATA[Migrating to CDH &#8211; August 2
You will learn everything you need to know about migrating with CDH3b2 ranging from why migrate to testing.
Flume community update - August 3
In this blog we address Flume issues, talk about new features, and the improvement of the platform.
Hadoop World: early-bird rate ends on August 11 &#8211; August 9
The early-bird registration [...]]]></description>
			<content:encoded><![CDATA[<h2><a href="http://www.cloudera.com/blog/2010/08/migrating-to-cdh3/"><span style="color: #2f99bb">Migrating to CDH</span></a> &#8211; August 2</h2>
<p>You will learn everything you need to know about migrating to CDH3b2, from why to migrate to how to test.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/flume-community-update-the-first-30-days/"><span style="color: #2f99bb">Flume community update</span></a> &#8211; August 3</strong></h2>
<p>In this blog we address Flume issues, talk about new features, and the improvement of the platform.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/hadoop-world-early-bird-rate-ends-on-august-11/"><span style="color: #2f99bb">Hadoop World: early-bird rate ends on August 11</span></a> &#8211; August 9</strong></h2>
<p>The early-bird registration window may have passed, but it is not too late to register for Hadoop World. <a href="http://hadoopworld2010.eventbrite.com/">Register Now!</a></p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/clouderas-henry-robinson-to-speak-at-hadoop-day-in-seattle/"><span style="color: #2f99bb">Cloudera’s Henry Robinson to speak at Hadoop Day in Seattle</span></a><span style="color: #2f99bb"> </span>- August 10</strong></h2>
<p>At Seattle&#8217;s Hadoop Day Cloudera&#8217;s Henry Robinson gave a speech entitled &#8220;Inside Flume.&#8221; Click the above title to read a short abstract, and the slides of his presentation are available by clicking <a href="http://www.slideshare.net/cloudera/inside-flume">here</a>.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/cdh3b2-release-recap/"><span style="color: #2f99bb">CDH3b2 Release Recap</span></a> &#8211; August 11</strong></h2>
<p>This post briefly explains CDH3b2 and contains links to the components of this package.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/avoiding-common-hadoop-administration-issues/"><span style="color: #2f99bb">Avoiding Common Hadoop Administration Issues</span></a> &#8211; August 12</strong></h2>
<p>Cloudera&#8217;s support services see many issues on a regular basis. This post runs through some common administration issues we have come across and how to avoid them.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/"><span style="color: #2f99bb">Hadoop/HBase Capacity Planning</span></a> &#8211; August 17</strong></h2>
<p>This blog provides guidance in sizing your first Hadoop/HBase cluster.</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/hadoopworld-training/"><span style="color: #2f99bb">Hadoop World: NYC – Training</span></a> &#8211; August 19</strong></h2>
<p>This post contains brief summaries of the training sessions available around Hadoop World. Space is still available, so <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">sign up now!</a></p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/improving-hotel-search-hadoop-orbitz-worldwide/"><span style="color: #2f99bb">Improving Hotel Search: Hadoop @ Orbitz Worldwide</span></a> &#8211; August 23</strong></h2>
<p>Jonathan Seidman provides a use-case from Orbitz, which will be further covered at <a href="http://hadoopworld2010.eventbrite.com/">Hadoop World </a>on October 12th.</p>
<h2><a href="http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/"><span style="color: #2f99bb">Hadoop Administrator Training Comes to London</span></a> &#8211; August 24</h2>
<p>There will be Hadoop training in London for administrators and developers alike. These sessions are on September 9th, so sign up ASAP and use the promotion code in this post for a discount!</p>
<h2><strong><a href="http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/"><span style="color: #2f99bb">Using Hadoop for Fraud Detection and Prevention</span></a> &#8211; August 24</strong></h2>
<p>Hadoop has proven very useful in fraud detection and prevention. This post covers how Hadoop can help solve these problems.</p>
<h2><a href="http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/"><span style="color: #2f99bb">What’s New in Apache Hadoop 0.21</span></a> &#8211; August 26</h2>
<p>Tom White, author of <em>Hadoop: The Definitive Guide</em>, gives us an overview of the changes and improvements in Apache Hadoop 0.21.</p>
<h2><a href="http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/"><span style="color: #2f99bb">Hadoop World 2010: Speaker Highlights</span></a> &#8211; August 30</h2>
<p>A brief glimpse into three of the thirty-six presentations that will be given at Hadoop World. This blog has presentation abstracts for General Electric, Bank of America, and eBay.</p>
<p><strong><em>This concludes the summary of Cloudera blog posts for August 2010. Be sure to follow Cloudera as we will continue to provide updates on Apache Hadoop and Hadoop-related projects.</em></strong></p>
<p><strong><em><a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/"><span style="color: #2f99bb">Be sure to attend Hadoop World October 12th in New York City!</span></a></em></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/summary-of-august-posts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracing with Avro</title>
		<link>http://www.cloudera.com/blog/2010/09/tracing-with-avro/</link>
		<comments>http://www.cloudera.com/blog/2010/09/tracing-with-avro/#comments</comments>
		<pubDate>Fri, 03 Sep 2010 14:00:30 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4639</guid>
		<description><![CDATA[Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.
  
 In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.</strong></em></p>
<p>In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC functionality.</p>
<p>It is common knowledge that tracing in distributed systems can be difficult. In user-facing web services, a front-end function may recursively trigger several function calls to mid and back-tier services. In offline processing, data-center storage layers may distribute data across several hosts, querying one or many of them when a client requests a file. In either case, the inter-dependency of components makes it difficult to pinpoint the source of a slowdown or hang-up when they inevitably occur.</p>
<div>
<p>AvroTrace is designed as a first responder for diagnosing problems in distributed systems that use Avro for RPC transport. It has two components: a real-time monitoring dashboard and an offline trace analyzer. Both run as low-overhead Avro plugins which store and propagate tracing metadata among RPC clients and servers. The monitoring dashboard is accessible via a web interface on any Avro server, delivering a “snapshot” of the most recent RPC activity. The offline analysis tool offers a basic interface for collecting, aggregating, and analyzing this data to identify problem spots. It is largely based on <a href="http://research.google.com/pubs/pub36356.html">Google’s Dapper</a> tracing infrastructure, which is itself inspired by <a href="http://www.x-trace.net/wiki/doku.php">X-Trace</a> and other academic tracing research.</p>
<p>Below is an example trace analysis of a recursive RPC call pattern. In the example application, one remote call, getFile(), triggers two other RPCs, getFileContents() and getFileMeta(). Avro’s tracing has detected this particular pattern and offers a dashboard view summarizing average timing and payload data. It also shows detailed graphs for one of the specific nodes in this pattern, getFileContents(), presenting a visual history of timing (top) and payload (bottom) analytics.</p>
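<p>To make the aggregation idea concrete, here is a small Python sketch that averages call durations for the getFile() pattern described above. It does not use the AvroTrace API; the span tuples, their values, and the average_by_edge helper are hypothetical and exist only for this illustration.</p>
<pre><code>from collections import defaultdict

# Hypothetical span records: (trace id, call name, parent call or None, duration in ms).
spans = [
    (1, "getFile", None, 30),
    (1, "getFileContents", "getFile", 18),
    (1, "getFileMeta", "getFile", 6),
    (2, "getFile", None, 50),
    (2, "getFileContents", "getFile", 32),
    (2, "getFileMeta", "getFile", 10),
]

def average_by_edge(spans):
    """Average duration for each (parent, call) edge across all traces."""
    durations = defaultdict(list)
    for _trace_id, call, parent, duration_ms in spans:
        durations[(parent, call)].append(duration_ms)
    return {edge: sum(values) / len(values) for edge, values in durations.items()}

averages = average_by_edge(spans)
for (parent, call), avg_ms in sorted(averages.items(), key=lambda item: item[1]):
    print(parent, call, avg_ms)
</code></pre>
<p>Grouping by the (parent, call) pair rather than by call name alone keeps the same RPC distinct when it appears in different call patterns.</p>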
<p>Turnkey tracing is just one of many reasons to use Avro.  I recently became a committer on the Avro project and I look forward to supporting and improving trace functionality in the coming months!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png"><img class="aligncenter size-full wp-image-4657" src="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png" alt="" width="700" /></a><em> </em></p>
<h5 style="text-align: center"><em>*Click on any of the graphs or stats for a larger version</em></h5>
<h2><em>Learn more about Avro and other Hadoop projects at </em><em><a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/"><span style="color: #359ac9">Hadoop World!</span></a></em></h2>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/tracing-with-avro/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Infochimps&#8217; President, Philip Kromer, Interviewed Regarding Hadoop and Hadoop World</title>
		<link>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/</link>
		<comments>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 14:00:06 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4632</guid>
		<description><![CDATA[Excitement is building as Hadoop World nears and we are sitting down with some of our presenters to ask them a few questions regarding their presentations and how they are using Hadoop within their organization. Here we speak with Philip Kromer, President of Infochimps, who  answers  questions regarding his presentation, how Hadoop is used in [...]]]></description>
			<content:encoded><![CDATA[<p>Excitement is building as Hadoop World nears and we are sitting down with some of our presenters to ask them a few questions regarding their presentations and how they are using Hadoop within their organization. Here we speak with Philip Kromer, President of <a href="http://infochimps.org/">Infochimps</a>, who answers questions regarding his presentation, how Hadoop is used in his business, and what he aims to get out of <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World</a>. Philip’s presentation at Hadoop World is about the development of a data marketplace and commoditization, and their chimpanzee-style approach to data processing. Attend <a href="http://hadoopworld2010.eventbrite.com/">Hadoop World</a> October 12<sup>th</sup> in New York to hear more from Philip and to talk with him.</p>
<h2><strong>What can attendees expect to learn about Hadoop from your presentation at Hadoop World?</strong></h2>
<p>We&#8217;re now able to quantify aspects of human behavior never before accessible. Twitter, the news stream, and the Smart Grid are exquisite lab instruments for measuring &#8216;Conversation&#8217;, &#8216;Interest&#8217;, and &#8216;Activity&#8217;. What&#8217;s more, with enough data, machine-learning algorithms and big data tools let us expose insight using only the <em>structure</em>, not the content, of the data. The massive quantity and connectivity required demand industrial-strength tools such as Hadoop.</p>
<p>We do <em>all</em> our data processing in high-level tools (chiefly Pig and Wukong) &#8212; &#8220;black boxes with flexible glue&#8221;. We use &#8216;programmer fun&#8217; + &#8216;programmer time&#8217; as our primary development metrics. Together, writing simple, loosely coupled scripts lets us run the fast experiment-driven design cycles that a lean startup demands. It has also let us grow our own talent and recruit outside CS (physicists, in particular, dream in map reduce). I think this approach should have strong appeal to small- and medium-sized businesses, or anyone looking for a low barrier to adoption of Hadoop.</p>
<h2><strong>Do you have Hadoop in production use today? </strong></h2>
<p>We have Hadoop in heavy production use for ad-hoc analysis and for automated processes digesting terabytes of data.</p>
<h2><strong>Can you describe some use cases for Hadoop in your business?</strong></h2>
<p>We have scraped data from around the web, principally from social networks. We use Hadoop to process it on its own and to mash it up with other open &amp; commercial datasets.</p>
<p>Examples:</p>
<ul>
<li>We have a collection of 3 billion tweets (Twitter messages) from 60+ million users that we tokenize into 16B+ usages of 65M terms, more than a terabyte of data on its own. Using Pig and Wukong, we can identify whom to follow, understand how events and news stories resonate, and even find dates.</li>
<li>MLB has released a dataset describing the trajectory and full game state for every pitch of every game for the past several seasons.  Smashing this against hourly weather data produces a laboratory with the potential to describe the physics of a knuckleball or how a pitcher&#8217;s performance varies with age and game-time temperature.</li>
</ul>
<h2><strong>How do you support Hadoop?</strong></h2>
<p>Operationally, we use the Amazon cloud and a collection of Chef recipes (that we&#8217;ve open-sourced). These let us spin up, use, and spin down clusters of one to hundreds of machines, using either local (persistent) HDFS or just push/pull from Amazon S3.</p>
<p>We have also been supporting Hadoop by giving back to the Hadoop open-source community.</p>
<ul>
<li>Wukong (our Ruby-language toolkit for Hadoop), which we believe is the easiest and most fun way to write map-reduce programs.</li>
<li>At Hadoop World we&#8217;ll be announcing Chimpmark, a target benchmark for implementers and users of big data tools. It&#8217;s a collection of large-scale datasets, accompanying challenges, and reference implementations that let you profile, tune, and more deeply understand your Hadoop system.</li>
<li>ClusterChef, the cluster management toolkit I described above.</li>
</ul>
<h2><strong>How has Hadoop improved your business?</strong></h2>
<p>Most of the stuff we use Hadoop for would be otherwise impossible.</p>
<h2><strong>What are you hoping to get out of your time at Hadoop World?</strong></h2>
<ul>
<li>Learn ideas.</li>
<li>Popularize and receive feedback on the development of a data marketplace.</li>
<li>Hear where the world of Big Data is going.</li>
</ul>
<p style="text-align: center">At <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/">Hadoop World</a> you can hear more from Philip Kromer as well as any of the thirty-five other presenters! <a href="http://hadoopworld2010.eventbrite.com/">Click here to register right away!</a><br />
<a href="http://hadoopworld2010.eventbrite.com/"><img class="size-full wp-image-4403  aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/infochimps-president-philip-kromer-interviewed-regarding-hadoop-and-hadoop-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Register for Hadoop Training in New York and Get into Hadoop World for Free!</title>
		<link>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/</link>
		<comments>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 14:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4618</guid>
		<description><![CDATA[That’s right, sign up for any of the training courses surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.
If you are a manager trying to [...]]]></description>
			<content:encoded><![CDATA[<p>That’s right, sign up for any of the <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training courses</a> surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.</p>
<p>If you are a manager trying to decide whether Hadoop is an appropriate technology for your organization, <a href="http://www.eventbrite.com/event/762237874">Hadoop Essentials for Managers</a> will answer your questions. We will show you when using Hadoop is appropriate, what Hadoop is being used for in a range of industries, how Hadoop fits into your existing environment and what you need to know in order to deploy it within your organization.</p>
<p>Why not turn your Hadoop World trip into a multi-day Hadoop learning extravaganza by attending one of our two-day sessions? Both the <a href="http://www.eventbrite.com/event/762320120">developer</a> and <a href="http://www.eventbrite.com/event/762677188">administrator</a> training courses culminate in an exam which, when passed, confers Cloudera Certified Hadoop Developer or Administrator status.</p>
<p>For developers who already understand Hadoop and are ready to use Hive and Pig for data analysis, there is a <a href="http://www.eventbrite.com/event/762318114">two-day class</a> that teaches how to process data using filters, joins, user-defined functions, and more.</p>
<p>For those looking to deploy HBase, consider our one-day HBase <a href="http://www.eventbrite.com/event/762317111">training session</a>. Learn how to use HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. This class covers HBase architecture, data modeling, and the Java API as well as some advanced topics and best practices.</p>
<p>If you’re a developer who is completely new to Hadoop, we have put together a <a href="http://www.eventbrite.com/event/762326138">course</a> that will provide you with a solid foundation in large scale data processing using MapReduce and Hadoop. This course is purposely offered the day before Hadoop World, so that while in attendance you will be able to better grasp the topics at the conference with your fresh Hadoop knowledge. Once you have taken this course and are comfortable with Hadoop, feel free to also enroll in a <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training</a> course followed by Certification to document your new-found Hadoop knowledge.</p>
<p>For developers who wish to simplify interacting with Hadoop, <a href="http://www.eventbrite.com/event/764021208">Cloudera HUE</a> provides back- and front-end APIs to deliver a rich, web-based, graphical user experience. This <a href="http://www.eventbrite.com/event/764021208">class</a> covers using the HUE APIs to develop your own rich, graphical applications built on top of the HUE platform.</p>
<p>Once again, you will receive free entry to <a href="http://www.eventbrite.com/event/764021208">Hadoop World</a> if you are registered in any of the <a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/training/">training sessions</a> surrounding the event! Don’t miss out on this opportunity to broaden your knowledge, and we hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/register-for-hadoop-training-in-new-york-and-get-into-hadoop-world-for-free/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2010: Speaker Highlights</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/#comments</comments>
		<pubDate>Mon, 30 Aug 2010 15:00:58 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopworld]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4436</guid>
		<description><![CDATA[Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises.  Below are [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises.  Below are three examples of leading enterprises that will present how Hadoop has impacted their businesses.</p>
<p style="text-align: center"><a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/"><img class="size-full wp-image-4403 aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></a></p>
<p style="text-align: justify"><strong><a href="http://www.ge.com">GE, Product Manager, Linden Hillenbrand</a></strong>, will be talking about how Hadoop has improved GE’s Marketing &amp; Communications functions.  One capability GE has implemented is assessing the external perception of GE&#8211;positive, neutral, or negative&#8211;through various marketing campaigns.</p>
<p style="text-align: justify"><strong><a href="http://www.bofa.com">Managing Director, Big Data &amp; Analytics at Bank of America, Abhishek Mehta</a></strong>, will present “The Business of Big Data.” This presentation will discuss how an organization with established and legacy infrastructure, technology and business processes can adopt Hadoop technologies and processes to find groundbreaking solutions to known problems.</p>
<p style="text-align: justify"><strong><a href="http://www.ebay.com">eBay Engineering director of Analytical Platform Development, Anil Madan</a></strong>, is presenting “Hadoop at eBay.” One of eBay’s largest assets is the large amount of user data it has collected. By loading huge volumes of this data into an HDFS cluster and running clickstream and transactional data analysis, eBay gains a better understanding of user behavior as well as search quality.</p>
<p style="text-align: justify">Hadoop World is a great way to learn how Hadoop is being used to power today’s modern enterprises. These presentations will help you understand how Hadoop improves your data storage and processing environment and directly impacts your business.</p>
<p style="text-align: justify">Don’t miss out! <a href="http://hadoopworld2010.eventbrite.com/">Register now</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-world-2010-speaker-highlights/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What’s New in Apache Hadoop 0.21</title>
		<link>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/</link>
		<comments>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 23:53:29 +0000</pubDate>
		<dc:creator>Tom White</dc:creator>
				<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4519</guid>
		<description><![CDATA[Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it&#8217;s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org/common/docs/r0.21.0/">Apache Hadoop 0.21.0</a> was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it&#8217;s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (<a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310240&amp;fixfor=12313563">Common</a>, <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310942&amp;fixfor=12314046">HDFS</a>, <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;pid=12310941&amp;fixfor=12314045">MapReduce</a>), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn&#8217;t suitable for production use.</p>
<p>With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can&#8217;t hope to cover everything, so please consult the release notes (<a href="http://hadoop.apache.org/common/docs/r0.21.0/releasenotes.html">Common</a>, <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/releasenotes.html">HDFS</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/releasenotes.html">MapReduce</a>) and the change logs (<a href="http://hadoop.apache.org/common/docs/r0.21.0/changes.html">Common</a>, <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/changes.html">HDFS</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/changes.html">MapReduce</a>) for the full details. Also, please let us know in the comments of any features, improvements, or bug fixes that you are excited about.</p>
<p>You can download Hadoop 0.21.0 from an <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/">Apache Mirror</a>. Thanks to everyone who contributed to this release!</p>
<p><span id="more-4519"></span></p>
<h2>Project Split</h2>
<p>Organizationally, a significant chunk of work has arisen from the project split, which transformed a single Hadoop project (called Core) into three constituents: <a href="http://hadoop.apache.org/common">Common</a>, <a href="http://hadoop.apache.org/hdfs">HDFS</a>, and <a href="http://hadoop.apache.org/mapreduce">MapReduce</a>. HDFS and MapReduce both have dependencies on Common, but (other than for running tests) MapReduce has no dependency on HDFS. This separation emphasizes the fact that MapReduce can run on alternative distributed file systems (although HDFS is still the best choice for sheer throughput and scalability), and it has made following development easier since there are now separate lists for each subproject. There is one release tarball still, however, although it is laid out a little differently from previous releases, since it has a subdirectory containing each of the subproject source files.</p>
<p>From a user&#8217;s point of view little has changed as a result of the split. The configuration files are divided into <em>core-site.xml</em>, <em>hdfs-site.xml</em>, and <em>mapred-site.xml</em> (this was supported in 0.20 too), and the control scripts are now broken into three (<a href="https://issues.apache.org/jira/browse/HADOOP-4868">HADOOP-4868</a>): in addition to the <em>bin/hadoop</em> script, there is a <em>bin/hdfs</em> script and a <em>bin/mapreduce</em> script for running HDFS and MapReduce daemons and commands, respectively. The <em>bin/hadoop</em> script still works as before, but issues a deprecation warning. Finally, you will need to set the <code>HADOOP_HOME</code> environment variable to have the scripts work smoothly.</p>
<h2>Common</h2>
<p>The 0.21.0 release is technically a minor release (traditionally Hadoop 0.x releases have been major, and have been allowed to <a href="http://wiki.apache.org/hadoop/Roadmap">break compatibility</a> with the previous 0.x-1 release) so it is API compatible with 0.20.2. To make the intended stability and audience of a particular API in Hadoop clear to users, all Java members with public visibility have been marked with <strong>classification annotations</strong> to say whether they are <code>Public</code>, or <code>Private</code> (there is also <code>LimitedPrivate</code> which signifies another, named, project may use it), and whether they are <code>Stable</code>, <code>Evolving</code>, or <code>Unstable</code> (<a href="https://issues.apache.org/jira/browse/HADOOP-5073">HADOOP-5073</a>). Only elements marked as <code>Public</code> appear in the user Javadoc (<a href="http://hadoop.apache.org/common/docs/r0.21.0/api/index.html">Common</a>, <a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/index.html">MapReduce</a>; note that HDFS is all marked as private since it is accessed through the <code>FileSystem</code> interface in Common). The classification interface is described in detail in <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/towards_enterpriseclass_compat.html">Towards Enterprise-Class Compatibility for Apache Hadoop</a> by Sanjay Radia.</p>
<p>This release has seen some significant improvements to <strong>testing</strong>. The <strong>Large-Scale Automated Test Framework</strong>, known as Herriot (<a href="https://issues.apache.org/jira/browse/HADOOP-6332">HADOOP-6332</a>), allows developers to <a href="http://wiki.apache.org/hadoop/HowToUseSystemTestFramework">write tests</a> that run against a real (possibly large) cluster. While there are only a dozen or so tests at the moment, the intention is that more tests will be written over time so that regression tests can be shared and run against new Hadoop release candidates, thereby making Hadoop upgrades more predictable for users.</p>
<p>Hadoop 0.21 also introduces a <strong>fault injection framework</strong>, which uses AOP to <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/faultinject_framework.html">inject faults</a> into a part of the system that is running under test (e.g. a datanode), and asserts that the system reacts to the fault in the expected manner. Complementing fault injection is mock object testing, which tests code &#8220;in the small&#8221;, at the class-level rather than the system-level. Hadoop has a growing number of <strong>Mockito-based tests</strong> for this purpose (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1050">MAPREDUCE-1050</a>). </p>
<p>Among the many other improvements and new features, a couple of small ones stand out: the ability to <strong>retrieve metrics and configuration</strong> from Hadoop daemons by accessing the URLs <em>/metrics</em> and <em>/conf</em> in a browser (<a href="https://issues.apache.org/jira/browse/HADOOP-5469">HADOOP-5469</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6408">HADOOP-6408</a>).</p>
<h2>HDFS</h2>
<p>Support for <strong>appends</strong> in HDFS has had a rocky history. The feature was introduced in the 0.19.0 release, and then disabled in 0.19.1 due to <a href="https://issues.apache.org/jira/browse/HADOOP-5224">stability issues</a>. The good news is that the append call is back in 0.21.0 with a brand new implementation (<a href="https://issues.apache.org/jira/browse/HDFS-265">HDFS-265</a>), and may be accessed via <code>FileSystem</code>&#8217;s <code>append()</code> method. Closely related&mdash;and more interesting for many applications, such as HBase&mdash;is the <code>Syncable</code> interface that <code>FSDataOutputStream</code> now implements, which brings sync semantics to HDFS (<a href="https://issues.apache.org/jira/browse/HADOOP-6313">HADOOP-6313</a>).</p>
<p>Hadoop 0.21 has a <strong>new filesystem API</strong>, called <code>FileContext</code>, which makes it easier for applications to work with multiple filesystems (<a href="https://issues.apache.org/jira/browse/HADOOP-4952">HADOOP-4952</a>). The API is not in widespread use yet (e.g. it is not integrated with MapReduce), but it has some features that the old <code>FileSystem</code> interface doesn&#8217;t, notably support for <strong>symbolic links</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-6421">HADOOP-6421</a>, <a href="https://issues.apache.org/jira/browse/HDFS-245">HDFS-245</a>).</p>
<p>The <strong>secondary namenode has been deprecated</strong> in 0.21. Instead you should consider running a <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_user_guide.html#Checkpoint+Node">checkpoint node</a> (which essentially acts like a secondary namenode) or a <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_user_guide.html#Backup+Node">backup node</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-4539">HADOOP-4539</a>). By using a backup node you no longer need an NFS-mount for namenode metadata, since it accepts a stream of filesystem edits from the namenode, which it writes to disk.</p>
<p>New in 0.21 is the <strong>offline image viewer</strong> (oiv) for HDFS image files (<a href="https://issues.apache.org/jira/browse/HADOOP-5467">HADOOP-5467</a>). This <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_imageviewer.html">tool</a> allows admins to analyze HDFS metadata without impacting the namenode (it also works with older versions of HDFS). There is also a <strong>block forensics tool</strong> for finding corrupt and missing blocks from the HDFS logs (<a href="https://issues.apache.org/jira/browse/HDFS-567">HDFS-567</a>).</p>
<p>Modularization continues in the platform with the introduction of <strong>pluggable block placement</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-385">HDFS-385</a>), an expert-level interface for developers who want to try out new placement algorithms for HDFS. </p>
<p>Other notable new features include:</p>
<ul>
<li>Support for efficient <strong>file concatenation in HDFS</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-222">HDFS-222</a>)</li>
<li><strong>Distributed RAID filesystem</strong> (<a href="https://issues.apache.org/jira/browse/HDFS-503">HDFS-503</a>) &#8211; an erasure coding filesystem running on HDFS, designed for archival storage since the replication factor is reduced from 3 to 2, while keeping the likelihood of data loss about the same. (Note that the RAID code is a MapReduce contrib module since it has a dependency on MapReduce for generating parity blocks.)</li>
</ul>
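<p>To see why dropping the replication factor from 3 to 2 matters for archival storage, here is a rough back-of-the-envelope sketch. The stripe length, the number of parity blocks, and the parity replication used below are illustrative assumptions, not values taken from HDFS-503.</p>

```python
def bytes_stored_per_user_byte(data_replicas, parity_blocks=0, stripe_length=1, parity_replicas=0):
    """Return raw bytes kept on disk for each byte of user data."""
    parity_overhead = parity_replicas * parity_blocks / stripe_length
    return data_replicas + parity_overhead


# Plain HDFS with a replication factor of 3 and no parity.
plain = bytes_stored_per_user_byte(data_replicas=3)

# Hypothetical RAID layout: 2 data replicas plus 1 parity block per
# 10-block stripe, with the parity block itself stored twice.
raided = bytes_stored_per_user_byte(
    data_replicas=2, parity_blocks=1, stripe_length=10, parity_replicas=2
)

savings = 1 - raided / plain
print(f"plain={plain:.1f}x raided={raided:.1f}x savings={savings:.0%}")
```

<p>Under these assumed numbers the parity data costs far less than the third replica it replaces, which is the trade the RAID module is making.</p>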
<h2>MapReduce</h2>
<p>The biggest user-facing change in MapReduce is the status of the <strong>new API</strong>, sometimes called &#8220;context objects&#8221;. The new API is now more broadly supported since the MapReduce libraries (in <code>org.apache.hadoop.mapreduce.lib</code>) have been ported to use it (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-334">MAPREDUCE-334</a>). The examples all use the new API too (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-271">MAPREDUCE-271</a>). Nevertheless, to give users more time to migrate to the new API, the old API has been un-deprecated in this release (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1735">MAPREDUCE-1735</a>), which means that existing programs will compile without deprecation warnings.</p>
<p>The <code>LocalJobRunner</code> (for trying out MapReduce programs on small local datasets) has been enhanced to make it more like running MapReduce on a cluster. It now supports the <strong>distributed cache</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-476">MAPREDUCE-476</a>), and can <strong>run mappers in parallel</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-1367">MAPREDUCE-1367</a>).</p>
<p>Distcp has seen a number of small improvements too, such as <strong>preserving file modification times</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-5620">HADOOP-5620</a>), <strong>input file globbing</strong> (<a href="https://issues.apache.org/jira/browse/HADOOP-5472">HADOOP-5472</a>), and <strong>preserving the source path</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-642">MAPREDUCE-642</a>).</p>
<p>Continuing the testing theme, this release is the first to feature <strong>MRUnit</strong>, a contrib module that helps users write unit tests for their MapReduce jobs (<a href="https://issues.apache.org/jira/browse/HADOOP-5518">HADOOP-5518</a>).</p>
<p>Other new contrib modules include <strong>Rumen</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>) and <strong>Mumak</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-728">MAPREDUCE-728</a>), tools for modelling MapReduce. The two are designed to work together: Rumen extracts job data from historical logs, which Mumak then uses to simulate MapReduce applications and clusters on a cluster. <a href="http://developer.yahoo.net/blogs/hadoop/2010/04/gridmix3_emulating_production.html">Gridmix3</a> is also designed to work with Rumen traces. The <strong>job history log analyzer</strong> is another tool that gives information about MapReduce cluster utilization (<a href="https://issues.apache.org/jira/browse/HDFS-459">HDFS-459</a>).</p>
<p>On the job scheduling front there have been updates to the Fair Scheduler, including <strong>global scheduling</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-548">MAPREDUCE-548</a>), <strong>preemption</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-551">MAPREDUCE-551</a>), and support for <strong>FIFO pools</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-706">MAPREDUCE-706</a>). Similarly, the Capacity Scheduler now supports <strong>hierarchical queues</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-824">MAPREDUCE-824</a>), and admin-defined <strong>hard limits</strong> (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-532">MAPREDUCE-532</a>). There is also a brand new scheduler, the Dynamic Priority Scheduler, which <a href="https://issues.apache.org/jira/browse/HADOOP-4768?focusedCommentId=12763348&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12763348">dynamically changes queue shares using a pricing model</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-4768">HADOOP-4768</a>).</p>
<p><strong>Smarter speculative execution</strong> has been added to all schedulers using a more robust algorithm, called <a href="http://www.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia_html/">Longest Approximate Time to End (LATE)</a> (<a href="https://issues.apache.org/jira/browse/HADOOP-2141">HADOOP-2141</a>).</p>
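<p>The core idea of LATE, as described in the linked paper, is to estimate how long each running task still needs and to speculate on the one expected to finish last. The sketch below illustrates only that estimate; it is not Hadoop's scheduler code, the task names and numbers are made up, and it leaves out the paper's additional thresholds (for example, the cap on concurrent speculative tasks).</p>

```python
def estimated_time_left(progress, elapsed_seconds):
    """Estimate seconds to completion as remaining work / observed progress rate."""
    if progress == 0 or elapsed_seconds == 0:
        return float("inf")  # no progress observed yet
    progress_rate = progress / elapsed_seconds
    return (1.0 - progress) / progress_rate


def pick_speculation_candidate(tasks):
    """Return the id of the running task expected to finish last.

    tasks maps a task id to (progress in [0, 1], elapsed seconds).
    """
    running = {tid: stats for tid, stats in tasks.items() if stats[0] != 1.0}
    if not running:
        return None
    return max(running, key=lambda tid: estimated_time_left(*running[tid]))


# Hypothetical task snapshot: (progress, elapsed seconds).
tasks = {
    "map_0": (0.9, 90),
    "map_1": (0.2, 100),
    "map_2": (0.5, 60),
}
candidate = pick_speculation_candidate(tasks)
print(candidate)
```

<p>Ranking by estimated time to end, rather than by raw progress, favors the task whose backup copy has the best chance of overtaking the original.</p>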
<p>Finally, a couple of smaller changes:</p>
<ul>
<li><strong>Streaming combiners</strong> are now supported, so that the <code>-combiner</code> option may specify any streaming script or executable, not just a Java class. (<a href="https://issues.apache.org/jira/browse/HADOOP-4842">HADOOP-4842</a>)</li>
<li>On the successful completion of a job, the MapReduce runtime creates a <strong><em>_SUCCESS</em> file</strong> in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (<a href="https://issues.apache.org/jira/browse/MAPREDUCE-947">MAPREDUCE-947</a>)</li>
</ul>
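<p>As an illustration of what a streaming combiner script might do, here is a minimal sketch that sums counts per key, assuming tab-separated key and integer count records (as in a word count job). The function name and the sample records are hypothetical.</p>

```python
from collections import defaultdict


def combine(lines):
    """Sum integer counts per key from tab-separated 'key, count' lines."""
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        totals[key] += int(value)
    # Emit one partial sum per key, in key order.
    return [f"{key}\t{totals[key]}" for key in sorted(totals)]


# Hypothetical map output records.
sample = ["apple\t1\n", "pear\t2\n", "apple\t3\n"]
combined = combine(sample)
print(combined)
```

<p>Used as a real combiner, the same function would be fed <code>sys.stdin</code> and its results printed one per line, with the script named in the <code>-combiner</code> option.</p>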
<h2>What&#8217;s Not In</h2>
<p>Finally, it bears mentioning what didn&#8217;t make it into 0.21.0. The biggest omission is the new Kerberos authentication work from Yahoo! While a majority of the patches are included, security is turned off by default, and is unlikely to work if enabled (certainly there is no guarantee that it will provide any level of security, since it is incomplete). A full working security implementation will be available in 0.22, and also the next version of <a href="http://www.cloudera.com/hadoop/">CDH</a>.</p>
<p>Also, Sqoop, which was initially developed as a Hadoop contrib module, is not in 0.21.0, since it was moved out to become a standalone open source project <a href="http://wiki.github.com/cloudera/sqoop/">hosted on GitHub</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/what%e2%80%99s-new-in-apache-hadoop-0-21/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using Hadoop for Fraud Detection and Prevention</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/#comments</comments>
		<pubDate>Wed, 25 Aug 2010 05:27:20 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[fraud]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4478</guid>
		<description><![CDATA[Learn about fraud and how to prevent it with Hadoop]]></description>
			<content:encoded><![CDATA[<p>Fraud has multiple meanings and the term can be easily abused.  The definition of fraud has undergone multiple changes throughout the years and is as elusive as fraud itself.  The modern legal definition of fraud usually contains a few elements that have to be proven in court, and these depend on the state or country.  For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.  A more general definition may contain up to <a href="http://en.wikipedia.org/wiki/Fraud#Elements_of_fraud">9 elements</a>.</p>
<p>
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact on the organization.</p>
<p>
Both definitions emphasize that the event is rare (assuming that most of the population consists of law-abiding citizens) and intentional (there is no “accidental” fraud), and both imply significant damage to the defrauded party (otherwise why bother).  Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, so it is difficult to build a predictive model, and (b) fraud assumes a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.</p>
<p><h3>Fraud and Rare Events</h3>
<p>By definition, fraud is an unexpected or rare event with significant financial or other damage.  Fraud assumes that the fraudster has some prior information about how the current system works, including previous successful and unsuccessful fraud cases and possibly the fraud detection mechanisms.  This breaks the standard statistical modeling assumption that observations are independent and identically distributed (i.i.d.), which makes building a reliable statistical model difficult.  Often the fraudster is working in the same industry that the fraud detection is supposed to protect, is intimately familiar with the fraud detection methods, and is actively trying to avoid detection by masquerading.</p>
<p>
The rare-event detection problem also applies to online advertising and marketing, particularly to predicting “long tail” events, and to terrorism detection.</p>
<p>
One common example of fraud is associated with the <a href="http://en.wikipedia.org/wiki/Taleb_distribution" target="_blank">Taleb distribution</a>, where a high probability of a small gain overshadows a small probability of a large loss that more than outweighs the gains.  Relatively long periods of slightly better than moderate gains are interrupted by a rare event with large losses.  It is easy to defraud investors by presenting the results of a partial analysis that excludes the “rare events”.</p>
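<p>To make the arithmetic concrete, here is a minimal Python sketch with purely hypothetical numbers (a 99% chance of gaining 1 unit and a 1% chance of losing 150 units; neither figure comes from real data).  It compares the expected return of the full distribution with the return reported when the rare loss is left out of the analysis.</p>

```python
def expected_return(outcomes):
    """Expected value of a list of (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical payoff profile: frequent small gain, rare large loss.
outcomes = [(0.99, 1.0), (0.01, -150.0)]

full = expected_return(outcomes)

# "Partial analysis": drop the rare loss and renormalize the probabilities.
gains_only = [(p, payoff) for p, payoff in outcomes if payoff > 0]
total_p = sum(p for p, _ in gains_only)
partial = expected_return([(p / total_p, payoff) for p, payoff in gains_only])

print(f"full: {full:.2f}, rare events excluded: {partial:.2f}")
```

<p>Under these assumed numbers the full expected return is negative (about -0.51 per period), while the partial analysis reports a steady gain of 1.00.</p>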
<h3>Fraud Prevention</h3>
<p>Since fraud is so hard to prove in court, most organizations and individuals try to prevent fraud from happening through blanket measures.  These include limiting the amount of damage the fraudster can inflict on the organization as well as early detection of fraud patterns.  For example, credit card companies can cut credit limits across the board in anticipation of a few fraud cases.  Advertisers can block advertising campaigns with a low number of qualifying events.  And anti-terrorism agencies can prevent people with bottles of pure water from boarding planes.  These actions often conflict with a company's efforts to attract more customers and result in general dissatisfaction.  To the rescue come technologies such as Hadoop, influence diagrams, and Bayesian networks.  Inference in such models is computationally expensive (NP-hard in the general case, in computer science terminology), but they can be more accurate and predictive.</p>
<h3>Why Hadoop?</h3>
<p>Hadoop is a distributed system for processing large amounts of data.  At the recent Hadoop Summit 2010, Yahoo, Facebook, and other companies announced that they currently process a few terabytes of data per day and that the volumes are <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_omalley.html" target="_blank">growing at exponential rates</a>.  Hadoop can be vital for solving the fraud detection problem because:</p>
<ol>
<li>Sampling does not work for rare events, since missing even a few positive fraud cases leads to significant deterioration of model quality.</li>
<li>Hadoop can solve much harder problems by leveraging multiple cores across thousands of machines and searching through much larger problem domains.</li>
<li>Hadoop can be combined with other tools to meet moderate to low response latency requirements.</li>
</ol>
<p>
Let’s go through these reasons one by one.  Sampling is a common technique for modeling rare events.  One of the problems with sampling is that we cannot afford to throw away rare positive cases.  Even in a stratified or proportional sampling scheme one has to retain all positive cases, since the model accuracy heavily depends on them (one can usually discard some negative cases, though).  As a result, the system still has to go through the whole dataset to separate the positive cases from the negative ones.</p>
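<p>A minimal sketch of such a scheme in plain Python follows.  The function name, the <code>is_fraud</code> predicate, the <code>negative_rate</code> parameter, and the per-record weights are assumptions made for illustration, not part of any particular library.  Every positive case is kept, negatives are kept with a fixed probability, and each kept negative carries a weight so that a model can later correct for the downsampling.  Note that the loop still visits every record once.</p>

```python
import random

def downsample_negatives(records, is_fraud, negative_rate, seed=0):
    """Keep every fraud record; keep each non-fraud record with
    probability negative_rate. Returns (record, weight) pairs."""
    rng = random.Random(seed)
    sample = []
    for record in records:
        if is_fraud(record):
            # Rare positives are never discarded.
            sample.append((record, 1.0))
        elif negative_rate > rng.random():
            # A kept negative stands in for 1 / negative_rate originals.
            sample.append((record, 1.0 / negative_rate))
    return sample
```

<p>For example, <code>downsample_negatives(records, lambda r: r["fraud"], 0.25)</code> would keep all fraud records and roughly a quarter of the rest, assuming each record is a dictionary with a hypothetical <code>fraud</code> flag.</p>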
<p>
Hadoop is known for its number-crunching power.  Nothing can compare with the throughput of thousands of machines, each of which has multiple cores.  As was reported recently at the Hadoop Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8 to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the same time.  This allows either (a) looking through longer periods of time to incorporate events across a larger time frame or (b) taking more sources of information into account.  It is quite common among social network companies to comb through Twitter and blogs in search of relevant data.</p>
<p>
Finally, one of the fraud prevention problems is latency.  Agencies want to react to an event as soon as possible, often within a few minutes of the event.  Yahoo recently reported that it can adjust its behavioral model in response to a user click event within 5-7 minutes across several hundred million customers and billions of events per day.  Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds so that they can be analyzed using MapReduce.</p>
<p>
Fraud detection is often akin to “finding a needle in a haystack”.  One has to go through mountains of relevant and seemingly irrelevant information, build dependency models, evaluate the impact, and thwart the fraudster's actions.  Hadoop helps with finding patterns by processing mountains of information on thousands of cores in a relatively short amount of time.</p>
<h3>Where to look next?</h3>
<p>Techniques for fraud detection are industry-specific as a rule and are often kept confidential, since they obviously represent valuable information for potential fraudsters.  Moreover, fraud detection techniques are usually a moving target, since fraudsters quickly adjust to new detection mechanisms.</p>
<p>
One of the most publicized technical frauds is click fraud in online advertising.  Advertisers are often charged on a per-click basis in so-called pay-per-click (PPC) campaigns, so the traffic provider, such as a search web site, has a clear incentive to inflate the number of clicks.  (Advertisers can also be charged on a per-conversion basis, which we will cover shortly, but a different type of fraud emerges there, where the advertiser tries to conceal the conversions.)  Additionally, an advertiser's competitor may be incentivized to inflate the number of clicks to erode the original advertiser's margin.  This can be achieved by a human or software agent that generates extra traffic and clicks on the competitor's site.  Fraud management companies like <a href="http://www.fraudwall.com/" target="_blank">Anchor Intelligence</a> and <a href="http://www.clickforensics.com/" target="_blank">Click Forensics</a> estimate that approximately 20% to 30% of all clicks are fraudulent.  How do we know that a click is fraudulent?</p>
<p>
A decline in the number of conversions: first and most important, if your conversion rate is normally healthy (that is, you are making a profit on your ad) and all of a sudden it drops sharply, this could be a sign of click fraud in action.  Click fraud causes extra clicks on your ad with no actual purchases, and your conversion rate will fall accordingly.</p>
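<p>As a rough illustration, the check below compares the conversion rate of a recent window against a baseline window.  It is a sketch under stated assumptions: the function names, the <code>(clicks, conversions)</code> tuple layout, and the 50% threshold are hypothetical and would need tuning for a real campaign.</p>

```python
def conversion_rate(clicks, conversions):
    """Fraction of clicks that led to a conversion (0.0 if no clicks)."""
    return conversions / clicks if clicks else 0.0

def conversion_drop(baseline, current, max_relative_drop=0.5):
    """Return True when the current conversion rate has fallen by more
    than max_relative_drop relative to the baseline rate.

    baseline and current are (clicks, conversions) tuples."""
    base = conversion_rate(*baseline)
    now = conversion_rate(*current)
    if base == 0.0:
        # No meaningful baseline to compare against.
        return False
    return (base - now) / base > max_relative_drop
```

<p>With a baseline of 50 conversions on 1,000 clicks, a window with 45 conversions on 3,000 clicks would be flagged: the clicks tripled while purchases stayed flat.</p>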
<p>
An abnormal number of clicks from the same IP address or a pattern in the access times: although this is the most obvious and easily identified form of click fraud, it is amazing how many fraudsters still use this method, particularly for quick attacks.  They may choose to strike over a long weekend when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted.  Some of these clicks might be unintentional, for example when a user reloads a page.</p>
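<p>Counting clicks per IP address maps naturally onto the MapReduce model.  The sketch below emulates the map, shuffle, and reduce phases in plain Python on a single machine; it does not use the Hadoop API.  The record layout (a dictionary with an <code>ip</code> key) and the <code>max_clicks</code> threshold are assumptions for illustration.</p>

```python
from collections import defaultdict

def map_click(click):
    """Map phase: emit an (ip, 1) pair for each click record."""
    yield click["ip"], 1

def reduce_counts(ip, counts):
    """Reduce phase: total the clicks seen for one IP address."""
    return ip, sum(counts)

def suspicious_ips(clicks, max_clicks):
    """Return {ip: total} for every IP with more than max_clicks clicks."""
    # Shuffle: group mapped values by key, as the framework would.
    grouped = defaultdict(list)
    for click in clicks:
        for ip, one in map_click(click):
            grouped[ip].append(one)
    totals = dict(reduce_counts(ip, counts) for ip, counts in grouped.items())
    return {ip: n for ip, n in totals.items() if n > max_clicks}
```

<p>On a real cluster the grouping step is performed by the framework, and the same map and reduce logic runs in parallel over the full click log.</p>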
<p>
A large “abandonment rate”, or number of visitors who leave your site quickly: another indication of click fraud can be a pattern of visitors clicking on your ad, spending the minimum amount of time on your site required by your PPC search engine to establish it as a valid click (usually 30 seconds or more), and then leaving without having left the landing page at all.</p>
<p>
A large number of impressions without follow-through clicks on your ad: if you notice many more impressions (views) of your ad than usual without a matching rise in clicks, this could indicate impression fraud.  Artificial inflation of your ad impressions may cause your clickthrough rates to drop below the Google minimum, and your ad will be disabled.  Until you realize this, your competitors have free rein to use your keywords, sometimes at bargain prices.  In addition, your relevancy ratings for search engines may drop as they record numerous impressions but no interest shown via visits to other parts of your website, which could lead to a shutdown of your campaign.</p>
<p>
Abnormally high clicks and impressions on affiliate websites: although affiliates themselves are sometimes involved in conducting click fraud schemes, they can also be its victims.  If one of their competitors uses this same method of excessive clicks and impressions on an affiliate’s site, the PPC search engine will soon notice an abnormally high payment to a certain affiliate and perhaps go as far as canceling that affiliate’s account, even though he or she was not engaging in any form of click fraud.</p>
<p>
A large number of clicks coming from countries outside your normal market area: using IP geolocation services, you can identify which country an IP address most likely belongs to.</p>
<p>
In the case of performance-based advertising, the advertiser is interested in concealing some of the traffic, not inflating it.  Since most performance-based measurement relies on beacons or pixels placed on the advertiser's conversion page, the advertiser has an incentive to (temporarily) block the traffic from the beacon or to remove it from the web site completely.</p>
<p>
Fraud is prevalent in the telecom industry.  One of the leading commercially available fraud detection products is the <a href="http://h20208.www2.hp.com/cms/solutions/ci-b/cv/frm.jsp" target="_blank">HP FMS system</a>, on which the author had the pleasure of working personally.  The types of telecom fraud include:</p>
<p>
Subscription fraud involves the acquisition of telecommunications services using stolen or false credentials and/or identity with no intention of paying.  With subscription fraud, not only do service providers lose revenue, but individual consumers are also vulnerable to having their identity stolen and their credit rating tarnished.</p>
<p>
Technical/network fraud occurs when someone uses equipment or technology to gain access to a service without paying.  Fraudulent calls are typically billed to the legitimate owner of the line or service.  Wireless examples include cloning of cell phones or subscriber identity module (SIM) cards.  Fixed-line examples include clip-on or line tapping, private branch exchange (PBX) hacking, and calling card fraud.  Prepaid services also have a large exposure to fraud through terminal tampering via magnetic strips or SIM chips, or recharging with stolen credit card numbers.</p>
<p>
Insider fraud occurs when individuals inside the operator provide fraudulent access to networks or otherwise thwart the ability of the operator to be paid for services used.</p>
<p>
Handset abuse takes place when stolen or lost handsets are used to consume telecommunications services that are in turn paid for by the service provider.  This is an expensive liability for carriers, who absorb the costs.</p>
<p>
Social engineering is an effective fraud technique in which people unwittingly help perpetrators by providing sensitive data, granting illicit access, or simply forwarding their calls without ever knowing they have done anything wrong.</p>
<p>
All these patterns can be detected with special MapReduce pattern detection techniques.  Flume offers low-latency stream processing capabilities.</p>
<p>
Needless to say, fraudsters also explore the potential market and invent new ways to commit fraud.  One of them is advertised by <a href="http://www.clickmonkeys.com/about" target="_blank">Click Monkeys</a>, which deploys a vessel with animals off the coast of California to generate seemingly random traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hadoop Administrator Training Comes to London</title>
		<link>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/</link>
		<comments>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 15:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4417</guid>
		<description><![CDATA[Cloudera’s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Cloudera’s <a href="http://www.eventbrite.com/directory?q=cloudera&amp;loc=london&amp;page=1">Hadoop Training and Certification</a> for <a href="http://www.eventbrite.com/event/762684209">System Administrators</a> has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.</p>
<p style="text-align: justify">Enter the code &#8220;london_10pct&#8221; when <a href="http://www.eventbrite.com/event/762684209">registering</a> and receive a 10% discount!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/what-is-hadoop/"><img class="size-medium wp-image-4448 aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hadoop+elephant_rgb-300x107.png" alt="" width="370" height="130" /></a></p>
<p style="text-align: justify">Hadoop is a rapidly growing field. Prove your expertise by attaining certification from the world’s foremost Hadoop training and consulting company.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
