<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; MapReduce</title>
	<atom:link href="http://www.cloudera.com/blog/tag/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Indexing Files via Solr and Java MapReduce</title>
		<link>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/</link>
		<comments>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 13:00:25 +0000</pubDate>
		<dc:creator>Adam Smieszny</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[cloudera manager]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13314</guid>
		<description><![CDATA[Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one. What follows is my bare-bones tutorial on getting Solr up and running to index each [...]]]></description>
			<content:encoded><![CDATA[<p>Several weeks ago, I set about to demonstrate the ease with which <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank">Solr</a> and <a title="Cloudera Distribution Including Apache Hadoop" href="http://www.cloudera.com/hadoop/" target="_blank">Map/Reduce</a> can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.</p>
<p>What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to <a title="Sematext - Solr experts." href="http://sematext.com/" target="_blank">Sematext</a> for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.</p>
<h2 style="font-size:13pt">First things first</h2>
<p>The way that I got started was by instantiating a new CentOS 6 Virtual Machine. You can pick a different flavor of Linux if that suits you; Hadoop <em>should</em> work fine on any (though advocated distros are SuSE, Ubuntu/Debian, RedHat/CentOS).</p>
<p>If you are fine with CentOS and want to skip some of the manual labor here, you can download a pre-loaded Virtual Machine from the <a title="Cloudera Downloads" href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank">Cloudera Downloads section</a>, that already includes an installation of Sun Java 6u21 and CDH3u3. You can then skip ahead to installing Solr and downloading sample data as outlined below.</p>
<p>If you are proceeding with a new (virtual) machine, then proceed as follows: make sure to disable SELinux, if applicable, and enable sshd. For CentOS 6, that was done with the following commands:</p>
<pre class="code">[user@localhost ~]$ sudo chkconfig --levels 2345 sshd on
[user@localhost ~]$ sudo /etc/init.d/sshd start
[user@localhost ~]$ sudo vim /etc/selinux/config [set SELINUX to disabled]</pre>
<p style="padding-top:10px">On that machine, download and install:</p>
<ul>
<li><strong>Java</strong> &#8211; I&#8217;d recommend Java 6u26 as it has been tested with CDH<br /> <a href="http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html">http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html<br /> </a>Oracle Java never seems to play nicely with the /etc/alternatives system (in my experience), so I force it to be the preferred JRE the old-fashioned way:</li>
</ul>
<pre class="code">[user@localhost ~]$ sudo rm /usr/bin/java
[user@localhost ~]$ sudo ln -s /usr/java/jdk1.6.0_26/jre/bin/java \
 /usr/bin/java</pre>
<ul style="padding-top:10px">
<li><strong>Solr</strong> &#8211; download and unzip/untar in whatever directory you like. For the purpose of this article, I&#8217;ll refer to it as &lt;solr_install_dir&gt;<br /> <a href="http://lucene.apache.org/solr/downloads.html">http://lucene.apache.org/solr/downloads.html</a></li>
<li><strong>Hadoop</strong> &#8211; I am, obviously, biased. But my recommendation would be to use Cloudera Manager (free up to 50 nodes) to set up your VM as a <a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html" target="_blank">pseudo-distributed cluster</a><br /> <a href="http://www.cloudera.com/products-services/tools/">http://www.cloudera.com/products-services/tools/</a></li>
<li><strong>Sample data</strong> &#8211; Complete works of William Shakespeare. I&#8217;d recommend unzipping into a single directory. I&#8217;ll refer to it as &lt;shakespeare&gt;<br /> <a href="http://www.ipl.org/div/shakespeare/">http://www.ipl.org/div/shakespeare/</a></li>
</ul>
<p>You can validate that all of the pieces are installed and running correctly by doing the following:</p>
<ul>
<li><strong>Java</strong></li>
</ul>
<pre class="code">[user@localhost ~]$ java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)</pre>
<ul style="padding-top:10px">
<li><strong>Solr</strong></li>
</ul>
<pre class="code">[user@localhost ~]$ cd &lt;solr_install_dir&gt;/example
[user@localhost ~]$ java -jar start.jar</pre>
<p style="padding-top:10px">The goal is to get up and running quickly here, so I am opting to use the Solr example configuration. Worth noting also that when run in this manner, the Solr server will be started with the default JVM heap size &#8211; which I believe to be the smaller of {1/4 system memory or 1GB}.</p>
<p>Now, you should be able to access the Solr administration GUI (one of the niceties of Solr!) via a web browser inside your VM with the address: http://localhost:8983/solr/admin</p>
<ul>
<li><strong>Hadoop</strong></li>
</ul>
<p>You can validate that Hadoop is installed and running successfully by navigating in your VM&#8217;s browser to: http://localhost:7180, logging in as admin/admin, and seeing the following:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/03/healthy_hadoop_scm.png"><img class="alignnone size-full wp-image-13316" src="http://www.cloudera.com/wp-content/uploads/2012/03/healthy_hadoop_scm.png" alt="A Healthy Hadoop (pseudo)cluster" width="899" height="295" /></a></p>
<p>You probably want to export the client config XML files (this can be done with a single click via Cloudera Manager &#8211; see the Generate Client Configuration button), copy them to /usr/lib/hadoop/conf, and then copy the sample text into HDFS:</p>
<pre class="code">[user@localhost ~]$ hadoop fs -put &lt;shakespeare&gt; shakespeare</pre>
<h2 style="padding-top:10px;font-size:13pt">Creating the indexing code</h2>
<p>I have some history with Lucene from a past life, so the high level functionality of Solr was familiar to me. In a nutshell, you index files within Java code by creating a SolrInputDocument, which represents a single entity to index &#8211; a file or document generally &#8211; and using the .addField() to attach fields to this document that you&#8217;d later like to search.</p>
<p>The driver code for the indexer is very simple, in that it takes input file path(s) off the command line, and runs the mapper on the files that it finds. Note that it will accept a directory, and parse all of the files that it finds within.</p>
<pre class="code">public class IndexDriver extends Configured implements Tool {     

  public static void main(String[] args) throws Exception {
    //TODO: Add some checks here to validate the input path
    int exitCode = ToolRunner.run(new Configuration(),
     new IndexDriver(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = <strong>new</strong> JobConf(getConf(), IndexDriver.<strong>class</strong>);
    conf.setJobName("Index Builder - Adam S @ Cloudera");
    conf.setSpeculativeExecution(<strong>false</strong>);

    // Set Input and Output paths
    FileInputFormat.<em>setInputPaths</em>(conf, <strong>new</strong> Path(args[0].toString()));
    FileOutputFormat.<em>setOutputPath</em>(conf, <strong>new</strong> Path(args[1].toString()));
    // Use TextInputFormat
    conf.setInputFormat(TextInputFormat.<strong>class</strong>);

    // Mapper has no output
    conf.setMapperClass(IndexMapper.<strong>class</strong>);
    conf.setMapOutputKeyClass(NullWritable.<strong>class</strong>);
    conf.setMapOutputValueClass(NullWritable.<strong>class</strong>);
    conf.setNumReduceTasks(0);
    JobClient.<em>runJob</em>(conf);
    <strong>return</strong> 0;
  }
}</pre>
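<p style="padding-top:10px">The TODO in <code>main</code> could be filled in along the following lines. This is only a minimal sketch and not part of the original job: the class name <code>ArgCheck</code> and the method <code>validate</code> are hypothetical, and it assumes the job takes exactly two arguments (input path, then output path), as the driver above does.</p>
<pre class="code">public class ArgCheck {
  // Hypothetical helper: returns null when the arguments look usable,
  // otherwise a message that main() could print before exiting.
  public static String validate(String[] args) {
    if (args == null || args.length != 2) {
      return "Usage: IndexDriver INPUT_PATH OUTPUT_PATH";
    }
    for (String arg : args) {
      if (arg == null || arg.trim().isEmpty()) {
        return "Input and output paths must not be empty";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String problem = validate(args);
    System.out.println(problem == null ? "Arguments look fine" : problem);
  }
}</pre>
<p style="padding-top:10px">A check like this only looks at the argument list; whether the input path actually exists in HDFS would still have to be verified separately.</p>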
<p style="padding-top:10px">The Map code is where things get more interesting. A couple notes before we proceed:</p>
<p><em>Solr servers may be used in 2 ways:</em></p>
<ol>
<li>Via embedding a Solr server object within your Java code using EmbeddedSolrServer</li>
<li>Via HTTP requests, using the class CommonsHttpSolrServer with a URL (in our case, http://localhost:8983/solr)</li>
</ol>
<p>In what follows, I elected to go with the <a title="JavaDoc for SUSS" href="http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html" target="_blank">StreamingUpdateSolrServer</a> &#8211; which is a subclass of CommonsHttpSolrServer. More comments on that towards the end.</p>
<p><em>I will assume now that the reader has some familiarity with the Map/Reduce programming paradigm.</em> The salient points for us here are that we will use a Map-only job to read through each file in the input that we provide, and index our chosen fields. Taking the path of least resistance, I used the fact that each line of text is its own Key/Value pair if we read the input as TextInputFormat, and I chose to index the following fields:</p>
<ol>
<li>As a unique identifier for each word, I concatenated the filename, line offset (conveniently provided to the Map code as the &#8220;Key&#8221; because we are using TextInputFormat), and the position on that line of the word</li>
<li>The word itself</li>
</ol>
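<p>To make the identifier scheme concrete, here is a small standalone sketch (plain Java, no Hadoop or Solr required) that builds identifiers of that shape for one line of input. The class name <code>WordIds</code> and the sample values are made up for illustration; in the real job the offset is the key handed to the mapper.</p>
<pre class="code">import java.util.StringTokenizer;

public class WordIds {
  // Builds one id per word: "fileName offset positionOnLine".
  public static String[] idsFor(String fileName, long offset, String line) {
    StringTokenizer st = new StringTokenizer(line);
    String[] ids = new String[st.countTokens()];
    int position = 0;
    while (st.hasMoreTokens()) {
      st.nextToken(); // the word itself would go in the "text" field
      ids[position] = fileName + " " + offset + " " + position;
      position++;
    }
    return ids;
  }

  public static void main(String[] args) {
    for (String id : idsFor("troilusandcressida", 0L, "TROILUS AND CRESSIDA")) {
      System.out.println(id);
    }
  }
}</pre>
<p style="padding-top:10px">Because the offset of a line is unique within a file, and the position is unique within a line, the combination is unique across the whole input as long as file names do not repeat.</p>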
<p><em>The Solr server obeys field definitions (specifying field names, data types, uniqueness, etc.) as dictated by a schema file.</em> For this example, running Solr as indicated above, the schema is defined by &lt;solr_install_dir&gt;/example/solr/conf/schema.xml</p>
<p>Per my choice to index 2 distinct fields, the relevant fields in the schema are:</p>
<pre class="code">&lt;field name="id" type="string" indexed="true"
 stored="true" required="true" /&gt;
&lt;field name="text" type="text_general" indexed="true"
 stored="true" multiValued="true"/&gt;</pre>
<p style="padding-top:10px">Without further ado, then, the code looks like the following:</p>
<pre class="code">public class IndexMapper extends MapReduceBase implements
 Mapper &lt;LongWritable, Text, NullWritable, NullWritable&gt; {
  <strong>private</strong> StreamingUpdateSolrServer server = null;
  <strong>private</strong> SolrInputDocument thisDoc = new SolrInputDocument();
  <strong>private</strong> String fileName;
  <strong>private</strong> StringTokenizer st = null;
  <strong>private</strong> int lineCounter = 0;

  @Override
  <strong>public</strong> <strong>void</strong> configure(JobConf job) {
    String url = "http://localhost:8983/solr";
    fileName = job.get("map.input.file").substring(
      (job.get("map.input.file")).lastIndexOf(
      System.getProperty("file.separator")) +1);
      <strong>try</strong> {
        server = <strong>new</strong> StreamingUpdateSolrServer(url, 100, 5);
      } <strong>catch</strong> (MalformedURLException e) {
        e.printStackTrace();
      }
  }

  @Override
  <strong>public</strong> <strong>void</strong> map(LongWritable key, Text val,
   OutputCollector &lt;NullWritable, NullWritable&gt; output,
   Reporter reporter) <strong>throws</strong> IOException {

    st = <strong>new</strong> StringTokenizer(val.toString());
    lineCounter = 0;
    <strong>while</strong> (st.hasMoreTokens()) {
      thisDoc = <strong>new</strong> SolrInputDocument();
      thisDoc.addField("id", fileName + " "
       + key.toString() + " " + lineCounter++);
      thisDoc.addField("text", st.nextToken());
      <strong>try</strong> {
        server.add(thisDoc);
      } <strong>catch</strong> (SolrServerException e) {
        e.printStackTrace();
      } <strong>catch</strong> (IOException e) {
        e.printStackTrace();
      }
    }
  }

  @Override
  <strong>public</strong> <strong>void</strong> close() <strong>throws</strong> IOException {
    <strong>try</strong> {
      server.commit();
    } <strong>catch</strong> (SolrServerException e) {
      e.printStackTrace();
    }
  }
}</pre>
<p style="padding-top:10px">Compile the code however you see fit (I am old school and still use ant), and the job is ready to run!</p>
<p>To index all of the comedies, you can run the job with the compiled jar file as follows. Note that you must tell hadoop to include an additional Solr jar at runtime:</p>
<pre class="code">[user@localhost SolrTest]$ hadoop jar solrtest.jar \
 -libjars &lt;solr_install_dir&gt;/dist/apache-solr-solrj-3.5.0.jar \
 shakespeare/comedies shakespeare_output</pre>
<p style="padding-top:10px">If you then query the Solr server (via the web GUI at http://localhost:8983/solr/admin, the default search is *:* which works well for a quick test) you should see something like the following:</p>
<pre class="code">&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;35&lt;/int&gt;
  &lt;lst name="params"&gt;
    &lt;str name="indent"&gt;on&lt;/str&gt;
    &lt;str name="start"&gt;0&lt;/str&gt;
    &lt;str name="q"&gt;*:*&lt;/str&gt;
    &lt;str name="version"&gt;2.2&lt;/str&gt;
    &lt;str name="rows"&gt;10&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
&lt;result name="response" numFound="377452" start="0"&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 0&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;TROILUS&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 1&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;AND&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 2&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;CRESSIDA&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;</pre>
<p>&#8230;</p>
<h2 style="font-size:13pt">Further Tuning/Investigation Opportunities</h2>
<p><em>Performance Implications of StreamingUpdateSolrServer &#8211; possibility of using EmbeddedSolrServer:</em> What are the optimal tuning parameters for number of threads and batch size when using StreamingUpdateSolrServer? More investigation could be done here. It is also possible to use an EmbeddedSolrServer (per the Rackspace case study in <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1330367237&amp;sr=1-1" target="_blank" title="Hadoop: The Definitive Guide">Hadoop: The Definitive Guide</a>), though it does add some maintenance overhead to create the indexes in a distributed fashion and then later re-combine. I opted to use the StreamingUpdateSolrServer because I believe that it is simpler to get up and running in a small test environment.</p>
<p><em>How to minimize the memory requirements in the Map code:</em> I haven&#8217;t been a full time Java developer in many years, so there are almost certainly things that I&#8217;m missing on how to minimize the memory overhead of the objects used in the Map code. Since this is called for each line in the input, it is critical to make this code as lean as possible. One tip that I came across on this topic is to use (mutable) org.apache.hadoop.io.Text objects rather than (immutable) Strings. I avoided creating any new String objects in this example Map code, but the point is worth noting for other exercises.</p>
<h2 style="font-size:13pt">Resources that I found useful</h2>
<ul>
<li>A great primer that accomplished the indexing via Cascading:<br /> <a href="http://architects.dzone.com/articles/solr-hadoop-big-data-love">http://architects.dzone.com/articles/solr-hadoop-big-data-love</a></li>
<li>Solr Tutorial:<br /> <a href="http://lucene.apache.org/solr/tutorial.html">http://lucene.apache.org/solr/tutorial.html</a></li>
<li>Some sample code for adding, updating, deleting documents on this wiki:<br /> <a href="http://wiki.apache.org/solr/Solrj">http://wiki.apache.org/solr/Solrj</a></li>
<li><a href="http://www.cloudera.com/company/careers/" title="Cloudera Careers">My outstanding coworkers at Cloudera!</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Data Interoperability with Apache Avro</title>
		<link>http://www.cloudera.com/blog/2011/07/avro-data-interop/</link>
		<comments>http://www.cloudera.com/blog/2011/07/avro-data-interop/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 19:13:37 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8075</guid>
		<description><![CDATA[The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by [...]]]></description>
			<content:encoded><![CDATA[<p>The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components.  Data collected by Flume might be analyzed by Pig and Hive scripts.  Data imported with Sqoop might be processed by a MapReduce program.  To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.</p>
<h1>Data Interoperability</h1>
<p>One might address this data interoperability in a variety of manners, including the following:</p>
<ul>
<li>Each system might be extended to read all the formats generated by the other systems.  In the limit, this approach is not practical, since one cannot easily anticipate all of the formats new systems might generate.</li>
<li>A library of data conversion programs could be assembled. This would unfortunately add a processing step, to convert the data between formats, slowing processing pipelines.  Note however that many data conversion libraries operate by converting data into and out of a <em>lingua franca</em> format, using a single format as a pivot point. &#160;This hints at a third possibility.</li>
<li>Enable each system to read and write a common format. &#160;Some systems might use other formats internally for performance, but whenever data is meant to be accessible to other systems a common format is used.</li>
</ul>
<p>In practice all of these strategies will be used to some extent.  However, the last strategy, a common format, seems to offer the most efficient path both in terms of engineering effort and processing time.  This article will focus on the use of Avro&#8217;s data file format as such a common format.</p>
<h1>Avro</h1>
<p>Apache&#160;<a href="http://avro.apache.org/">Avro</a> is a data serialization format.  Avro shares many features with Google&#8217;s Protocol Buffers and Apache Thrift, including:</p>
<ul>
<li>Rich data types.</li>
<li>Fast, compact serialization.</li>
<li>Support for many programming languages.</li>
<li>Datatype evolution, also known as&#160;<em>versioning.</em></li>
</ul>
<p>Avro additionally provides some other features that are especially useful when storing data, namely:</p>
<ul>
<li>Avro defines a standard file format.  Avro data files are self-describing, containing the full schema for the data in the file.  Thus users can exchange Avro data files without also having to separately communicate metadata. &#160;Once an Avro data file is written, one will always be able to read it, with full datatype information, without relying on any external software or metadata repository. &#160;Avro data files also support compression, using Gzip or <a href="http://code.google.com/p/snappy/">Snappy</a> codecs. </li>
<li>Avro&#8217;s serialization is more compact.  Avro avoids storing a field identifier with each field value.  For some datasets this savings can be significant. </li>
<li>Avro implementations permit one to dynamically define new datatypes and to easily process previously unseen datatypes, without generation and loading of code.  This provides natural support for script and query languages. </li>
<li>Avro datatypes can define their sort-order, facilitating use of Avro data in MapReduce or ordered key/value stores. </li>
</ul>
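<p>The point about compactness can be illustrated with a toy comparison. The sketch below does not use Avro and does not reproduce Avro&#8217;s actual binary encoding; it only uses <code>java.io.DataOutputStream</code> to contrast writing a field identifier (here, for simplicity, the field name) next to every value with writing the values alone in an order fixed by a schema. The class name <code>FieldTagCost</code> and the sample record are hypothetical.</p>
<pre class="code">import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FieldTagCost {
  // Writes each value preceded by its field name (a self-tagged record).
  public static int taggedSize(String name, int age) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF("name");
    out.writeUTF(name);
    out.writeUTF("age");
    out.writeInt(age);
    out.flush();
    return bytes.size();
  }

  // Writes only the values; the reader is assumed to know the
  // field order and types from a schema stored once elsewhere.
  public static int schemaOrderedSize(String name, int age) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(name);
    out.writeInt(age);
    out.flush();
    return bytes.size();
  }

  public static void main(String[] args) throws IOException {
    System.out.println("tagged:         " + taggedSize("alice", 30) + " bytes");
    System.out.println("schema-ordered: " + schemaOrderedSize("alice", 30) + " bytes");
  }
}</pre>
<p>For a record this small the identifiers account for half of the bytes. Real formats use much shorter identifiers than a full field name, so the saving in practice depends on the dataset.</p>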
<h1>Avro as a Common Format</h1>
<p>Most of the major ecosystem components already support, or will soon support, reading and writing Avro data files:</p>
<ul>
<li>MapReduce: I added support for Java MapReduce programs, <a href="http://s.apache.org/o6">included</a> in Avro 1.4 and greater.</li>
<li><a href="http://hadoop.apache.org/common/docs/current/streaming.html">Streaming</a>: Tom White from Cloudera has added support for Hadoop Streaming programs to Avro (<a href="https://issues.apache.org/jira/browse/AVRO-808">AVRO-808</a> &amp;&#160;<a href="https://issues.apache.org/jira/browse/AVRO-830">AVRO-830</a>).</li>
<li><a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/">Flume</a> 0.9.2 and above support collecting data in Avro&#8217;s format (<a href="https://issues.apache.org/jira/browse/FLUME-133">FLUME-133</a>), contributed by Jon Hsieh of Cloudera. &#160;Note also that Flume has recently been accepted into the Apache Incubator and will soon be known as Apache Flume.</li>
<li><a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> 1.3 can import data as Avro data files in HDFS from a relational database (<a href="https://issues.cloudera.org/browse/SQOOP-207">SQOOP-207</a>), contributed by Tom White of Cloudera. &#160;Sqoop has also recently been accepted into the Apache Incubator.</li>
<li><a href="http://pig.apache.org/">Pig</a> release 0.9 will be able to read and write Avro data files (<a href="https://issues.apache.org/jira/browse/PIG-1748">PIG-1748</a>), thanks to Lin Guo and Jakob Homan at LinkedIn. </li>
<li><a href="http://hive.apache.org/">Hive</a> support for reading and writing Avro data files has been <a href="https://github.com/jghoman/haivvreo#readme">posted</a> by Jakob Homan of LinkedIn, and should hopefully be included in Hive 0.9 (<a href="https://issues.apache.org/jira/browse/HIVE-895">HIVE-895</a>). </li>
<li><a href="http://incubator.apache.org/hcatalog/">HCatalog</a> input and output drivers have been contributed by Tom White of Cloudera (<a href="https://issues.apache.org/jira/browse/HCATALOG-49">HCATALOG-49</a>).</li>
<li>Thiruvalluvan M. G.&#160;from Yahoo! is working on a column-major format for Avro, which would accelerate Hive and Pig queries (<a href="https://issues.apache.org/jira/browse/AVRO-806">AVRO-806</a>).</li>
</ul>
<p>For folks who are currently using Protocol Buffers or Thrift to store data, some tools for conversion are planned:</p>
<ul>
<li>Raghu Angadi from Twitter is working on tools that will let folks     read and write their Thrift-defined data structures as Avro format data (<a href="https://issues.apache.org/jira/browse/AVRO-804">AVRO-804</a>).</li>
<li>We also hope to soon add tools to convert between Protocol Buffers and Avro (<a href="https://issues.apache.org/jira/browse/AVRO-805">AVRO-805</a>).</li>
</ul>
<p>At Cloudera we&#8217;re committed to helping Avro become a common format for the Hadoop ecosystem. &#160;It&#8217;s great to see so many other companies and individuals also investing in Avro.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/avro-data-interop/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How to Include Third-Party Libraries in Your Map-Reduce Job</title>
		<link>http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/</link>
		<comments>http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 15:29:33 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[administration]]></category>
		<category><![CDATA[libraries]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6005</guid>
		<description><![CDATA[&#8220;My library is in the classpath but I still get a Class Not Found exception in a MapReduce job&#8221; &#8211; If you have this problem this blog is for you. Java requires third-party and user-defined classes to be on the command line&#8217;s &#8220;-classpath&#8221; option when the JVM is launched. The `hadoop` wrapper shell script does [...]]]></description>
			<content:encoded><![CDATA[<p>&#8220;My library is in the classpath but I still get a Class Not Found exception in a MapReduce job&#8221; &#8211; If you have this problem this blog is for you.</p>
<p>Java requires third-party and user-defined classes to be on the command line&#8217;s &#8220;<a href="http://download.oracle.com/javase/6/docs/technotes/tools/solaris/classpath.html" target="_blank">-<em>classpath</em></a>&#8221; option when the JVM is launched.   The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in <em>/usr/lib/hadoop-0.20/</em> and <em>/usr/lib/hadoop-0.20/lib/</em> directories.  However, with MapReduce your job&#8217;s task attempts are executed on remote nodes.  How do you tell a remote machine to include third-party and user-defined classes?</p>
<p>Map-Reduce jobs are executed in separate JVMs on TaskTrackers and sometimes you need to use third-party libraries in the map/reduce task attempts.  For example, you might want to access HBase from within your map tasks.  One way to do this is to package every class used in the submittable JAR.  You will have to unpack the original <code>hbase-&lt;version&gt;.jar</code> and repackage all the classes in your submittable Hadoop jar.  Not good.  Don&#8217;t do this: The version compatibility issues are going to bite you sooner or later.</p>
<p>There are better ways of doing the same by either putting your jar in distributed cache or installing the whole JAR on the Hadoop nodes and telling TaskTrackers about their location.</p>
<p>1. Include the JAR in the &#8220;<em>-libjars</em>&#8221; command line option of the `hadoop jar &#8230;` command.  The jar will be placed in <a href="http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#DistributedCache">distributed cache</a> and will be made available to all of the job&#8217;s task attempts.  More specifically, you will find the JAR in one of  the <em>${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/&#8230;</em> subdirectories on local nodes.  The advantage of the distributed cache is that your jar might still be there on your next program run (at least in theory:  The files should be kicked out of the distributed cache only when they exceed the soft limit defined by the <em>local.cache.size</em> configuration variable, which defaults to 10GB, but your actual mileage can vary, particularly with the newest security enhancements).  Hadoop keeps track of the changes to the distributed cache files by examining their modification timestamp.</p>
<p><em>*Update to post: Please note that items 2 and 3 below are deprecated starting with CDH4 and will no longer be supported starting with CDH5.</em></p>
<p>2. Include the referenced JAR in the lib subdirectory of the submittable JAR: A MapReduce job will unpack the JAR from this subdirectory into <em>${mapred.local.dir}/taskTracker/${user.name}/jobcache/$jobid/jars</em> on the TaskTracker nodes and point your tasks to this directory to make the JAR available to your code.  If the JARs are small, change often, and are job-specific this is the preferred method.</p>
<p>3. Finally, you can install the JAR on the cluster nodes.  The easiest way is to place the JAR into <em>$HADOOP_HOME/lib</em> directory as everything from this directory is included when a Hadoop daemon starts.  However, since you know that only TaskTrackers will need the new JAR, a better way is to modify HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file.  This method is preferred if the JAR is tied to the code running on the nodes, like HBase.</p>
<pre class="code">HADOOP_TASKTRACKER_OPTS="-classpath &lt;colon-separated-paths-to-your-jars&gt;"</pre>
<p>Restart the TaskTrackers when you are done.  Do not forget to update the jar when the underlying software changes.</p>
<p>All of the above options affect only the code running on the distributed nodes.  If your code that launches the Hadoop job uses the same library, you need to include the JAR in the HADOOP_CLASSPATH environment variable as well:</p>
<pre class="code">HADOOP_CLASSPATH="&lt;colon-separated-paths-to-your-jars&gt;"</pre>
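<p>If you have many JARs, the colon-separated value can be generated rather than typed by hand. A minimal sketch in plain Java follows; the class name <code>ClasspathBuilder</code> and the directory argument are hypothetical, and the code simply joins every <code>.jar</code> file found in one directory using the platform&#8217;s path separator (a colon on Linux).</p>
<pre class="code">import java.io.File;
import java.util.Arrays;

public class ClasspathBuilder {
  // Returns the paths of all .jar files in dir, joined by the path separator.
  public static String jarsIn(File dir) {
    String[] names = dir.list();
    if (names == null) {
      return ""; // not a directory, or not readable
    }
    Arrays.sort(names); // keep the order stable between runs
    StringBuilder cp = new StringBuilder();
    for (String name : names) {
      if (!name.endsWith(".jar")) {
        continue;
      }
      if (cp.length() != 0) {
        cp.append(File.pathSeparator);
      }
      cp.append(new File(dir, name).getPath());
    }
    return cp.toString();
  }

  public static void main(String[] args) {
    // With no argument, the current directory is scanned.
    File dir = new File(args.length == 0 ? "." : args[0]);
    System.out.println(jarsIn(dir));
  }
}</pre>
<p>The printed string could then be assigned to HADOOP_CLASSPATH in the shell that launches the job.</p>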
<p>Note that starting with Java 1.6 classpath can point to directories like &#8220;<em>/path/to/your/jars/*</em>&#8221; which will pick up all JARs from the given directory.</p>
<p>The same guiding principles apply to native code libraries that need to be run on the nodes (JNI or C++ pipes).  You can put them into distributed cache with the &#8220;<em>-files</em>&#8221; options, include them into archive files specified with the &#8220;<em>-archives</em>&#8221; option, or install them on the cluster nodes.  If the dynamic library linker is configured properly the native code should be made available to your task attempts.  You can also modify the environment of the job&#8217;s running task attempts explicitly by specifying JAVA_LIBRARY_PATH or LD_LIBRARY_PATH variables:</p>
<pre class="code">hadoop jar &lt;your jar&gt; [main class] \
      -D mapred.child.env="LD_LIBRARY_PATH=/path/to/your/libs" ...</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Map-Reduce With Ruby Using Apache Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/#comments</comments>
		<pubDate>Wed, 05 Jan 2011 14:00:43 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdh3b2]]></category>
		<category><![CDATA[cloudera's distribution for hadoop]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5775</guid>
		<description><![CDATA[Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC. Here I demonstrate, with repeatable steps, how to fire-up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will&#160;not [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC.</strong></em></p>
<p><img class="alignleft size-full wp-image-605" style="float: left; margin-right: 10px; margin-top: 5px; border: none;" title="hadoop-ruby" src="http://www.philwhln.com/wp-content/uploads/2010/12/hadoop-ruby.png" alt="Map-Reduce With Hadoop Using Ruby" width="202" height="189" /><br />
 Here I demonstrate, with repeatable steps, how to fire-up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will&#160;<em>not</em> need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.</p>
<p><br class="spacer_" /></p>
<ul style="margin-left: 20px; margin-top: 1px; margin-bottom: 1px;">
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#fire-up-your-hadoop-cluster">Fire-Up Your Hadoop Cluster</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#setting-up-your-local-hadoop-client">Setting Up Your Local Hadoop Client</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#defining-the-map-reduce-task">Defining The Map-Reduce Task</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#uploading-your-data-to-hdfs">Uploading Your Data To HDFS (Hadoop Distributed FileSystem)</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#coding-your-map-and-reduce-scripts-in-ruby">Coding Your Map And Reduce Scripts in Ruby</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#running-the-hadoop-job">Running The Hadoop Job</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#the-results">The Results</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#conclusion">Conclusion</a></li>
<li><a href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop#resources">Resources</a></li>
</ul>
<p><a name="fire-up-your-hadoop-cluster"></a></p>
<h2>Fire-Up Your Hadoop Cluster</h2>
<p>I chose <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.cloudera.com']);" href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution for Apache Hadoop</a>, which is 100% Apache licensed, but has some additional benefits. One of these benefits is that it is released by <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','en.wikipedia.org']);" href="http://en.wikipedia.org/wiki/Doug_Cutting">Doug Cutting</a>, who started Hadoop and drove its development at Yahoo! He also started <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','lucene.apache.org']);" href="http://lucene.apache.org/">Lucene</a>, which is another of my favourite Apache Projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire-up a Hadoop cluster.</p>
<p>I am going to use Cloudera&#8217;s <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','archive.cloudera.com']);" href="http://archive.cloudera.com/cdh/3/whirr/">Whirr script</a>, which will allow me to fire up a production ready Hadoop cluster on <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','aws.amazon.com']);" href="http://aws.amazon.com/ec2/">Amazon EC2</a> directly from my laptop. Whirr is built on <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','code.google.com']);" href="http://code.google.com/p/jclouds/">jclouds</a>, meaning other cloud providers should be supported, but only Amazon EC2 has been tested. Once we have Whirr installed, we will configure a <em>hadoop.properties</em> file with our Amazon EC2 credentials and the details of our desired Hadoop cluster. Whirr will use this <em>hadoop.properties</em> file to build the cluster.</p>
<p>If you are on Debian or Redhat you can use either apt-get or yum to install whirr, but since I&#8217;m on Mac OS X, I&#8217;ll need to <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.apache.org']);" href="http://www.apache.org/dyn/closer.cgi/incubator/whirr/">download the Whirr script</a>.</p>
<p>The current version of Whirr (0.2.0), hosted on the Apache Incubator site, is not compatible with Cloudera&#8217;s Distribution for Hadoop (CDH), so I am downloading <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','archive.cloudera.com']);" href="http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz">version 0.1.0+23</a> instead.</p>
<p>
<pre class="code">mkdir -p ~/src/cloudera
cd ~/src/cloudera
wget http://archive.cloudera.com/cdh/3/whirr-0.1.0+23.tar.gz
tar -xvzf whirr-0.1.0+23.tar.gz</pre>
</p>
<p>To build Whirr you&#8217;ll need to install Java (version 1.6), <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','maven.apache.org']);" href="http://maven.apache.org/download.html">Maven</a> ( >= 2.2.1) and <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.ruby-lang.org']);" href="http://www.ruby-lang.org/en/">Ruby</a> ( >= 1.8.7). If you&#8217;re running with the latest Mac OS X, then you should have the latest Java and I&#8217;ll assume, due to the title of this post, that you can manage the Ruby version. If you are not familiar with Maven, you can install it via Homebrew on Mac OS X using the brew command below. On Debian use <em>apt-get install maven2</em>.</p>
<p>
<pre class="code">brew update
brew install maven</pre>
</p>
<p>Once the dependencies are installed we can build the whirr tool.</p>
<p>
<pre class="code">cd whirr-0.1.0+23
mvn clean install
mvn package -Ppackage</pre>
</p>
<p>In true Maven style, it will download a long list of dependencies the first time you build this. Be patient.</p>
<p>Ok, it should be built now and if you&#8217;re anything like me, you would have used the time to get a nice cuppa tea or a sandwich. Let&#8217;s sanity check the whirr script&#8230;</p>
<p>
<pre class="code">bin/whirr version</pre>
</p>
<p>You should see something like &#8220;Apache Whirr 0.1.0+23&#8221; output to the terminal.</p>
<p>Create a <em>hadoop.properties</em> file with the following content.</p>
<p>
<pre class="code">whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=
whirr.credential=
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure</pre>
</p>
<p>Set <em>whirr.identity</em> and <em>whirr.credential</em> to your Amazon EC2 Access Key ID and Amazon EC2 Secret Access Key respectively (I will not tell you what mine are).</p>
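<p>If you would rather not type your keys into the file by hand, the properties can be generated instead. The following is a minimal Ruby sketch and not part of Whirr: the function name <em>whirr_properties</em> and the environment variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions, while the property names and values are copied from the listing above.</p>
<pre class="code"># Hypothetical generator for hadoop.properties (not part of Whirr).
# Property names and values are taken from the listing above.
def whirr_properties(identity, credential)
  [
    'whirr.service-name=hadoop',
    'whirr.cluster-name=myhadoopcluster',
    'whirr.instance-templates=1 jt+nn,1 dn+tt',
    'whirr.provider=ec2',
    'whirr.identity=' + identity,
    'whirr.credential=' + credential,
    'whirr.private-key-file=${sys:user.home}/.ssh/id_rsa',
    'whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub',
    'whirr.hadoop-install-runurl=cloudera/cdh/install',
    'whirr.hadoop-configure-runurl=cloudera/cdh/post-configure'
  ].join("\n") + "\n"
end

# Assumed environment variable names; adjust to wherever you keep your keys.
identity   = ENV.fetch('AWS_ACCESS_KEY_ID', '')
credential = ENV.fetch('AWS_SECRET_ACCESS_KEY', '')

# Print the file content; redirect STDOUT to hadoop.properties to save it.
puts whirr_properties(identity, credential)</pre>
<p>Keep in mind that the generated file still contains your secret key in plain text, so treat it like any other credential file.</p>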
<p>This configuration is a little boring with only two machines. One machine for the master and one machine for the worker. You can get more creative once you are up and running. Let&#8217;s fire up our &#8220;cluster&#8221;.</p>
<p>
<pre class="code">bin/whirr launch-cluster --config hadoop.properties</pre>
</p>
<p>This is another good time to put the kettle on, as it takes a few minutes to get up and running. If you are curious, or worried that things have come to a halt then Whirr outputs a whirr.log in the current directory. Fire-up another terminal window and tail the log.</p>
<p>
<pre class="code">cd ~/src/cloudera/whirr-0.1.0+23
tail -F whirr.log</pre>
</p>
<p>16 minutes (and several cups of tea) later the cluster is up and running. Here is the output I saw in my terminal.</p>
<p>
<pre class="code">Launching myhadoopcluster cluster
Configuring template
Starting master node
Master node started: [[id=us-east-1/i-561d073b, providerId=i-561d073b, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.113.23.123], publicAddresses=[72.44.45.199], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Authorizing firewall
Starting 1 worker node(s)
Worker nodes started: [[id=us-east-1/i-98100af5, providerId=i-98100af5, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-d59d6bbc, os=[name=null, family=amzn-linux, version=2010.11.1-beta, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2010.11.1-beta.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.116.147.148], publicAddresses=[184.72.179.36], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Completed launch of myhadoopcluster
Web UI available at http://ec2-72-44-45-199.compute-1.amazonaws.com
Wrote Hadoop site file /Users/phil/.whirr/myhadoopcluster/hadoop-site.xml
Wrote Hadoop proxy script /Users/phil/.whirr/myhadoopcluster/hadoop-proxy.sh
Started cluster of 2 instances
HadoopCluster{instances=[Instance{roles=[jt, nn], publicAddress=ec2-72-44-45-199.compute-1.amazonaws.com/72.44.45.199, privateAddress=/10.113.23.123}, Instance{roles=[tt, dn], publicAddress=/184.72.179.36, privateAddress=/10.116.147.148}], configuration={fs.default.name=hdfs://ec2-72-44-45-199.compute-1.amazonaws.com:8020/, mapred.job.tracker=ec2-72-44-45-199.compute-1.amazonaws.com:8021, hadoop.job.ugi=root,root, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory, hadoop.socks.server=localhost:6666}}</pre>
</p>
<p>Whirr has created a directory with some files in our home directory&#8230;</p>
<pre class="code">~/.whirr/myhadoopcluster/hadoop-proxy.sh
~/.whirr/myhadoopcluster/hadoop-site.xml</pre>
<p>This hadoop-proxy.sh is used to access the web interface of Hadoop securely. When we run this it will tunnel through to the cluster and give us access in the web browser via a SOCKS proxy.</p>
<p>You need to configure the SOCKS proxy in either your web browser or, in my case, the Mac OS X settings menu.</p>
<div id="attachment_508" class="wp-caption alignnone" style="width: 720px;">
<p><a rel="attachment wp-att-508" href="http://www.cloudera.com/?attachment_id=508"><img class="size-full wp-image-508" title="SOCKS Proxy Configuration" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-28-at-3.15.55-PM.png" alt="Hadoop SOCKS Proxy Configuration for Mac OS X" width="710" height="360" /></a></p>
<p class="wp-caption-text">Hadoop SOCKS Proxy Configuration for Mac OS X</p>
</div>
<p>Now start the proxy in your terminal&#8230;</p>
<p><em>(Note: There has still been no need to ssh into the cluster. Everything in this post is done on our local machine)</em></p>
<p>
<pre class="code">sh ~/.whirr/myhadoopcluster/hadoop-proxy.sh
<em>
   Running proxy to Hadoop cluster at
   ec2-72-44-45-199.compute-1.amazonaws.com.
   Use Ctrl-c to quit.</em></pre>
</p>
<p>The above will output the hostname that you can access the cluster at. On Amazon EC2 it looks something like <em>http://ec2-72-44-45-199.compute-1.amazonaws.com:50070/dfshealth.jsp</em>. Use this hostname to view the cluster in your web browser.</p>
<p>
<pre class="code">http://&lt;hostname&gt;:50070/dfshealth.jsp</pre>
</p>
<div id="attachment_502" class="wp-caption alignnone" style="width: 588px;">
<p><a rel="attachment wp-att-502" href="http://www.cloudera.com/?attachment_id=502"><img class="size-full wp-image-502" title="Screen shot 2010-12-28 at 3.50.23 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-28-at-3.50.23-PM.png" alt="dfshealth.jsp" width="578" height="643" /></a></p>
<p class="wp-caption-text">HDFS Health Dashboard</p>
</div>
<p>If you click on the link to &#8220;Browse the filesystem&#8221; then you will notice the hostname changes. This will jump around the data-nodes in your cluster, due to HDFS&#8217;s distributed nature. You currently have only one data-node. On Amazon EC2 this new hostname will be the internal hostname of the data-node server, which is visible because you are tunnelling through the SOCKS proxy.</p>
<div id="attachment_505" class="wp-caption alignnone" style="width: 670px;">
<p><a rel="attachment wp-att-505" href="http://www.cloudera.com/?attachment_id=505"><img class="size-full wp-image-505" title="Screen shot 2010-12-28 at 3.36.08 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-28-at-3.36.08-PM.png" alt="browseDirectory.jsp" width="660" height="391" /></a></p>
<p class="wp-caption-text">HDFS File Browser</p>
</div>
<p>Ok! It looks as though our Hadoop cluster is up and running. Let&#8217;s upload our data.</p>
<p><a name="setting-up-your-local-hadoop-client"></a></p>
<h2>Setting Up Your Local Hadoop Client</h2>
<p>To run a map-reduce job on your data, your data needs to be on the Hadoop Distributed File-System, otherwise known as HDFS. You can interact with Hadoop and HDFS with the <em>hadoop</em> command. We do not have Hadoop installed on our local machine, so we can either log into one of our Hadoop cluster machines and run the hadoop command from there, or install Hadoop on our local machine. I&#8217;m going to opt for installing Hadoop on my local machine (recommended), as it will be easier to interact with the HDFS and start the Hadoop map-reduce jobs directly from my laptop.</p>
<p>Cloudera does not, unfortunately, provide a release of Hadoop for Mac OS X, only Debian packages and RPMs. They do provide a .tar.gz download, which we are going to use to install Hadoop locally. Hadoop is built with Java and the scripts are written in bash, so there should not be too many problems with compatibility across platforms that can run Java and bash.</p>
<p>Visit <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','docs.cloudera.com']);" href="https://docs.cloudera.com/display/DOC/Downloading+CDH+Releases">Cloudera CDH Release</a> webpage and select <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','archive.cloudera.com']);" href="http://archive.cloudera.com/cdh/3/">CDH3 Patched Tarball</a>.  I downloaded the same version <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','archive.cloudera.com']);" href="http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz ">hadoop-0.20.2+737.tar.gz</a> that Whirr installed on the cluster.</p>
<pre class="code">tar -xvzf hadoop-0.20.2+737.tar.gz
sudo mv hadoop-0.20.2+737 /usr/local/
cd /usr/local
sudo ln -s hadoop-0.20.2+737 hadoop
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.profile
source ~/.profile
which hadoop # should output "/usr/local/hadoop/bin/hadoop"
hadoop version # should output "Hadoop 0.20.2+737 ..."
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /usr/local/hadoop/conf/</pre>
<p>Now run your first command from your local machine to interact with HDFS. The following command is similar to &#8220;ls -l /&#8221; in bash.</p>
<p>
<pre class="code">hadoop fs -ls /</pre>
</p>
<p>You should see the following output which lists the root on the Hadoop filesystem.</p>
<pre class="code">10/12/30 18:19:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
Found 4 items
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /mnt
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /tmp
drwxrwxrwx   - hdfs supergroup          0 2010-12-28 10:33 /user</pre>
<p>Yes, you will see a deprecation warning, since the hadoop-site.xml configuration has been split into multiple files. We will not worry about this here.</p>
<p><a name="defining-the-map-reduce-task"></a></p>
<h2>Defining The Map-Reduce Task</h2>
<p>We are going to write a map-reduce job that scans all the files in a given directory, takes the words found in those files and then counts how many words begin with each two-character prefix.</p>
<p>For this we&#8217;re going to use a dictionary file found on my Mac OS X machine at /usr/share/dict/words. It contains 234936 words, each on its own line. <a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','en.wikipedia.org']);" href="http://en.wikipedia.org/wiki/Words_(Unix)">Linux has a similar dictionary file</a>.</p>
<p><a name="uploading-your-data-to-hdfs"></a></p>
<h2>Uploading Your Data To HDFS (Hadoop Distributed FileSystem)</h2>
<pre class="code">hadoop fs -mkdir input
hadoop fs -put /usr/share/dict/words input/
hadoop fs -ls input</pre>
<p>You should see output similar to the following, which lists the <em>words</em> file on the remote HDFS. Since my local user is &#8220;phil&#8221;, Hadoop has added the file under /user/phil on HDFS.</p>
<p><br class="spacer_" /></p>
<p>
<pre class="code">Found 1 items
-rw-r--r--   3 phil supergroup    2486813 2010-12-30 18:43 /user/phil/input/words</pre>
</p>
<p>Congratulations! You have just uploaded your first file to the Hadoop Distributed File-System on your cluster in the cloud.</p>
<p><a name="coding-your-map-and-reduce-scripts-in-ruby"></a></p>
<h2>Coding Your Map And Reduce Scripts in Ruby</h2>
<p>Map-Reduce can actually be thought of as map-group-reduce. The &#8220;map&#8221; sucks in the raw data, cuts off the fat, removes the bones and outputs the smallest possible piece of output data for each piece of input data. The &#8220;map&#8221; also outputs the key of the data. Our key will be the two-letter prefix of each word. These keys are used by Hadoop to &#8220;group&#8221; the data together. The &#8220;reduce&#8221; then takes each group of data and &#8220;reduces&#8221; it. In our case the &#8220;reduce&#8221; will count the occurrences of each two-letter prefix.</p>
<p>Hadoop will do much of the work for us. It will recurse the input directory, open the files and stream the files one line at a time into our &#8220;map&#8221; script via STDIN. In general a map script can output zero, one or many lines to STDOUT for each line of input. Since we know that our input file has exactly one word per line, we can simplify our script and output at most one two-letter prefix for each input line (words with only one letter will not result in any output).</p>
<p>The output of our &#8220;map&#8221; script to STDOUT will have to be Hadoop friendly. This means we will output our &#8220;key&#8221;, then a tab character then our value and then a newline. This is what the streaming interface expects. Hadoop needs to extract the key to be able to sort and organise the data based on this key.</p>
<p>
<pre class="code">&lt;key&gt;&lt;tab&gt;&lt;value&gt;&lt;newline&gt;</pre>
</p>
<p>Our value will always be &#8220;1&#8221;, since each line has only one word with only one instance of the two-letter prefix of that word.</p>
<p>For instance, if the input was &#8220;Apple&#8221; then we would output the key &#8220;ap&#8221; and value &#8220;1&#8221;. We have seen the prefix &#8220;ap&#8221; only once in this input.</p>
<p>You should note that the value can be anything that your reduce script can interpret. For instance, the value could be a string of JSON. Here, we are keeping it very simple.</p>
<p>
<pre class="code">ap&lt;tab&gt;1&lt;newline&gt;</pre>
</p>
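<p>The record format is easy to wrap in two small helpers. This is only a sketch: the names <em>format_record</em> and <em>parse_record</em> are made up for illustration and are not part of Hadoop streaming. The second half shows a JSON value, as one way a value can carry more than a plain count.</p>
<pre class="code">require 'json'

# Hypothetical helpers for the key, tab, value, newline record format.

# Build one record without the trailing newline (puts adds it).
def format_record(key, value)
  key.to_s + "\t" + value.to_s
end

# Split one record into [key, value]. Only the first tab separates the
# key from the value, so the value may itself contain tabs.
def parse_record(line)
  line.chomp.split("\t", 2)
end

# A plain count, as used in this post.
puts format_record('ap', 1)

# A structured value, serialised as JSON.
record = format_record('ap', JSON.generate('count' => 1, 'word' => 'Apple'))
key, value = parse_record(record)
puts key
puts JSON.parse(value)['count']</pre>
<p>Splitting on the first tab only is a deliberate choice here: it keeps the key intact and leaves the interpretation of the value entirely to the reduce script.</p>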
<p>Let&#8217;s code up the mapper as <em>map.rb</em></p>
<p>
<pre class="code"># Ruby code for map.rb

ARGF.each do |line|

   # remove any newline
   line = line.chomp

   # do nothing with lines shorter than 2 characters
   next if line.length &lt; 2

   # grab our key as the two-character prefix (lower-cased)
   key = line[0,2].downcase

   # value is a count of 1 occurrence
   value = 1

   # output key, tab and value to STDOUT
   puts key + "\t" + value.to_s

end</pre>
</p>
<p>Now we have our mapper script, let&#8217;s write the reducer.</p>
<p>Remember, the reducer is going to count up the occurrences for each two-character prefix (our &#8220;key&#8221;). Hadoop will have already grouped our keys together, so even if the mapper output is in shuffled order, the reducer will see the keys in sorted order. This means that the reducer can watch for when the key changes and know that it has seen all of the possible values for the previous key.</p>
<p>Here is an example of the STDIN and STDOUT that map.rb and reduce.rb might see. The data flow goes from left to right.</p>
<table>
<tbody>
<tr>
<th>map.rb<br />
 STDIN</th>
<th>map.rb<br />
 STDOUT</th>
<th>Hadoop<br />
 sorts<br />
 keys</th>
<th>reduce.rb<br />
 STDIN</th>
<th>reduce.rb<br />
 STDOUT</th>
</tr>
<tr>
<td>Apple<br />
 Monkey<br />
 Orange<br />
 Banana<br />
 APR<br />
 Bat<br />
 appetite</td>
<td>ap 1<br />
 mo 1<br />
 or 1<br />
 ba 1<br />
 ap 1<br />
 ba 1<br />
 ap 1</td>
<td></td>
<td>ap 1<br />
 ap 1<br />
 ap 1<br />
 ba 1<br />
 ba 1<br />
 mo 1<br />
 or 1</td>
<td>ap 3<br />
 ba 2<br />
 mo 1<br />
 or 1</td>
</tr>
</tbody>
</table>
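<p>The columns of the table can be reproduced locally, without Hadoop, to check your understanding of the data flow. The following Ruby sketch is an in-memory simulation only: the function names <em>map_words</em>, <em>sort_pairs</em> and <em>reduce_pairs</em> are hypothetical, and the sort step merely stands in for the grouping that Hadoop performs between the map and reduce phases.</p>
<pre class="code"># Local simulation of the map, sort and reduce columns (no Hadoop involved).

# Map: emit [prefix, 1] for every word of two or more characters.
def map_words(words)
  words.map { |word| word.chomp }
       .select { |word| word.length >= 2 }
       .map { |word| [word[0, 2].downcase, 1] }
end

# Sort: order the pairs by key, standing in for Hadoop's sort step.
def sort_pairs(pairs)
  pairs.sort_by { |key, _value| key }
end

# Reduce: walk the sorted pairs and start a new total on each key change.
def reduce_pairs(sorted_pairs)
  totals = []
  sorted_pairs.each do |key, value|
    totals.push([key, 0]) if totals.empty? || totals.last[0] != key
    totals.last[1] += value
  end
  totals
end

words = %w[Apple Monkey Orange Banana APR Bat appetite]
reduce_pairs(sort_pairs(map_words(words))).each do |key, total|
  puts key + "\t" + total.to_s
end</pre>
<p>Run on the sample words from the table, this prints the same four totals as the final column: ap 3, ba 2, mo 1 and or 1.</p>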
<p>Let&#8217;s code up the reducer as <em>reduce.rb</em></p>
<p>
<pre class="code"># Ruby code for reduce.rb

prev_key = nil
key_total = 0

ARGF.each do |line|

   # remove any newline
   line = line.chomp

   # split key and value on tab character
   (key, value) = line.split(/\t/)

   # check for new key
   if prev_key &amp;&amp; key != prev_key &amp;&amp; key_total > 0

      # output total for previous key
      puts prev_key + "\t" + key_total.to_s

      # reset key total for new key
      prev_key = key
      key_total = 0

   elsif ! prev_key
      prev_key = key

   end

   # add to count for this current key
   key_total += value.to_i

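end

# output total for the final key, which is not followed by a key change
puts prev_key + "\t" + key_total.to_s if prev_key</pre>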
</p>
<p>You can test out your scripts on a small sample by using the &#8220;sort&#8221; command in place of Hadoop.</p>
<p><pre class="code">cat /usr/share/dict/words | ruby map.rb | sort | ruby reduce.rb</pre>
</p>
<p>The start of this output looks like this&#8230;</p>
<p>
<pre class="code">aa	13
ab	666
ac	1491
ad	867
ae	337
af	380</pre>
</p>
<p><a name="running-the-hadoop-job"></a></p>
<h2>Running The Hadoop Job</h2>
<p>I wrote this bash-based runner script to start the job. It uses Hadoop&#8217;s streaming service. This streaming service is what allows us to write our map-reduce scripts in Ruby. It <em>streams</em> to our script&#8217;s STDIN and reads our script&#8217;s output from our script&#8217;s STDOUT.</p>
<p>
<pre class="code">#!/bin/bash

HADOOP_HOME=/usr/local/hadoop
JAR=contrib/streaming/hadoop-streaming-0.20.2+737.jar

HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"

$HSTREAMING \
 -mapper  'ruby map.rb' \
 -reducer 'ruby reduce.rb' \
 -file map.rb \
 -file reduce.rb \
 -input '/user/phil/input/*' \
 -output /user/phil/output</pre>
</p>
<p>We specify the command to run for the mapper and reducer and use the &#8220;-file&#8221; parameter twice to attach our two Ruby scripts. It is assumed that all other dependencies are already installed on the machine. In this case we are using no Ruby imports or requires and the Ruby interpreter is already installed on the machines in the Hadoop cluster (it came with the Cloudera Amazon EC2 image). Things become more complicated when you start to run jobs with more dependencies that are not already installed on the Hadoop cluster. This is a topic for another post.</p>
<p>&#8220;-input&#8221; and &#8220;-output&#8221; specify which files to read from for input and the directory to send the output to. You can also specify a deeper level of recursion with more wildcards (e.g. &#8220;/user/phil/input/*/*/*&#8221;).</p>
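<p>Since this post is about Ruby, the runner could equally be written in Ruby. The sketch below only assembles the argument list, using the same options and paths as the bash script; the function name <em>streaming_command</em> is hypothetical, and the job is not launched unless you uncomment the final call.</p>
<pre class="code"># Hypothetical Ruby equivalent of the bash runner: it only assembles the
# argument list, using the same options as the bash script.
def streaming_command(hadoop_home, jar, mapper, reducer, files, input, output)
  command = [File.join(hadoop_home, 'bin', 'hadoop'), 'jar', File.join(hadoop_home, jar)]
  command += ['-mapper', mapper, '-reducer', reducer]
  files.each { |file| command += ['-file', file] }
  command + ['-input', input, '-output', output]
end

command = streaming_command(
  '/usr/local/hadoop',
  'contrib/streaming/hadoop-streaming-0.20.2+737.jar',
  'ruby map.rb',
  'ruby reduce.rb',
  ['map.rb', 'reduce.rb'],
  '/user/phil/input/*',
  '/user/phil/output'
)

# Show what would be run.
puts command.join(' ')

# To launch the job for real, pass the array to Kernel#system:
#   system(*command)</pre>
<p>Passing the arguments to <em>system</em> as separate array elements bypasses the local shell, so the wildcard in the input path reaches Hadoop unchanged, just as the single quotes ensure in the bash version.</p>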
<p>Once again, it is important that our SOCKS proxy is running, as this is the secure way that we communicate through to our Hadoop cluster.</p>
<p>
<pre class="code">sh ~/.whirr/myhadoopcluster/hadoop-proxy.sh
    <em>Running proxy to Hadoop cluster at ec2-72-44-45-199.compute-1.amazonaws.com. Use Ctrl-c to quit.</em></pre>
</p>
<p>Now we can start the Hadoop job by running our above bash script. Here is the output the script gave me at the terminal.</p>
<p>
<pre class="code">packageJobJar: [map.rb, reduce.rb, /tmp/hadoop-phil/hadoop-unjar3366245269477540365/] [] /var/folders/+Q/+QReZ-KsElyb+mXn12xTxU+++TI/-Tmp-/streamjob5253225231988397348.jar tmpDir=null
10/12/30 21:45:32 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/30 21:45:37 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-phil/mapred/local]
10/12/30 21:45:37 INFO streaming.StreamJob: Running job: job_201012281833_0001
10/12/30 21:45:37 INFO streaming.StreamJob: To kill this job, run:
10/12/30 21:45:37 INFO streaming.StreamJob: /usr/local/hadoop/bin/hadoop job  -Dmapred.job.tracker=ec2-72-44-45-199.compute-1.amazonaws.com:8021 -kill job_201012281833_0001
10/12/30 21:45:37 INFO streaming.StreamJob: Tracking URL: http://ec2-72-44-45-199.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201012281833_0001
10/12/30 21:45:38 INFO streaming.StreamJob:  map 0%  reduce 0%
10/12/30 21:45:55 INFO streaming.StreamJob:  map 42%  reduce 0%
10/12/30 21:45:58 INFO streaming.StreamJob:  map 100%  reduce 0%
10/12/30 21:46:14 INFO streaming.StreamJob:  map 100%  reduce 88%
10/12/30 21:46:19 INFO streaming.StreamJob:  map 100%  reduce 100%
10/12/30 21:46:22 INFO streaming.StreamJob: Job complete: job_201012281833_0001
10/12/30 21:46:22 INFO streaming.StreamJob: Output: /user/phil/output</pre>
</p>
<p>This is reflected if you visit the job tracker console in your web browser.</p>
<div id="attachment_577" class="wp-caption alignnone" style="width: 996px;">
<p><a rel="attachment wp-att-577" href="http://www.cloudera.com/?attachment_id=577"><img class="size-full wp-image-577" title="Screen shot 2010-12-30 at 10.12.46 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-30-at-10.12.46-PM.png" alt="jobTracker after successful run" width="986" height="783" /></a></p>
<p class="wp-caption-text">jobTracker after successful run</p>
</div>
<p>If you click on the job link you can see lots of information on this job. This job is completed in these images, but with a longer running job you would see the progress as the job runs. I have split the job tracker page into the following three images.</p>
<div id="attachment_578" class="wp-caption alignnone" style="width: 762px;">
<p><a rel="attachment wp-att-578" href="http://www.cloudera.com/?attachment_id=578"><img class="size-full wp-image-578" title="Screen shot 2010-12-30 at 10.15.55 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-30-at-10.15.55-PM.png" alt="Map-Reduce Job Tracker Page (part 1)" width="752" height="361" /></a></p>
<p class="wp-caption-text">Map-Reduce Job Tracker Page (part 1)</p>
</div>
<div id="attachment_579" class="wp-caption alignnone" style="width: 720px;">
<p><a rel="attachment wp-att-579" href="http://www.cloudera.com/?attachment_id=579"><img class="size-full wp-image-579" title="Screen shot 2010-12-30 at 10.16.17 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-30-at-10.16.17-PM.png" alt="Map-Reduce Job Tracker Page (part 2)" width="710" height="616" /></a></p>
<p class="wp-caption-text">Map-Reduce Job Tracker Page (part 2)</p>
</div>
<div id="attachment_580" class="wp-caption alignnone" style="width: 772px;">
<p><a rel="attachment wp-att-580" href="http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/what%e2%80%99s-new-in-hadoop-core-020/"><img class="size-full wp-image-580" title="Screen shot 2010-12-30 at 10.16.44 PM" src="http://www.philwhln.com/wp-content/uploads/2010/12/Screen-shot-2010-12-30-at-10.16.44-PM.png" alt="Map-Reduce Job Tracker Page (part 3) Graphs" width="762" height="551" /></a></p>
<p class="wp-caption-text">Map-Reduce Job Tracker Page (part 3) Graphs</p>
</div>
<p><a name="the-results"></a></p>
<h2>The Results</h2>
<p>Our map-reduce job has run successfully using Ruby. Let&#8217;s have a look at the output.</p>
<p>
<pre class="code">hadoop fs -ls output

Found 3 items
-rw-r--r--   3 phil supergroup          0 2010-12-30 21:46 /user/phil/output/_SUCCESS
drwxrwxrwx   - phil supergroup          0 2010-12-30 21:45 /user/phil/output/_logs
-rw-r--r--   3 phil supergroup       2341 2010-12-30 21:46 /user/phil/output/part-00000</pre>
</p>
<p>Hadoop output is written in chunks to sequential files part-00000, part-00001, part-00002 and so on. Our dataset is very small, so we only have one 2 KB file called part-00000.</p>
<p>
<pre class="code">hadoop fs -cat output/part-00000 | head
aa	13
ab	666
ac	1491
ad	867
ae	337
af	380
ag	507
ah	46
ai	169
aj	14</pre>
</p>
<p>Our map-reduce script counted 13 words starting with &#8220;aa&#8221;, 666 words starting with &#8220;ab&#8221; and 1491 words starting with &#8220;ac&#8221;.</p>
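<p>Once the counts are available, a few more lines of Ruby can answer simple questions about them, such as which prefixes are the most common. This is a minimal sketch: the helper name <em>top_prefixes</em> is hypothetical, and the sample data is just the first six lines of the output shown above.</p>
<pre class="code"># Hypothetical post-processing helper: parse "prefix, tab, count" lines and
# return the n most frequent prefixes as [prefix, count] pairs.
def top_prefixes(lines, n)
  pairs = lines.map do |line|
    key, value = line.chomp.split("\t", 2)
    [key, value.to_i]
  end
  pairs.sort_by { |_key, count| -count }.first(n)
end

# Sample copied from the first lines of part-00000.
sample = ["aa\t13", "ab\t666", "ac\t1491", "ad\t867", "ae\t337", "af\t380"]
top_prefixes(sample, 3).each { |key, count| puts key + "\t" + count.to_s }</pre>
<p>To run it against the real output, you could replace the sample with <em>ARGF.readlines</em> and pipe the result of <em>hadoop fs -cat output/part-00000</em> into the script.</p>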
<p><a name="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Yes, it is overkill to use Hadoop and a (very small) cluster of cloud-based machines for this example, but I think it demonstrates how you can quickly get your Hadoop cluster up and running map-reduce jobs written in Ruby. You can use the same procedure to fire-up a much larger and more powerful Hadoop cluster with a bigger dataset and more complex Ruby scripts.</p>
<p><strong>Please post any questions or suggestions you have in the comments below. They are always highly appreciated.</strong></p>
<p><a name="resources"></a></p>
<h2>Resources</h2>
<ul>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','hadoop.apache.org']);" href="http://hadoop.apache.org/">Apache Hadoop</a></li>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.cloudera.com']);" href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution for Apache Hadoop (CDH)</a></li>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.cloudera.com']);" href="http://www.cloudera.com/downloads/virtual-machine/">Cloudera Hadoop Training VMWare Image</a></li>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.slideshare.net']);" href="http://www.slideshare.net/philwhln/map-reduce-using-perl">Map-Reduce Using Perl</a></li>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','code.google.com']);" href="http://code.google.com/p/jclouds/">jclouds </a></li>
<li><a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','en.wikipedia.org']);" href="http://en.wikipedia.org/wiki/Words_(Unix)">Words file on Unix-like operating systems</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A profile of Apache Hadoop MapReduce computing efficiency (continued)</title>
		<link>http://www.cloudera.com/blog/2010/12/a-profile-of-hadoop-mapreduce-computing-efficiency-continued/</link>
		<comments>http://www.cloudera.com/blog/2010/12/a-profile-of-hadoop-mapreduce-computing-efficiency-continued/#comments</comments>
		<pubDate>Wed, 15 Dec 2010 14:00:39 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[cloudera's distribution for hadoop]]></category>
		<category><![CDATA[computing efficency]]></category>
		<category><![CDATA[hadoop efficiency]]></category>
		<category><![CDATA[sra]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5650</guid>
		<description><![CDATA[Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions. Part II Previously we proposed how we measure the performance in Hadoop MapReduce applications in an effort to better understand the computing efficiency. In this part, we&#8217;ll describe some results and illuminate both good and bad [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.</strong></em></p>
<h2>Part II</h2>
<p>Previously we proposed how we measure the performance in Hadoop MapReduce applications in an effort to better understand the computing efficiency. In this part, we&#8217;ll describe some results and illuminate both good and bad characteristics.</p>
<p>We selected our SIFT-M MapReduce application, described in our presentation at Hadoop World 2010 <a NAME=3><sup>[3]</sup></a>, as the candidate algorithm for Node Scalability since it is embarrassingly parallel and is representative of compute-intensive applications where the bulk of work is computation and not data movement. The Terasort MapReduce benchmark, which is distributed with the Hadoop codebase, is used for the data scalability tests since it has a greater dependence on the distribution of data than the SIFT algorithm. The Yahoo implementation gained recognition in 2009 for breaking the terabyte sorting benchmark by sorting 100 TB in 173 minutes<a NAME=4><sup>[4]</sup></a>.</p>
<p>Examples indicating saturation and steady-state conditions are given in the following plots.</p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces.png"><img class="alignleft size-full wp-image-5651" title="SIFTM on Yale Faces" src="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces.png" alt="Hadoop MapReduce, SIFTM on Yale Face - Cloud1" width="650" height="443" /></a></strong></p>
<p><strong>Figure 1 SIFT-M phase plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud1.png"><img class="alignleft size-full wp-image-5652" title="SIFTM on Yale Faces-cloud1" src="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud1.png" alt="Hadoop MapReduce SIFTM on Yale Faces" width="641" height="442" /></a></strong></p>
<p><strong>Figure 2 SIFT-M task rate plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage.png"><img class="alignleft size-full wp-image-5653" title="CPU Usage" src="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage.png" alt="Apache MapReduce SIFTM SRA" width="817" height="631" /></a></strong></p>
<p><strong>Figure 3 SIFT-M Ganglia CPU plot.</strong></p>
<p><strong>Figure 1</strong>,<strong> Figure 2</strong>, and <strong>Figure 3</strong> depict our SIFT-M job on a cluster with a total of 56&#215;2=112 PEs, the maximum number of task slots. We see from <strong>Figure 1</strong> that the task count is maximized for the duration of the benchmark, indicating the system is saturated and has reached steady state. The task curve plotted in <strong>Figure 2</strong> is linear, which means a constant arrival rate, as expected for steady state; the rate is approximately 3 tasks per second. Further evidence of the CPU saturation is provided by the Ganglia hardware metrics in <strong>Figure 3</strong>. The next plots in <strong>Figure 4, Figure 5,</strong> and <strong>Figure 6</strong> demonstrate an under-utilized cluster which supports 256 PEs. We can see that concurrency is not maintained in the map phase: the map task assignment never reaches a constant rate and declines rapidly. The reduce phase reaches steady state but is far from saturating the available execution slots.</p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/DataScalability-TB-terasort.png"><img class="alignleft size-full wp-image-5654" title="DataScalability TB terasort" src="https://www.cloudera.com/wp-content/uploads/2010/12/DataScalability-TB-terasort.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="647" height="442" /></a></strong></p>
<p><strong>Figure 4 Under-utilized Terasort phase plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/DataScalability-TB-terasort-cloud3.png"><img class="alignleft size-full wp-image-5655" title="DataScalability TB terasort-cloud3" src="https://www.cloudera.com/wp-content/uploads/2010/12/DataScalability-TB-terasort-cloud3.png" alt="SRA Cloudera Hadoop MapReduce guest blog" width="631" height="448" /></a></strong></p>
<p><strong>Figure 5 Under-utilized Terasort task rate plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage-2.png"><img class="alignleft size-full wp-image-5656" title="CPU Usage 2" src="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage-2.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="813" height="654" /></a></strong></p>
<p><strong>Figure 6 Under-utilized Terasort Ganglia CPU plot.</strong></p>
<p>An example of a Terasort job with better system utilization is depicted in Figure 7, Figure 8, and Figure 9. Note there are multiple shuffle, sort, and reduce phases. This can arise when there are more reduce tasks allocated than available slots so the reduce phases complete in &#8220;waves&#8221;. It is important to balance the concurrency between the map and reduce phases to achieve the best performance.</p>
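<p>The &#8220;waves&#8221; effect can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical and the task and slot counts are made up, not taken from our benchmarks.</p>

```python
def reduce_waves(num_reduce_tasks, reduce_slots):
    """Number of sequential rounds needed to run all reduce tasks."""
    if not reduce_slots > 0:
        raise ValueError("reduce_slots must be positive")
    return -(-num_reduce_tasks // reduce_slots)  # ceiling division


def last_wave_utilization(num_reduce_tasks, reduce_slots):
    """Fraction of reduce slots that are busy during the final wave."""
    waves = reduce_waves(num_reduce_tasks, reduce_slots)
    if waves == 0:
        return 0.0
    tasks_in_last_wave = num_reduce_tasks - (waves - 1) * reduce_slots
    return tasks_in_last_wave / reduce_slots


# Made-up example: 250 reduce tasks on a cluster with 100 reduce slots.
print(reduce_waves(250, 100))           # three waves
print(last_wave_utilization(250, 100))  # half the slots idle in the last wave
```

<p>Choosing the number of reduce tasks close to a multiple of the available slots keeps the final wave from leaving most of the cluster idle.</p>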
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/terasort-cloud1.png"><img class="alignleft size-full wp-image-5657" title="terasort-cloud1" src="https://www.cloudera.com/wp-content/uploads/2010/12/terasort-cloud1.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="620" height="435" /></a></strong></p>
<p><strong>Figure 7 Terasort phase plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/Terasort-cloud1-line.png"><img class="alignleft size-full wp-image-5658" title="Terasort-cloud1-line" src="https://www.cloudera.com/wp-content/uploads/2010/12/Terasort-cloud1-line.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="625" height="435" /></a></strong></p>
<p><strong>Figure 8 Terasort task rate plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage-3.png"><img class="alignleft size-full wp-image-5659" title="CPU Usage 3" src="https://www.cloudera.com/wp-content/uploads/2010/12/CPU-Usage-3.png" alt="SRA and Cloudera Hadoop MapReduce guest post" width="789" height="609" /></a></strong></p>
<p><strong>Figure 9 Terasort Ganglia CPU plot.</strong></p>
<p>The histogram plots for the SIFT-M job identify the variance in workload and performance per host, evident in <strong>Figure 10, Figure 11,</strong> and <strong>Figure 12.</strong> Clearly the &#8220;red&#8221; host is struggling with the tasks.</p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-tasks.png"><img class="alignleft size-full wp-image-5660" title="SIFTM on Yale Faces-cloud-tasks" src="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-tasks.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="615" height="443" /></a></strong></p>
<p><strong>Figure 10 SIFT-M histogram of task count.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-duration.png"><img class="alignleft size-full wp-image-5661" title="SIFTM on Yale Faces-cloud-duration" src="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-duration.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="617" height="444" /></a></strong></p>
<p><strong>Figure 11 SIFT-M histogram of task duration.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-tasks1.png"><img class="alignleft size-full wp-image-5662" title="SIFTM on Yale Faces-cloud-tasks" src="https://www.cloudera.com/wp-content/uploads/2010/12/SIFTM-on-Yale-Faces-cloud-tasks1.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="615" height="443" /></a></strong></p>
<p><strong>Figure 12 SIFT-M histogram of I/O in bytes.</strong></p>
<p>The final plots depict the scalability performance. The plot in <strong>Figure 13</strong> compares the node scalability between all the clusters. All clusters scale linearly for the fixed-sized study as expected. <strong>Figure 14</strong> displays the data scalability for the scaled-sized problem. We find that each cluster exhibits a decrease in throughput as the input scales, correlating with a drop in CPU utilization, which we attribute to greater latency in moving the data to the CPU. The dramatic drop in the early data points is likely due to the input fitting entirely in memory.</p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/Node-Scalability.png"><img class="alignleft size-full wp-image-5663" title="Node Scalability" src="https://www.cloudera.com/wp-content/uploads/2010/12/Node-Scalability.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="784" height="475" /></a></strong></p>
<p><strong>Figure 13 Node scalability plot.</strong></p>
<p><strong><a href="https://www.cloudera.com/wp-content/uploads/2010/12/PreliminaryDataScalability.png"><img class="alignleft size-full wp-image-5664" title="PreliminaryDataScalability" src="https://www.cloudera.com/wp-content/uploads/2010/12/PreliminaryDataScalability.png" alt="SRA and Cloudera Hadoop MapReduce guest blog" width="790" height="469" /></a></strong></p>
<p><strong>Figure 14 Data scalability plot.</strong></p>
<p>We expect embarrassingly parallel Hadoop MapReduce applications to scale linearly, and our Node Scalability results align very well with this expectation. The preliminary results of our Data Scalability study indicate performance degrades with increasing input as a result of less data locality and higher demand on disk and network resources. Since input is necessarily shared by all compute hosts, contention on the same disks increases with input size even when different file blocks are requested. The imposed data dependency impedes an important advantage of MapReduce, the overlap of communication and computation. The MapReduce paradigm masks the latency from information requests by transferring map output to the reduce hosts during the map phase, known as shuffling, rather than waiting for all map tasks to complete. But the map tasks are also requesting data between hosts because of the cluster-wide data dependency, and so the shuffling phase contends for the very same network and storage resources.</p>
<p>Although we used the default Hadoop FIFO scheduler, we suggest that selecting different job schedulers and increasing the replication factor, HDFS block size, and virtual memory page size, in isolation or in combination, can improve performance. Developers may also need to create specialized Partitioner and Combiner classes to address data skew. A judicious choice of the key-space partitioning and the number of reduce tasks will have a significant impact on performance. Since the map and reduce phases overlap, care is needed to avoid over-subscribing the system during the map phase while also avoiding under-utilization of resources in the reduce phase. A system with many smaller disks rather than a few large disks is recommended, in conjunction with high compute density.</p>
<h2><strong>Acknowledgement</strong></h2>
<p>I would like to thank Jonathan Jarrett for developing the scripts for the Gnuplot charts and automated benchmarks, and thank David Ritch and Adam Watts for administering our clusters. I also thank the rest of the SRA team for their support.</p>
<hr size="1" />
<p><a href="#3">[3]</a>  http://www.cloudera.com/resource/hw10_sifting_clouds<br />
<a href="#4">[4]</a>  http://sortbenchmark.org</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/12/a-profile-of-hadoop-mapreduce-computing-efficiency-continued/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop, Part 2</title>
		<link>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop-part-2/</link>
		<comments>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop-part-2/#comments</comments>
		<pubDate>Thu, 18 Nov 2010 14:00:50 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pagerank]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5437</guid>
		<description><![CDATA[Continued Guest Post from Michael Schatz and Jimmy Lin Part 2: Efficient Graph Analysis in Hadoop with Schimmy In part 1, we looked at how extremely large graphs can be represented and analyzed in Hadoop/MapReduce. Here in part 2 we will examine this design in more depth to identify inefficiencies, and present some simple solutions [...]]]></description>
			<content:encoded><![CDATA[<p><em>Continued Guest Post from</em> <a href="http://schatzlab.cshl.edu">Michael Schatz</a> <em>and</em> <a href="http://www.umiacs.umd.edu/~jimmylin/">Jimmy Lin</a></p>
<p><strong>Part 2: Efficient Graph Analysis in Hadoop with Schimmy</strong></p>
<p>In part 1, we looked at how extremely large graphs can be represented and analyzed in Hadoop/MapReduce. Here in part 2 we will examine this design in more depth to identify inefficiencies, and present some simple solutions that can be applied to many Hadoop/MapReduce graph algorithms. The speedup using these techniques is substantial: as a prototypical example, we were able to reduce the running time of PageRank on a webgraph with 50.2 million vertices and 1.4 billion edges by as much as 69% on a small 20-core Hadoop cluster at the University of Maryland (full details available <a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Schatz_MLG2010.pdf">here</a>). We expect that similar levels of improvement will carry over to many of the other problems we discussed before (the Kevin Bacon game, and DNA sequence assembly in particular).</p>
<p>As explained in part 1, when processing graphs in MapReduce, the mapper emits &#8220;messages&#8221; from vertices to their neighbors, and the shuffle phase effectively routes each message to the proper destination vertex. For computationally trivial problems such as PageRank, the running time is dominated by shuffling data across the network, so anything we can do to reduce that data will also decrease the running time. Fortunately, in many cases we can reduce the data volume by combining multiple messages destined for the same vertex into a single message using a combiner &#8211; remember, a combiner is a &#8220;mini-reducer&#8221; that executes a function on a subset of the values with the same key. Because the subset may vary from run to run, combiners can only be safely used if the function is associative and commutative &#8211; that is, when the order in which values are processed doesn&#8217;t change the result.</p>
<p>For computations like PageRank, where the reducer merely sums the values received from neighboring vertices, we can safely sum values in any order, including summing values in a combiner before the messages are even sent across the network. If there are many messages destined for the same vertex emitted from mappers on a single machine, the combiner can replace all those messages with just one, and save substantial network communication costs (without changing the final result). Using combiners is a standard best practice for algorithms in MapReduce (graphs and otherwise), and in our experiments it improved the running time of PageRank by 18%.</p>
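<p>As a small illustration of why this is safe for sums, the following Python sketch (not the code from our experiments) applies the same summing function once as a combiner on each machine and once as the reducer. The machine names, vertex ids, and values are invented for the example.</p>

```python
from collections import defaultdict


def sum_by_vertex(messages):
    """Sum message values per destination vertex.

    Because addition is associative and commutative, this one function
    can serve both as the combiner and as the reducer.
    """
    totals = defaultdict(float)
    for vertex, value in messages:
        totals[vertex] += value
    return dict(totals)


# Messages emitted by the mappers on two hypothetical machines.
machine_a = [("v1", 0.25), ("v2", 0.5), ("v1", 0.25)]
machine_b = [("v1", 0.125), ("v2", 0.125), ("v2", 0.25)]

# Without a combiner, every message crosses the network.
without_combiner = machine_a + machine_b

# With a combiner, each machine sends one message per destination vertex.
with_combiner = (list(sum_by_vertex(machine_a).items())
                 + list(sum_by_vertex(machine_b).items()))

print(len(without_combiner), len(with_combiner))
print(sum_by_vertex(without_combiner) == sum_by_vertex(with_combiner))
```

<p>The reducer sees fewer messages in the second case, yet it computes the same totals.</p>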
<p>As a novel twist, we also evaluated a technique called &#8220;in-mapper combiners&#8221; instead of generic combiners. The idea is that instead of immediately emitting messages, the mapper first buffers them inside a mapper-local hash table stored in memory. If we get lucky, the mapper will create many messages destined for the same vertices right after each other, and we can immediately update the combined values in memory, without having to pay the costs to serialize, write, sort, read, and deserialize the individual messages from disk. We do have to periodically flush the in-mapper table, though, so we don&#8217;t overflow the memory available on the machine. You can see where the name of this technique comes from: in essence, we are doing the combining right inside the mapper! In our experiments, using in-mapper combiners further improved performance by another 16% beyond the generic combiner.</p>
<p>If you have a commutative and associative computation, combiners will almost always help the performance of your algorithm. The magnitude of the improvement (18% vs. 5% vs. 34%), though, will very much depend on your cluster environment and the distribution of the graph in that environment. In the worst case, none of the vertices stored on a single machine share a common neighbor, so the combiner (or in-mapper combiner) won&#8217;t help at all. What we would like is for all neighboring vertices (or all tightly connected vertices) to be stored on the same machine so that all their messages can be efficiently combined. Unfortunately, optimally distributing the graph like this is a hard clustering problem in itself, but in some cases we can use simple heuristics that are very effective.</p>
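<p>A rough sketch of the buffering and flushing logic is shown below in Python. It is a simplified stand-in, not the Hadoop implementation: the class name, the <code>emit</code> callback, and the <code>max_entries</code> threshold are all hypothetical, and a real mapper would bound the table by memory use rather than by entry count.</p>

```python
class InMapperCombiner:
    """Buffers (vertex, value) messages in memory and sums them per vertex."""

    def __init__(self, emit, max_entries=1000):
        self.emit = emit                # callback that receives (vertex, value)
        self.max_entries = max_entries  # flush once the table grows this large
        self.buffer = {}

    def add(self, vertex, value):
        """Fold a new message into the in-memory table."""
        self.buffer[vertex] = self.buffer.get(vertex, 0.0) + value
        if len(self.buffer) >= self.max_entries:
            self.flush()

    def flush(self):
        """Emit every buffered partial sum and empty the table."""
        for vertex, value in self.buffer.items():
            self.emit(vertex, value)
        self.buffer.clear()


emitted = []
combiner = InMapperCombiner(lambda v, x: emitted.append((v, x)), max_entries=2)
for vertex, value in [("v1", 0.25), ("v1", 0.25), ("v2", 0.5), ("v3", 0.125)]:
    combiner.add(vertex, value)
combiner.flush()  # final flush, as a mapper would do when it finishes
print(emitted)
```

<p>Four incoming messages leave the mapper as three, and the two messages for <code>v1</code> are merged without ever being serialized.</p>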
<p>The default method for assigning vertices to machines (both when the graph is stored in HDFS and during a MapReduce job) is a more-or-less random function called the HashPartitioner, which computes the hash value of the vertex id (key) to partition the graph. As a random function, this approach works well to balance the number of vertices processed and stored on each machine. Furthermore, if the graph is roughly uniformly connected, then there is a good chance that some neighboring vertices will be processed together, which will then benefit from the combiner. However, many real world networks have higher-order structure that this approach won&#8217;t capture. For example, the Web is organized into domains, and webpages in the same domain generally have many more links to each other than to other remote corners of the Web. The default HashPartitioner is totally blind to this structure, which means the combiner will lose opportunities to aggregate partial results.</p>
<p>We could exploit some of the network structure if only we could assign graph vertices to machines in blocks by domain instead of randomly. Surprisingly, this level of control is relatively easy to implement: instead of referencing vertices by some arbitrary vertex id, we use consecutive numbers derived from some attribute of the vertices. For example, we could preprocess the webgraph and alphabetically sort all the URLs. We then renumber vertices based on the sort order (i.e., http://aaa.com/index.html is id 1, http://aaa.com/welcome.html is id 2, http://abc.com/index.html is id 3, etc.). By virtue of the sorting, pages from the same domain will form blocks of consecutive ids.</p>
<p>The only remaining challenge is to partition the vertices using ranges instead of hash values. This can be achieved using a variant of the RangePartitioner used for the Terasort challenge, where Hadoop/MapReduce is used to globally sort a list of values by partitioning the space of keys. Conceptually, webpages 1-100,000 are processed together, webpages 100,001 &#8211; 200,000 are processed together, and so forth (the actual size of the blocks will vary). This way, each web domain will be assigned to a single or perhaps a few machines and thus the combiner will be much more effective. If the range boundaries don&#8217;t exactly coincide with the domain boundaries, it causes a small bit of missed locality, but overall using a RangePartitioner leads to a huge improvement in performance over the HashPartitioner: in our experiments, a 40% improvement in running time of PageRank on our webgraph just by renumbering and repartitioning the graph!</p>
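<p>The renumbering and the two partitioning strategies can be compared with a toy Python sketch. The functions below are simplified stand-ins with hypothetical names, not Hadoop&#8217;s partitioner classes: the hash partitioner is approximated by a modulo on the vertex id, and the blocks have a fixed size of two.</p>

```python
def renumber(urls):
    """Assign consecutive ids (starting at 1) in alphabetical URL order."""
    return {url: i for i, url in enumerate(sorted(urls), start=1)}


def hash_partition(vertex_id, num_partitions):
    """Stand-in for a hash partitioner: scatter ids across partitions."""
    return vertex_id % num_partitions


def range_partition(vertex_id, block_size):
    """Ids 1..block_size go to partition 0, the next block to 1, and so on."""
    return (vertex_id - 1) // block_size


urls = [
    "http://abc.com/index.html",
    "http://aaa.com/welcome.html",
    "http://abc.com/about.html",
    "http://aaa.com/index.html",
]
ids = renumber(urls)

for url in sorted(urls):
    vertex_id = ids[url]
    print(url, vertex_id,
          "hash:", hash_partition(vertex_id, 2),
          "range:", range_partition(vertex_id, 2))
```

<p>In this toy input the range partitioner keeps each domain on one partition, while the modulo stand-in splits both domains across the two partitions.</p>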
<p>The final inefficiency we explored was whether there was a way to more effectively distribute the graph structure within an iterative MapReduce graph algorithm. Remember that in the standard design, the mapper emits messages for neighboring vertices, and also reemits the graph vertices themselves so that they will be shuffled together to the same reducer. This ensures the graph structure is available in the right reducer, but is not efficient. For example, if we store many attributes in the graph vertices (links to neighbors, text of the webpage, embedded images, date of collection, etc.), and only distribute messages to a small number of neighbors (perhaps just from vertices with a certain keyword), the vast majority of data emitted and shuffled will be the graph structure with hardly any computationally meaningful data exchanged.</p>
<p>This spring, we asked ourselves if it was truly necessary to do so, especially when the graph structure does not change at all between iterations in PageRank. As a result of this discussion, we came up with a new technique called Schimmy (hint: this is how Schatz + Jimmy think about graphs) that separates the messages (mutable-data flow) from the graph structure (immutable-data flow). The two key observations that make it work are 1) the partitioning of vertices (assignment of vertices to machines) is stable across MapReduce iterations so that conceptually once a vertex is assigned to machine <em>X</em>, it is always processed on machine <em>X</em>, and 2) it is possible to merge sort intermediate key-value pairs with &#8220;side data&#8221; in the reducer. By default, MapReduce only sorts key-value pairs emitted by the mappers, but if we are willing to go a little outside the standard APIs, we can merge together messages and vertices from separate sorted data streams.</p>
<p>In Schimmy, the mapper iterates over the input vertices more-or-less as before but only emits messages and not the vertex tuples (i.e., graph structure). The messages are then combined and shuffled as before to route them to the proper destination machines. Then, at the last possible moment, the reducer code merges together the shuffled messages with the vertices (the same files that the mappers processed). The reducer then computes the new values for each vertex, and stores the final updated graph as before. If you&#8217;re a database guru, you might describe this as a reduce-side parallel merge join between vertices and messages.</p>
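<p>The reduce-side merge can be pictured with a small Python sketch. This is a simplified illustration with a hypothetical function name, not the Cloud9 implementation: it assumes both streams are already sorted by vertex id, that they cover the same partition, and that every message is addressed to a vertex present in the vertex stream.</p>

```python
def schimmy_reduce(vertices, messages):
    """Merge-join two streams that are both sorted by vertex id.

    vertices: iterable of (vertex_id, neighbors), read from the graph files
    messages: iterable of (vertex_id, value), delivered by the shuffle
    Yields (vertex_id, neighbors, summed_value) for every vertex.
    """
    message_iter = iter(messages)
    pending = next(message_iter, None)
    for vertex_id, neighbors in vertices:
        total = 0.0
        # Consume all messages addressed to the current vertex.
        while pending is not None and pending[0] == vertex_id:
            total += pending[1]
            pending = next(message_iter, None)
        yield vertex_id, neighbors, total


# Toy partition: graph structure from disk, messages from the shuffle.
vertices = [(1, [2, 3]), (2, [3]), (3, [1])]
messages = [(1, 0.5), (2, 0.25), (3, 0.25), (3, 0.5)]

for record in schimmy_reduce(vertices, messages):
    print(record)
```

<p>Only the messages travel through the shuffle; the vertex tuples are read sequentially from the same files the mappers processed.</p>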
<p>In essence, Schimmy short-circuits reshuffling the vertices because the vertices get shuffled the same way in every iteration. This slight change saves substantial time, between 10% and 20% of each iteration in our prototypical analysis of PageRank, leading to an overall saving of 69% when used in conjunction with RangePartitioning and in-mapper combining over the previous best practice of using a combiner. In some ways, though, PageRank is the worst case for Schimmy in that PageRank sends messages from every vertex to every neighbor in every iteration. In less extreme algorithms (such as the first round of the Kevin Bacon game), where only a small fraction of vertices send messages to their neighbors, Schimmy should be even more effective.</p>
<p>In conclusion, Schimmy and in-mapper combiners make graph algorithms faster because they reduce the total amount of computation and the total volume of data to exchange. In-mapper combining is also effective because RAM is orders of magnitude faster than disk and because it can then completely skip the work of serializing &amp; deserializing intermediate key-value pairs. RangePartitioning is effective because it can drastically improve locality, leading to more opportunities for local aggregation (and hence less data to shuffle across the network). These ideas are widely applicable (and perhaps obvious in retrospect), but are often overlooked in distributed environments where the initial temptation may be to simply throw more machines at the problem. While a brute force approach makes sense for exploratory data analysis, once we know what we need, it pays to refine the algorithm by exploiting locality and reducing data volumes, especially at the slowest levels of the storage hierarchy. This is simply good computer science!</p>
<p style="text-align: center;"><a href="https://www.cloudera.com/wp-content/uploads/2010/11/webgraph_summary1.png"><img class="size-full wp-image-5449  aligncenter" title="webgraph_summary" src="https://www.cloudera.com/wp-content/uploads/2010/11/webgraph_summary1.png" alt="" width="1050" height="750" /></a></p>
<p style="text-align: left;">We hope you have found these articles interesting! For a more in-depth discussion of these techniques, please see the Schimmy paper referenced above and the <a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pagerank.html">Schimmy reference implementation in Cloud9</a>, a library developed at the University of Maryland designed both to serve as a teaching tool and to support research in data-intensive text processing.</p>
<p>Note, for clarity we have glossed over the details of how tasks and file splits are scheduled on worker machines. In particular, file splits are arbitrarily scheduled on different machines in different iterations (the assignment of tasks is unstable), but reduce tasks can peek inside the splits to simulate a stable scheduling. To all the Hadoop hackers out there &#8211; we would love to work with you to develop a stable task scheduler that could be used to further cut unnecessary network traffic and improve performance. There are some significant technical challenges to make this approach work, especially to do so while ensuring reliability, but such an addition would enable some advanced techniques that are currently not possible. Ideally, this would lead to the development of a premier open-source large-scale graph analysis system, built directly on top of Hadoop MapReduce. Perhaps it will eventually prove superior to Google&#8217;s latest closed-source graph processing system Pregel.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop</title>
		<link>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop/#comments</comments>
		<pubDate>Mon, 15 Nov 2010 14:00:50 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA["dna sequencing"]]></category>
		<category><![CDATA["human genomes"]]></category>
		<category><![CDATA[dna]]></category>
		<category><![CDATA[genomes]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5366</guid>
		<description><![CDATA[Guest Post by Michael Schatz and Jimmy Lin Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research interests are in developing large-scale DNA sequence analysis methods to search for DNA sequence variations related to autism, cancer, and other human diseases, and also to assemble [...]]]></description>
			<content:encoded><![CDATA[<h3>Guest Post by <a href="http://schatzlab.cshl.edu">Michael Schatz</a><span style="color: #0689f8;"> </span><span style="color: #048fe5;"><span style="color: #000000;">and <a href="http://www.umiacs.umd.edu/~jimmylin/">Jimmy Lin</a></span></span></h3>
<p><em>Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research interests are in developing large-scale DNA sequence analysis methods to search for DNA sequence variations related to autism, cancer, and other human diseases, and also to assemble the genomes of new organisms. Given the recent tremendous advances of DNA sequencing technologies, Michael has pioneered the use of Hadoop and cloud computing for accelerating genomics, as described in a guest <a href="http://www.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/">blog post last fall</a>.</em></p>
<p><em>Jimmy Lin is an associate professor in the College of Information Studies at the University of Maryland. His research lies at the intersection of information retrieval and natural language processing, with an emphasis on large-scale distributed algorithms. Currently, Jimmy is spending his sabbatical at Twitter.</em></p>
<p><strong> </strong></p>
<h2><strong>Part 1: Graphs and Hadoop</strong></h2>
<p>Question: What do PageRank, the Kevin Bacon game, and DNA sequencing all have in common?</p>
<p>As you might know, PageRank is one of the many features Google uses for computing the importance of a webpage based on the other pages that link to it. The intuition is that pages linked from many important pages are themselves important. In the Kevin Bacon game, we try to find the shortest path from Kevin Bacon to your favorite movie star based on whom they costarred with. For example, there is a 2-hop path from Kevin Bacon to Jason Lee: Kevin Bacon starred in A Few Good Men with Tom Cruise, who also starred in Vanilla Sky with Jason Lee. In the case of DNA sequencing, we compute the full genome sequence of a person (~3 billion nucleotides) from many short DNA fragments (~100 nucleotides) by constructing and searching the genome assembly graph. The assembly graph connects fragments with the same or similar sequences, and thus long paths of a particular form can spell out entire genomes.</p>
<p>The common aspect for these and countless other important problems, including those in defense &amp; intelligence, recommendation systems &amp; machine learning, social networking analysis, and business intelligence, is the need to analyze enormous graphs: the Web consists of trillions of interconnected pages, IMDB has millions of movies and movie stars, and sequencing a single human genome requires searching for paths between billions of short DNA fragments. At this scale, searching or analyzing a graph on a single machine would be time-consuming at best and totally impossible at worst, especially when the graph cannot possibly be stored in memory on a single computer.</p>
<p>Fortunately, Hadoop and MapReduce can enable us to tackle the largest graphs around by scaling up many graph algorithms to run on entire clusters of commodity machines. The idea of using MapReduce for large-scale graph analysis is as old as MapReduce itself &#8211; PageRank was one of the original applications for which Google developed MapReduce.</p>
<p>Formally, graphs consist of vertices (also called nodes) and edges (also called links). Edges may be &#8220;directed&#8221; (e.g., hyperlinks on the Web) or &#8220;undirected&#8221; (e.g., costars in movies). For convenient processing in MapReduce, graphs are stored as key-value pairs, in which the key is the vertex id (URL, movie name, etc.), and the value is a complex record called a &#8220;tuple&#8221; that contains the list of neighboring vertices and any other attributes of the vertex (text of the webpage, date of the movie, etc.). The key point is that the graph will be distributed across the cluster, so different portions of the graph, including direct neighbors, may be stored on physically different machines. Nevertheless, we can process the graph in parallel using Hadoop/MapReduce to compute PageRank or solve the Kevin Bacon game without ever loading the entire graph on one machine.</p>
<p>Graph algorithms in Hadoop/MapReduce generally follow the same pattern of execution: (1) in the map phase, some computation is independently executed on all the vertices in parallel, (2) in the shuffle phase, the partial results of the map phase are passed along the edges to neighboring vertices, including when those vertices are located on physically different machines, and (3) in the reduce phase, the vertices compute a new value based on all the incoming values (once again in parallel). Generically, we can speak of vertices passing &#8220;messages&#8221; to their neighbors. For example, in PageRank the current PageRank value of each vertex is divided up and distributed to its neighbors in the map and shuffle phases, and in the reduce phase the destination vertices compute their updated PageRank value as the sum of the incoming values. If necessary, the algorithm can iterate and rerun the MapReduce code multiple times, each time updating a vertex&#8217;s value based on the new values passed from its neighbors.</p>
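<p>The three phases can be sketched in plain Python running on a single machine. This is only an illustration of the pattern, not Hadoop code: the function name <code>pagerank_iteration</code> and the toy graph are made up for the example, and the sketch assumes every vertex has at least one outgoing edge and leaves out the damping factor used in full PageRank.</p>

```python
from collections import defaultdict


def pagerank_iteration(graph, ranks):
    """One map/shuffle/reduce pass over an adjacency-list graph."""
    # Map: every vertex splits its current rank evenly among its out-neighbors.
    messages = []
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            messages.append((neighbor, ranks[vertex] / len(neighbors)))

    # Shuffle: group the messages by destination vertex.
    inbox = defaultdict(list)
    for destination, share in messages:
        inbox[destination].append(share)

    # Reduce: each vertex sums its incoming shares to get its new rank.
    return {vertex: sum(inbox[vertex]) for vertex in graph}


# Toy graph (hypothetical): vertex id -> list of out-neighbors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {vertex: 1.0 / len(graph) for vertex in graph}

# Rerun the same pass several times, as an iterative MapReduce job would.
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
```

<p>Because each vertex hands out exactly its own rank, the total rank in this simplified version stays constant from one iteration to the next.</p>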
<p>This algorithm design pattern fits the large class of graph algorithms that need to distribute &#8220;messages&#8221; between neighboring vertices. For search problems like the Kevin Bacon game, we can use this pattern to execute a &#8220;frontier search&#8221; that initially distributes the fact that there is a 1-hop path from Kevin Bacon to all of his immediate costars in the first MapReduce iteration. In the second MapReduce iteration the code extends these partial 1-hop paths to all of his 2-hop neighbors, and so forth until we find the shortest path to our favorite movie star. Be warned, though, that frontier search algorithms generally require space that is exponential in the search depth &#8211; therefore a na&#239;ve frontier search in MapReduce is not appropriate for searching for very deep connections: you may exhaust the disk storage of your cluster or wait a long time for the network to shuffle terabytes upon terabytes of intermediate data. In contrast, PageRank is computed using values just from immediate neighbors, and is therefore more suitable for parallelization with Hadoop/MapReduce.</p>
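<p>A single-machine sketch of the frontier search may help make the iterations concrete. Each pass of the <code>while</code> loop stands in for one MapReduce iteration. The function name and the tiny costar graph are assumptions for the example, and the sketch records only the hop count per vertex rather than every partial path, which is a deliberate simplification that avoids the exponential growth described above.</p>

```python
def frontier_search(costars, source, target):
    """Return the hop count of a shortest path, or None if unreachable."""
    distance = {source: 0}
    frontier = {source}
    hops = 0
    while frontier:
        if target in distance:
            return distance[target]
        hops += 1
        # One "MapReduce iteration": every frontier vertex messages its
        # neighbors, and vertices reached for the first time form the
        # next frontier.
        next_frontier = set()
        for vertex in frontier:
            for neighbor in costars.get(vertex, ()):
                if neighbor not in distance:
                    distance[neighbor] = hops
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return distance.get(target)


# Undirected costar graph from the example in Part 1 (toy data).
costars = {
    "Kevin Bacon": ["Tom Cruise"],
    "Tom Cruise": ["Kevin Bacon", "Jason Lee"],
    "Jason Lee": ["Tom Cruise"],
}
hops_to_jason = frontier_search(costars, "Kevin Bacon", "Jason Lee")
```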
<p>The other main technical challenge of MapReduce graph algorithms is that the graph structure must be available at each iteration, but in the design above we only distribute the messages (partial PageRank values, partial search paths, etc.). This challenge is normally resolved by &#8220;passing along&#8221; the graph structure from the mappers to the reducers. In more detail: the mapper reads in a vertex as input, emits messages for neighboring vertices using the neighboring vertex ids as the keys, and <em>also</em> reemits the vertex tuple with the current vertex id as the key. Then, as usual, the shuffle phase collects key-value pairs with the same key, which effectively collects together a vertex tuple with all the messages destined for that vertex (remember, this happens in parallel on multiple reducers). The reducer then processes each vertex tuple with its associated messages, computes an updated value, and saves away the updated vertex with the complete graph structure for the next iteration. But wait, you might ask: doesn&#8217;t this entail the mappers emitting two different types of values (messages destined for neighboring vertices and the graph structure)? Yes, this is handled by &#8220;tagging&#8221; each value to indicate which type it is, so that the reducer can process it appropriately. For more details about such graph algorithms, you may be interested in Jimmy Lin and Chris Dyer&#8217;s <a href="http://mapreduce.me/">recent book on MapReduce algorithm design</a>.</p>
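<p>The tagging idea can be illustrated with a small Python simulation. The tag names <code>GRAPH</code> and <code>MSG</code>, the record layout, and the helper <code>run_iteration</code> (which plays the role of the shuffle) are all assumptions made for this sketch, which also assumes that every vertex named as a neighbor has its own record in the input.</p>

```python
from collections import defaultdict


def mapper(vertex_id, vertex):
    """Emit the vertex tuple itself plus one rank message per neighbor."""
    # Re-emit the graph structure, keyed by this vertex's own id.
    yield vertex_id, ("GRAPH", vertex)
    # Emit a share of the rank to each neighbor, keyed by the neighbor's id.
    share = vertex["rank"] / len(vertex["neighbors"])
    for neighbor in vertex["neighbors"]:
        yield neighbor, ("MSG", share)


def reducer(vertex_id, tagged_values):
    """Rebuild the vertex from its GRAPH record and sum its MSG records."""
    vertex, total = None, 0.0
    for tag, payload in tagged_values:
        if tag == "GRAPH":
            vertex = payload
        else:
            total += payload
    # Save the updated value together with the unchanged neighbor list.
    return vertex_id, {"neighbors": vertex["neighbors"], "rank": total}


def run_iteration(graph):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    shuffled = defaultdict(list)
    for vertex_id, vertex in graph.items():
        for key, value in mapper(vertex_id, vertex):
            shuffled[key].append(value)
    return dict(reducer(key, values) for key, values in shuffled.items())


graph = {
    "a": {"neighbors": ["b", "c"], "rank": 1.0 / 3},
    "b": {"neighbors": ["c"], "rank": 1.0 / 3},
    "c": {"neighbors": ["a"], "rank": 1.0 / 3},
}
updated = run_iteration(graph)
```

<p>The output of <code>run_iteration</code> has the same shape as its input, so it can be fed straight into the next iteration.</p>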
<p>This basic design works, and with it we can compute PageRank, solve the Kevin Bacon game, assemble genomes, and attack many other large-scale graph problems. However, it has several inefficiencies that needlessly slow it down, such as poor use of locality and substantial unnecessary computation. In part two we will explore the causes of those inefficiencies and present a set of simple techniques we developed, called Schimmy, that can dramatically improve the runtime of virtually all Hadoop/MapReduce graph algorithms without requiring any changes to the underlying Hadoop implementation.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Tackling Large Scale Data in Government</title>
		<link>http://www.cloudera.com/blog/2010/11/tackling-large-scale-data-in-government/</link>
		<comments>http://www.cloudera.com/blog/2010/11/tackling-large-scale-data-in-government/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 14:00:56 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[boozallenhamilton]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[government]]></category>
		<category><![CDATA[large-scale data]]></category>
		<category><![CDATA[mahout]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5223</guid>
		<description><![CDATA[This is a guest post provided by Booz Allen Hamilton data analysis consultant, Aaron Cordova. &#160;Aaron specializes in large-scale distributed data processing systems. Working within the U.S. federal government arena provides plenty of opportunities to encounter large-scale data analysis. Projects ranging from massive health studies to high-velocity network security events to new sources of imagery [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>This is a guest post provided by Booz Allen Hamilton data analysis consultant, Aaron Cordova. &#160;Aaron specializes in large-scale distributed data processing systems.</strong></em></p>
<p>Working within the U.S. federal government arena provides plenty of opportunities to encounter large-scale data analysis. Projects ranging from massive health studies to high-velocity network security events to new sources of imagery and video represent a huge increase in the amount of data that must be not only stored but processed quickly and efficiently. These challenges are at once a daunting and exciting chance to turn data into a positive impact for the country. The large-scale data processing technologies created and supported by Cloudera play a big part in answering that challenge.</p>
<p>Often our clients have an immediate need to analyze the data at hand, to discover patterns, reveal threats, monitor critical systems, and make decisions about the direction the organization should take. Several constraints are always present: the need to implement new analytics quickly enough to capitalize on new data sources, limits on the scope of development efforts, and the pressure to expand mission capability without an increase in budgets. For many of these organizations, the large data processing stack (which includes the simplified programming model MapReduce, distributed file systems, semi-structured stores, and integration components, all running on commodity class hardware) has opened up a new avenue for scaling out efforts and enabling analytics that were impossible in previous architectures.</p>
<p>We have found this new ecosystem to be remarkably versatile at handling various types of data and classes of analytics. When working to help solve clients&#8217; large-scale data analysis problems, we first take a comprehensive look at their existing environment, the resources available, the nature of the data sources, and the immediate questions that must be answered from the data. Usually a large data processing solution is made up of several pieces that are combined into a system that provides the desired capability. This can range from real-time tipping to vast index and query capabilities to periodic and deep analysis of all the data available. Constructing a solution almost always requires one or more new and highly scalable components from the large-scale data analysis software stack, and integration with conventional data storage and processing software. Having the ability to pick and choose which elements of the stack to include, and having well-defined interfaces and in some cases interoperability standards, is essential to making the system work. This is a major reason that we value the open source community and the concept of the large-scale data analysis ecosystem.</p>
<p>Perhaps the most exciting benefit of moving to these highly scalable architectures, however, comes after we&#8217;ve solved the immediate issues, often with a system that can handle today&#8217;s requirements and scale up to 10x or more: new analytics and capabilities become incredibly easy to develop, evaluate, and integrate thanks to the speed and ease of MapReduce, Pig, Hive, and other technologies. More than ever, the large-scale data analysis software stack is proving to be a platform for innovation.</p>
<p>The response to the challenge of large-scale data analysis continues to emerge and there is room for ongoing innovation. One example of this is evident as large-scale data systems or clouds become more numerous; the task of integrating analysis across those clouds remains an area of open research. Even integrating data sources, existing systems, and delivery mechanisms within departments of the same enterprise can be a challenge and may require new solutions.</p>
<p>Recently when researching the problem of large-scale Biometric search, we at Booz Allen realized the need for a highly scalable and low-latency method of performing fuzzy matching, i.e. returning the most similar items when there is no exact match, in response to growing databases of fingerprints and requirements to identify individuals quickly. It was clear that Hadoop would provide a good platform on which to build a new distributed fuzzy matching capability for several reasons. The ability to run MapReduce over all stored biometrics would help in clustering the data to reduce the search space. The data distribution and replication features of HDFS would provide a reliable storage system with enough parallelism to support fast queries running on multiple machines. The result of our research is a system we developed called FuzzyTable.</p>
<p>FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built over Hadoop. It can use any matching algorithm that can compare two often high-dimensional items and return a similarity score. This makes it suitable not only for comparing fingerprints but other biometric modalities, images, audio, and anything that can be represented as a vector of features.</p>
<p>Fuzzy matching involves extracting features from the data of interest, and running a given fuzzy matching,&#160; distance, or similarity algorithm over the resulting feature vectors to produce the numerical score.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/1.png"><img class="aligncenter size-full wp-image-5272" title="1" src="http://www.cloudera.com/wp-content/uploads/2010/11/1.png" alt="" width="1200" /></a></p>
<p>Our work involved developing two major components &#8211; a clustering process that we use to reduce the total search space for each query, and a client-server system for performing fuzzy matching on demand and in parallel across a Hadoop instance.</p>
<p>In the first step, we use two clustering algorithms &#8211; canopy clustering and k-means &#8211; from the Apache Mahout project to assign each biometric into a bin.&#160; Each bin contains biometrics that are statistically similar. Each bin also has a &#8216;mean biometric&#8217; that represents an average of the biometrics contained in that bin.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/2.png"><img class="aligncenter size-full wp-image-5274" title="2" src="http://www.cloudera.com/wp-content/uploads/2010/11/2.png" alt="" width="670" height="578" /></a></p>
<p>When performing queries looking for the best match we first find the bin that a given biometric scores closest to by comparing it to the list of &#8216;mean biometrics&#8217; from our k-means processing. This allows us to avoid searching a large portion of the data set and only search the bin (or small number of bins) that contains the most similar items to the biometric in question.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/3.png"><img class="size-full wp-image-5275 aligncenter" title="3" src="http://www.cloudera.com/wp-content/uploads/2010/11/3.png" alt="" width="960" height="381" /></a></p>
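<p>The bin lookup can be sketched in a few lines of Python. This is not FuzzyTable code: the function names, the bin ids, and the use of Euclidean distance are assumptions made for illustration, since FuzzyTable can plug in any matching algorithm that returns a similarity score.</p>

```python
import math


def euclidean(u, v):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_bins(query, mean_biometrics, how_many=1):
    """Return the ids of the bins whose mean vectors lie nearest the query."""
    ranked = sorted(
        mean_biometrics,
        key=lambda bin_id: euclidean(query, mean_biometrics[bin_id]),
    )
    return ranked[:how_many]


# Hypothetical 'mean biometrics' produced by the clustering step.
means = {"bin-0": [0.0, 0.0], "bin-1": [5.0, 5.0], "bin-2": [9.0, 1.0]}
best = closest_bins([4.0, 6.0], means)
best_two = closest_bins([4.0, 6.0], means, how_many=2)
```

<p>Searching more than one bin (<code>how_many</code> greater than 1) trades a little extra work for a lower chance of missing a match that sits near a bin boundary.</p>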
<p>The FuzzyTable client then looks up the location of all blocks or chunks that contain biometrics for that bin by querying the FuzzyTable master server. Blocks are not replicas; rather, they contain different sets of data (although using replication to increase the number of concurrent queries is possible). These blocks live on several HDFS DataNodes so that a single bin can be searched by several machines at once.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/4.png"><img class="size-full wp-image-5276 aligncenter" title="4" src="http://www.cloudera.com/wp-content/uploads/2010/11/4.png" alt="" width="1145" height="362" /></a></p>
<p>Finally, the FuzzyTable client submits the biometric query and the ID of the closest bin to FuzzyTable query servers running alongside HDFS DataNodes. These servers sift through the blocks of the closest bin, comparing the query biometric to each stored biometric, and return the most similar matches. The client collects all the matches from the FuzzyTable query servers and displays a ranked list of matches to the user.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/5.png"><img class="size-full wp-image-5277 aligncenter" title="5" src="http://www.cloudera.com/wp-content/uploads/2010/11/5.png" alt="" width="927" height="870" /></a></p>
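<p>The fan-out and merge step might look roughly like the following single-process sketch. Threads stand in for the separate query servers, and every name here (<code>similarity</code>, <code>scan_block</code>, <code>fuzzy_query</code>) as well as the negated-distance score is hypothetical, not part of FuzzyTable.</p>

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def similarity(u, v):
    """Hypothetical score: higher means more similar (negated distance)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def scan_block(query, block, top_k):
    """What one query server does: score every record in its local block."""
    scored = (
        (similarity(query, features), record_id)
        for record_id, features in block
    )
    return heapq.nlargest(top_k, scored)


def fuzzy_query(query, blocks, top_k=3):
    """What the client does: fan the query out, then merge partial results."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda block: scan_block(query, block, top_k), blocks)
        merged = [hit for hits in partial for hit in hits]
    return [record_id for _, record_id in heapq.nlargest(top_k, merged)]


# Two blocks of the closest bin, each holding (record id, feature vector).
blocks = [
    [("p1", [1.0, 1.0]), ("p2", [8.0, 8.0])],
    [("p3", [1.5, 1.0]), ("p4", [4.0, 4.0])],
]
matches = fuzzy_query([1.0, 1.0], blocks, top_k=2)
```

<p>Each server returns only its own best matches, so the client merges a handful of candidates per block instead of the whole bin.</p>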
<p>All this takes place in a few seconds. Because Hadoop can distribute our data automatically across multiple machines we only had to write the code to run comparisons on local data. HDFS exposes the location of data blocks making it possible to run the comparisons only on local data, which is also the key to the speed of MapReduce.</p>
<p>The following graph shows how query times (in milliseconds) improved dramatically as we added machines. After about seven machines, our query times became dominated by fixed costs that did not increase with the number of machines. Our test cluster could easily store several times more than our test data before we would see query times increase again.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/6.png"><img class="size-full wp-image-5278 aligncenter" title="6" src="http://www.cloudera.com/wp-content/uploads/2010/11/6.png" alt="" width="1200" /></a></p>
<p>Large-scale data stack components such as Hadoop and Mahout proved to be a perfect platform on which to create this new capability. Undoubtedly the trend of innovation will continue as we further explore the possibilities of these new reliable and easy-to-use software platforms that achieve scalability through parallel computation over distributed shared-nothing systems.</p>
<p>Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a century. Providing a broad range of services in strategy and organization, technology, operations, and analytics, Booz Allen is committed to delivering results that endure. To learn more, visit www.boozallen.com.</p>
<p>Aaron Cordova is a data analysis consultant at Booz Allen Hamilton, who specializes in large-scale distributed data processing systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/11/tackling-large-scale-data-in-government/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Afternoon Hadoop World &#8212; Possible Path Through Great Content</title>
		<link>http://www.cloudera.com/blog/2010/10/afternoon-hadoop-world-possible-path-through-great-content/</link>
		<comments>http://www.cloudera.com/blog/2010/10/afternoon-hadoop-world-possible-path-through-great-content/#comments</comments>
		<pubDate>Fri, 08 Oct 2010 15:27:12 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdlouderaenterprise]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopworld]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[sift]]></category>
		<category><![CDATA[sra]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5076</guid>
		<description><![CDATA[It&#8217;s been repeated over and over again that Hadoop World is packed with great content, and I will again reaffirm this fact. Take a glance at the agenda to see for yourself all the presentations you surely will not want to miss. This post will take you on a stroll down a possible Hadoop World [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been repeated over and over again that <a href="http://bit.ly/btowhQ">Hadoop World</a> is packed with great content, and I will again reaffirm this fact. Take a glance at the <a href="http://bit.ly/9Mp3kT">agenda</a> to see for yourself all the presentations you surely will not want to miss. This post will take you on a stroll down a possible Hadoop World afternoon breakout session path.</p>
<p><a href="http://bit.ly/btowhQ"><span style="color: #333333;">SRA, International Inc. is presenting &#8220;SIFTing Clouds&#8221; at 1:45pm with Paul Burkhardt</span></a><span style="color: #333333;">.</span> He will describe SRA&#8217;s MapReduce implementations of the Scale-Invariant Feature Transform (SIFT) algorithm, a well-known computer vision algorithm used for object recognition. The SIFT MapReduce application enables fast object identification in distributed image datasets.</p>
<p><a href="http://bit.ly/btowhQ"><span style="color: #333333;"><span style="color: #333333;">Next, at 2:20pm Kevin Weil will be presenting &#8220;The Hadoop Ecosystem at Twitter.</span>&#8221;</span></a> This presentation will dive into how Twitter uses applications such as Pig, HBase, and Hive on top of Hadoop to solve critical business and engineering problems.</p>
<p>Then, at 2:55pm Charles Zedlewski of Cloudera is giving a presentation highlighting updates to <a href="http://bit.ly/9NmQc1"><span style="color: #333333;">Cloudera&#8217;s Distribution for Hadoop (CDH) and to Cloudera Enterprise</span></a><span style="color: #333333;">.</span> This presentation is titled &#8220;Cloudera Roadmap Review&#8221; and will also include valuable insights into development plans for the next 12 months.</p>
<p><a href="http://bit.ly/btowhQ"><img style="float: left; margin-right: 10px; margin-top: 5px;" title="hw_white" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></a>For those who will be attending, I strongly suggest you map out your Hadoop World path early using the <a href="http://bit.ly/9Mp3kT">agenda</a> and <a href="http://bit.ly/cHkRI2">program guide</a>. If you wait until the last minute you will surely find yourself caught between breakout sessions trying to decide which to attend.</p>
<p>There are over 800 people registered for the conference and space is filling fast. If you still need to register, you can do so by clicking <a href="http://bit.ly/91G6B0">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/10/afternoon-hadoop-world-possible-path-through-great-content/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Hadoop for Fraud Detection and Prevention</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/#comments</comments>
		<pubDate>Wed, 25 Aug 2010 05:27:20 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[fraud]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4478</guid>
		<description><![CDATA[Learn about fraud and how to prevent it with Hadoop]]></description>
			<content:encoded><![CDATA[<p>Fraud has multiple meanings and the term can be easily abused.&#160; The definition of fraud has undergone multiple changes throughout the years and is as elusive as fraud itself.&#160; The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state or country.&#160; For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.&#160; A more general definition may contain up to <a href="http://en.wikipedia.org/wiki/Fraud#Elements_of_fraud">9 elements</a>.</p>
<p>
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.</p>
<p>
Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding), is intentional (there is no &#8220;accidental&#8221; fraud), and causes significant damage to the defrauded party (otherwise why bother).&#160; Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, so it is difficult to build a predictive model, and (b) fraud assumes a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.</p>
<p><h3>Fraud and Rare Events</h3>
<p>By definition, fraud is an unexpected or rare event with significant financial or other damage.&#160; Fraud assumes that the fraudster has some prior information about how the current system works, including previous successful and unsuccessful fraud cases and possibly the fraud detection mechanisms.&#160; This breaks a standard statistical modeling assumption, the variable independence or i.i.d. assumption, which makes building a reliable statistical model difficult.&#160; Often the fraudster is working in the same industry that the fraud detection is supposed to protect, is intimately familiar with the fraud detection methods, and is actively trying to avoid detection by masquerading.</p>
<p>
The rare event detection problem also applies to online advertising and marketing, particularly to predicting &#8220;long tail&#8221; events, and to terrorism detection.</p>
<p>
One common example of fraud is associated with the <a href="http://en.wikipedia.org/wiki/Taleb_distribution" target="_blank">Taleb distribution</a>, where a seemingly high probability of a small gain overshadows a small probability of a large loss that more than outweighs the gains.&#160; Relatively long periods of slightly better than moderate gains are interrupted by a rare event of large losses.&#160; It is easy to defraud investors by presenting the results of partial analysis excluding the &#8220;rare events&#8221;.</p>
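<p>A short worked example shows how excluding the rare event flips the picture. The probabilities and payoffs below are made-up numbers chosen only to illustrate the arithmetic, and the function names are likewise hypothetical.</p>

```python
# (probability, payoff) pairs for one period; illustrative numbers only.
outcomes = [(0.99, 1.0), (0.01, -150.0)]


def expected_value(pairs):
    """Probability-weighted average payoff over all outcomes."""
    return sum(p * payoff for p, payoff in pairs)


def expected_value_without_losses(pairs):
    """The misleading 'partial analysis': drop the rare losing outcomes
    and renormalize the remaining probabilities."""
    gains = [(p, payoff) for p, payoff in pairs if payoff > 0]
    kept_probability = sum(p for p, _ in gains)
    return sum(p * payoff for p, payoff in gains) / kept_probability


full_picture = expected_value(outcomes)                    # 0.99 - 1.50
partial_picture = expected_value_without_losses(outcomes)  # 0.99 / 0.99
```

<p>With these numbers the full expected value is negative even though 99 periods out of 100 show a gain, while the partial analysis reports a steady gain of 1 per period.</p>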
<p><h3>Fraud Prevention</h3>
<p>Since fraud is so hard to prove in court, most organizations and individuals try to prevent fraud from happening by blanket measures.&#160; This includes limiting the amount of damage the fraudster can inflict on the organization as well as early detection of fraud patterns.&#160; For example, credit card companies can cut the credit card limit across the board in anticipation of a few fraud cases.&#160; Advertisers can prevent advertising campaigns with a low number of qualifying events.&#160; And anti-terrorism agencies can prevent people with bottles of pure water from boarding planes.&#160; These actions are often at odds with the company&#8217;s efforts to attract more customers and result in general dissatisfaction.&#160; To the rescue come technologies like Hadoop, Influence Diagrams, and Bayesian Networks, which are computationally expensive (inference in these models is NP-hard in computer science terminology) but are more accurate and predictive.</p>
<p><h3>Why Hadoop?</h3>
<p>Hadoop is a distributed system for processing large amounts of data.&#160; At the recent Hadoop Summit 2010, Yahoo, Facebook, and other companies announced that they currently process a few TBs of data per day and that the volumes are <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_omalley.html" target="_blank">growing at exponential rates</a>.&#160; Hadoop can be vital for solving the fraud detection problem because:</p>
<ol>
<li>Sampling does not work for rare events since the chance of missing a positive fraud case leads to significant deterioration of model quality.</li>
<li>Hadoop can solve much harder problems by leveraging multiple cores across thousands of machines and search through much larger problem domains.</li>
<li>Hadoop can be combined with other tools to manage moderate to low response latency requirements.</li>
</ol>
<p>
Let&#8217;s go through these reasons one by one.&#160; Sampling is a common technique for modeling rare events.&#160; One of the problems with sampling is that we cannot afford to throw away rare positive cases.&#160; Even in a stratified or proportional sampling scheme one has to retain all positive cases since the model accuracy heavily depends on them (one can usually discard some negative cases though).&#160; Given the above, the system still has to go through the whole dataset to sieve through the positive and negative cases.</p>
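<p>The sampling constraint can be sketched as follows. The record layout (a dictionary with an <code>is_fraud</code> flag) and the function name are assumptions for the example. Note that the loop still touches every record, which is exactly why the full dataset has to be scanned.</p>

```python
import random


def downsample_negatives(records, keep_fraction, seed=0):
    """Keep every positive (fraud) record and a random share of the rest."""
    rng = random.Random(seed)
    sample = []
    for record in records:  # a full pass over the data is still required
        if record["is_fraud"] or rng.random() < keep_fraction:
            sample.append(record)
    return sample


# Toy dataset: 1,000 events, of which 5 are fraud.
records = [{"id": i, "is_fraud": i % 200 == 0} for i in range(1000)]
sample = downsample_negatives(records, keep_fraction=0.1)
fraud_in_sample = sum(1 for record in sample if record["is_fraud"])
```

<p>A model trained on such a sample usually has to correct for the different sampling rates of the two classes, for example by weighting the retained negative cases.</p>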
<p>
Hadoop is known for its number-crunching power.&#160; Nothing can compare with the throughput of thousands of machines, each of which has multiple cores.&#160; As was reported recently at the Hadoop Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8 to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the same time.&#160; This allows either (a) looking through larger periods of time to incorporate events across a larger time frame or (b) taking more sources of information into account. &#160;It is quite common among social network companies to comb through Twitter and blog posts in search of relevant data.</p>
<p>
Finally, one of the fraud prevention problems is latency.&#160; The agencies want to react to an event as soon as possible, often within a few minutes of the event.&#160; Yahoo recently reported that it can adjust its behavioral model in response to a user click event within 5-7 minutes across several hundred million customers and billions of events per day.&#160; Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds so that they can be analyzed using MapReduce.</p>
<p>
Often fraud detection is akin to &#8220;finding a needle in a haystack&#8221;.&#160; One has to go through mountains of relevant and seemingly irrelevant information, build dependency models, evaluate the impact, and thwart the fraudster&#8217;s actions.&#160; Hadoop helps with finding patterns by processing mountains of information on thousands of cores in a relatively short amount of time.</p>
<p><h3>Where to look next?</h3>
<p>Techniques for fraud detection are industry-specific as a rule and are often kept confidential, since they obviously represent valuable information for potential fraudsters.&#160; Moreover, fraud detection techniques are usually a moving target, since fraudsters quickly adjust to new fraud detection mechanisms.</p>
<p>
One of the most publicized technical frauds is click fraud in on-line advertising.&#160; Advertisers are often charged on a per-click basis (so-called PPC campaigns), so the traffic provider, such as a search web site, has a clear incentive to inflate the number of clicks.&#160; (Advertisers can also be charged on a per-conversion basis, which we will cover shortly, but a different type of fraud emerges there, in which the advertiser tries to conceal the conversions.)&#160; Additionally, an advertiser&#8217;s competitor may have an incentive to inflate the number to skew the original advertiser&#8217;s margin.&#160; This can be achieved by a human or software agent that generates extra traffic and clicks on the competitor&#8217;s site.&#160; Fraud management companies like <a href="http://www.fraudwall.com/" target="_blank">Anchor Intelligence</a> and <a href="http://www.clickforensics.com/" target="_blank">Click Forensics</a> estimate that approximately 20% to 30% of all clicks are fraudulent.&#160; How do we know that a click is fraudulent?</p>
<p>
Decline in the number of conversions &#8212; first and most important, if your return on an ad is normally positive (that is, you are making a profit on it) and it suddenly turns negative, this could be a sign of click fraud in action.&#160; Click fraud causes extra clicks on your ad with no actual purchases, and your conversion rate will fall accordingly.</p>
<p>
An abnormal number of clicks from the same IP address or a pattern in the access times &#8212; although this is the most obvious and easily identified form of click fraud, it is amazing how many fraudsters still use this method, particularly for quick attacks.&#160; They may choose to strike over a long weekend when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted.&#160; Some of these clicks may be unintentional, for example when a user reloads a page.</p>
<p>
Large &#8220;abandonment rate&#8221;, or numbers of visitors who leave your site quickly &#8212; another indication of click fraud can be a pattern of visitors clicking on your ad, spending the minimum amount of time on your site required by your PPC search engine to establish it as a valid click (usually 30 seconds or more), and then leaving without having left the landing page at all.</p>
<p>
A large number of impressions without the follow-through clicks on your ad &#8212; if you notice a lot more impressions (views) than usual without a matching increase in clicks, this could indicate the impression fraud we discussed earlier.&#160; Artificial inflation of your ad impressions may cause your clickthrough rates to drop below the Google minimum, and your ad will be disabled.&#160; Until you realize this, your competitors have free rein to use your keywords, sometimes at bargain prices.&#160; In addition, your relevancy ratings for search engines may drop as they record numerous impressions but no interest shown via visits to other parts of your website, which could lead to a shutdown of your campaign.</p>
<p>
Abnormally high clicks and impressions on affiliate websites &#8212; although affiliates are sometimes involved in conducting click fraud schemes, they can also be victims of click fraud.&#160; If one of their competitors directs excessive clicks and impressions at an affiliate&#8217;s site, the PPC search engine will soon notice an abnormally high payment to that affiliate and may go as far as canceling the affiliate&#8217;s account, even though he or she was not engaging in any form of click fraud.</p>
<p>
A large number of clicks coming from countries outside of your normal market area &#8212; using IP geo-location services, you can identify the country in which an IP address is most likely located.</p>
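<p>As a rough illustration of two of the indicators above (repeated clicks from one IP address and a high abandonment rate), here is a minimal Python sketch.&#160; The log layout, the sample values, the function names, and the threshold are all assumptions made for this example; they do not come from any PPC provider, and a real threshold would have to be tuned against your own traffic.</p>
<pre>
```python
from collections import Counter

# Hypothetical click log: (ip_address, seconds_on_site, pages_viewed).
# The field layout and the sample values are assumptions for illustration.
CLICKS = [
    ("10.0.0.1", 31, 1),
    ("10.0.0.1", 30, 1),
    ("10.0.0.1", 32, 1),
    ("10.0.0.1", 30, 1),
    ("10.0.0.2", 240, 5),
    ("10.0.0.3", 95, 3),
]


def suspicious_ips(clicks, max_clicks_per_ip=3):
    """Return IPs whose click count exceeds an assumed threshold."""
    counts = Counter(ip for ip, _seconds, _pages in clicks)
    return sorted(ip for ip, n in counts.items() if n > max_clicks_per_ip)


def abandonment_rate(clicks):
    """Share of visits that never left the landing page (one page viewed)."""
    if not clicks:
        return 0.0
    bounced = sum(1 for _ip, _seconds, pages in clicks if pages == 1)
    return bounced / len(clicks)


flagged = suspicious_ips(CLICKS)
rate = abandonment_rate(CLICKS)
print(flagged, round(rate, 2))
```
</pre>
<p>Neither number proves fraud on its own (a reloading user or a poor landing page can produce the same signal), so a check like this is best treated as a trigger for a closer look at the log files.</p>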
<p>
In the case of performance-based advertising, the advertiser is interested in concealing some of the traffic, not inflating it.&#160; Since most performance-based measurement relies on beacons or pixels placed on the advertiser&#8217;s conversion page, the advertiser has an incentive to (temporarily) block the traffic from the beacon or to remove the beacon from the web site completely.</p>
<p>
Fraud is prevalent in the telecom industry.&#160; One of the leading commercially available fraud detection products is the <a href="http://h20208.www2.hp.com/cms/solutions/ci-b/cv/frm.jsp" target="_blank">HP FMS system</a>, on which the author had the pleasure of working personally.&#160; The types of telecom fraud include:</p>
<p>
Subscription fraud &#8212; involves the acquisition of telecommunications services using stolen or false credentials and/or identity with no intention of paying. With subscription fraud, not only do service providers lose revenue, but also individual consumers are vulnerable to having their identity stolen and credit rating tarnished.</p>
<p>
Technical/network fraud &#8212; occurs when someone uses equipment or technology to gain access to a service without paying. Fraudulent calls are typically billed to the legitimate owner of the line or service.&#160; Wireless examples include cloning of cell phones or subscriber identity module (SIM) cards. Fixed-line examples include clip-on or line tapping, private branch exchange (PBX) hacking, and calling card fraud. Prepaid services also have a large exposure to fraud through terminal tampering via magnetic strips or SIM chips, or recharging with stolen credit card numbers.</p>
<p>
Insider fraud &#8212; occurs when individuals inside the operator provide fraudulent access to networks or otherwise thwart the ability of the operator to be paid for services used.</p>
<p>
Handset abuse &#8212; is what takes place when stolen or lost handsets are used to consume telecommunications services that are in turn paid for by the service provider.&#160; This is an expensive liability for carriers who absorb the costs.</p>
<p>
Social engineering &#8212; is an effective fraud technique in which people unwittingly help perpetrators by providing sensitive data, illicit access or simply forwarding their calls without ever knowing they have done anything wrong.</p>
<p>
All of these patterns can be detected with MapReduce pattern detection techniques applied to the collected logs.&#160; Where detection has to happen quickly, Flume offers low-latency stream processing capabilities.</p>
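<p>To give a feel for the map, group, and reduce shape such a job might take, the sketch below flags subscribers whose calls overlap in time, which is one possible sign of a cloned SIM.&#160; It is plain Python rather than a Hadoop job, and the record layout, the function names, and the overlap heuristic itself are assumptions made for this example, not a description of how any particular fraud management product works.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical call detail records: (subscriber_id, start_second, end_second).
CALLS = [
    ("sub-1", 0, 120),
    ("sub-1", 60, 200),   # overlaps the previous call: possible cloned SIM
    ("sub-2", 0, 50),
    ("sub-2", 100, 150),
]


def map_phase(record):
    """Map: emit (subscriber_id, (start, end)) for each call record."""
    subscriber, start, end = record
    yield subscriber, (start, end)


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(subscriber, intervals):
    """Reduce: emit the subscriber if any two of its calls overlap in time."""
    ordered = sorted(intervals)
    # After sorting by start time, any overlap shows up in a neighbouring pair.
    for (_start_a, end_a), (start_b, _end_b) in zip(ordered, ordered[1:]):
        if end_a > start_b:
            yield subscriber
            return


mapped = (pair for record in CALLS for pair in map_phase(record))
flagged = [
    subscriber
    for key, intervals in sorted(shuffle(mapped).items())
    for subscriber in reduce_phase(key, intervals)
]
print(flagged)
```
</pre>
<p>Keying on the subscriber means each reducer sees all calls for one account, so the same skeleton could carry other per-account checks by swapping the reduce function.</p>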
<p>
Needless to say, fraudsters also explore the potential market and invent new ways to commit fraud.&#160; One of them is used by <a href="http://www.clickmonkeys.com/about" target="_blank">Click Monkeys</a>, which deploys a vessel with animals off the coast of California to generate seemingly random traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

