<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; HDFS</title>
	<atom:link href="http://www.cloudera.com/blog/category/hdfs/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Meet the Presenters: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks</title>
		<link>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/</link>
		<comments>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/#comments</comments>
		<pubDate>Tue, 08 May 2012 01:05:24 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[HA HDFS]]></category>
		<category><![CDATA[Hadoop Distributed File System]]></category>
		<category><![CDATA[HDFS NameNode]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14766</guid>
		<description><![CDATA[This was originally posted on the Hadoop Summit 2012 blog. Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on HDFS NameNode High Availability, one of the hottest topics in the Apache Hadoop space today. Question: Tell us about your current role and [...]]]></description>
			<content:encoded><![CDATA[<p><em>This was originally posted on the Hadoop Summit 2012 <a href="http://hadoopsummit.org/blog/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/" target="_blank">blog</a></em>.</p>
<p>Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session53" target="_blank">HDFS NameNode High Availability</a>, one of the hottest topics in the Apache Hadoop space today.</p>
<h2>Question: Tell us about your current role and how you interact with Apache Hadoop?</h2>
<p><strong>Aaron: </strong>I work full-time developing Hadoop and supporting Hadoop’s many users. My efforts are primarily focused on HDFS and Hadoop’s security infrastructure.</p>
<p><strong>Suresh: </strong>I have been working on Hadoop for about 4 years. Currently I am on HDFS full-time, with focus on improving reliability, scalability and developing enterprise features. I also work on expanding Apache Hadoop APIs and interfaces to enable new use cases and simplify integration of other solutions with HDFS.</p>
<h2>Question: Tell us about your Hadoop Summit presentation?</h2>
<p><strong>Suresh:</strong> The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo! and other organizations. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (<a title="Apache Hadoop HDFS" href="https://issues.apache.org/jira/browse/HDFS-1623" target="_blank">HDFS-1623</a>). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.</p>
<h2>Question: What do you expect will be the key takeaway for folks attending your session? </h2>
<p><strong>Aaron:</strong> Because we will be sharing best practices and architectural details, we expect attendees to walk away with a good understanding of what’s required to deploy and operate a highly available HDFS NameNode.</p>
<h2>Question: What are you most looking forward to at Hadoop Summit?</h2>
<p><strong>Aaron: </strong>Chatting in-person with the Hadoop developers and other community members who I interact with frequently, but don’t get to see often.</p>
<p><strong>Suresh:</strong> Interacting with the community and learning from their Hadoop experiences. I’m also interested in getting feedback on things we can improve and important new features desired by the community.</p>
<h2>Question: What other presentations are you most looking forward to attending?</h2>
<p><strong>Aaron:</strong></p>
<ul>
<li> <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session29" target="_blank">I accidentally the Namenode: Hadoop Distributed Filesystem Reliability and Durability at Facebook</a> by Andrew Ryan</li>
<li><a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session32" target="_blank">Optimizing MapReduce Job Performance</a> by Todd Lipcon</li>
</ul>
<p><strong> Suresh:</strong></p>
<ul>
<li>Like Aaron,  <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session29" target="_blank">I accidentally the Namenode: Hadoop Distributed Filesystem Reliability and Durability at Facebook</a> by Andrew Ryan</li>
<li><a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session36" target="_blank">Apache Hadoop and Virtual Machines</a> by Richard McDougall and Sanjay Radia</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>High Availability for the Hadoop Distributed File System (HDFS)</title>
		<link>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/</link>
		<comments>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 13:00:59 +0000</pubDate>
		<dc:creator>Aaron Myers</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hadoop high availability]]></category>
		<category><![CDATA[hdfs high availability]]></category>
		<category><![CDATA[hdfs name node]]></category>
		<category><![CDATA[high availability]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13415</guid>
		<description><![CDATA[Background Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS. HDFS has long been considered [...]]]></description>
			<content:encoded><![CDATA[<h2 style="font-size: 14pt;">Background</h2>
<p>Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.</p>
<p>HDFS has long been considered a highly <em>reliable</em> file system.  An empirical <a href="http://www.youtube.com/watch?v=zbycDpVWhp0">study done at Yahoo!</a> concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.</p>
<p>Despite this very high level of reliability, HDFS has always had a well-known single point of failure which impacts HDFS’s <em>availability</em>: the system relies on a single Name Node to coordinate access to the file system data. In clusters which are used exclusively for ETL or batch-processing workflows, a brief HDFS outage may not have immediate business impact on an organization; however, in the past few years we have seen HDFS begin to be used for more interactive workloads or, in the case of HBase, used to directly serve customer requests in real time. In cases such as this, an HDFS outage will immediately impact the productivity of internal users, and perhaps result in downtime visible to external users. For these reasons, adding high availability (HA) to the HDFS Name Node became one of the top priorities for the HDFS community.</p>
<p>The remainder of this post discusses the implementation of a new feature for HDFS, called the “HA Name Node.” For a detailed discussion of other issues surrounding the availability of Hadoop as a whole, take a look at this <a href="http://www.cloudera.com/blog/2011/02/hadoop-availability/">excellent blog post</a> by my colleague Eli Collins.</p>
<h2 style="font-size: 14pt;">High-level Architecture</h2>
<p>The goal of the HA Name Node project is to add support for deploying two Name Nodes in an active/passive configuration. This is a common configuration for highly-available distributed systems, and HDFS’s architecture lends itself well to this design. Even in a non-HA configuration, HDFS already requires both a Name Node and another node with similar hardware specs which performs checkpointing operations for the Name Node. The design of the HA Name Node is such that the passive Name Node is capable of performing this checkpointing role, thus requiring no additional Hadoop server machines beyond what HDFS already requires.<img src="http://www.cloudera.com/wp-content/uploads/2012/03/HANNdiagram-2.png" alt="Hadoop Distributed File System High Available Name Node" /></p>
<p>The HDFS Name Node is primarily responsible for serving two types of file system metadata: file system namespace information and block locations. Because of the architecture of HDFS, these must be handled separately.</p>
<p><strong>Namespace Information</strong></p>
<p>All mutations to the file system namespace, such as file renames, permission changes, file creations, block allocations, etc, are written to a persistent write-ahead log by the Name Node before returning success to a client call. In addition to this edit log, periodic checkpoints of the file system, called the fsimage, are also created and stored on-disk on the Name Node. Block locations, on the other hand, are stored only in memory. The locations of all blocks are received via “block reports” sent from the Data Nodes when the Name Node is started.</p>
<p>The goal of the HA Name Node is to provide a <em>hot standby</em> Name Node that can take over serving the role of the active Name Node with no downtime. To provide this capability, it is critical that the standby Name Node has the most complete and up-to-date file system state possible in memory. Empirically, starting a Name Node from cold state can take tens of minutes to load the namespace information (fsimage and edit log) from disk, and up to an hour to receive the necessary block reports from all Data Nodes in a large cluster.</p>
<p>The Name Node has long supported the ability to write its edit logs to multiple, redundant local directories. To address the issue of sharing state between the active and standby Name Nodes, the HA Name Node feature allows for the configuration of a special shared edits directory. This directory should be available via a network file system, and should be read/write accessible from both Name Nodes. This directory is treated as being <em>required</em> by the active Name Node, meaning that success will not be returned to a client call unless the file system change has been written to the edit log in this directory. The standby Name Node polls the shared edits directory frequently, looking for new edits written by the active Name Node, and reads these edits into its own in-memory view of the file system state.</p>
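<p>The tailing behaviour described above can be sketched in a few lines of plain Java. This is a minimal in-memory model, not the HDFS implementation: <code>SharedEditLog</code>, <code>StandbyNamespace</code>, the textual edit format and the 1-based transaction IDs are all hypothetical names and choices introduced for illustration, and a list stands in for the shared edits directory.</p>

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the shared edits directory: an append-only list of edits.
class SharedEditLog {
    private final List<String> edits = new ArrayList<>();

    // The active Name Node appends an edit before acknowledging the client call.
    synchronized long append(String edit) {
        edits.add(edit);
        return edits.size(); // transaction ID of the edit just written (1-based)
    }

    // Return every edit whose transaction ID is greater than lastAppliedTxId.
    synchronized List<String> readAfter(long lastAppliedTxId) {
        return new ArrayList<>(edits.subList((int) lastAppliedTxId, edits.size()));
    }
}

// Hypothetical standby: keeps an in-memory namespace by replaying new edits on each poll.
class StandbyNamespace {
    private final Set<String> paths = new HashSet<>();
    private long lastAppliedTxId = 0;

    // One polling pass: apply the edits written since the previous pass.
    int poll(SharedEditLog log) {
        List<String> fresh = log.readAfter(lastAppliedTxId);
        for (String edit : fresh) {
            String[] parts = edit.split(" ", 2); // e.g. "CREATE /a" or "DELETE /a"
            if (parts[0].equals("CREATE")) {
                paths.add(parts[1]);
            } else if (parts[0].equals("DELETE")) {
                paths.remove(parts[1]);
            }
            lastAppliedTxId++;
        }
        return fresh.size();
    }

    boolean exists(String path) { return paths.contains(path); }

    long lastAppliedTxId() { return lastAppliedTxId; }
}
```

<p>Tracking the last applied transaction ID is what lets each polling pass read only the new edits instead of replaying the whole log.</p>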
<p>Note that requiring a single shared edits directory does not necessarily imply a new single point of failure. It does, however, mean that the filer providing this shared directory must itself be HA, and that multiple network routes should be configured between the Name Nodes and the service providing this shared directory. Plans to improve this situation are discussed further below.</p>
<p><strong>Block Locations</strong></p>
<p>The other part of keeping the standby Name Node hot is making sure that it has up-to-date block location information. Since block locations aren’t written to the Name Node edit log, reading from the shared edits directory is not sufficient to share this file system metadata between the two Name Nodes. To address this issue, when HA is enabled, all Data Nodes in the cluster are configured with the network addresses of both Name Nodes. Data Nodes send all block reports, block location updates, and heartbeats to both Name Nodes, but Data Nodes will only act on block commands issued by the currently-active Name Node.</p>
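<p>As a rough illustration of this rule, the sketch below models a Data Node that reports to every configured Name Node but executes commands only from the active one. The class names are hypothetical, and the <code>active</code> flag is a simplification: how a real Data Node learns which Name Node is active is not modelled here.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Name Node stub: records the block reports it receives.
class NameNodeStub {
    final String name;
    boolean active;
    final List<String> receivedReports = new ArrayList<>();

    NameNodeStub(String name, boolean active) {
        this.name = name;
        this.active = active;
    }
}

// Hypothetical Data Node stub configured with the addresses of both Name Nodes.
class DataNodeStub {
    private final List<NameNodeStub> nameNodes;
    final List<String> executedCommands = new ArrayList<>();

    DataNodeStub(List<NameNodeStub> nameNodes) {
        this.nameNodes = nameNodes;
    }

    // Block reports go to every configured Name Node so the standby stays hot.
    void sendBlockReport(String report) {
        for (NameNodeStub nameNode : nameNodes) {
            nameNode.receivedReports.add(report);
        }
    }

    // Commands are executed only when they come from the currently active Name Node.
    boolean handleCommand(NameNodeStub sender, String command) {
        if (!sender.active) {
            return false; // ignore commands from the standby
        }
        executedCommands.add(command);
        return true;
    }
}
```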
<p>With both up-to-date namespace information and block locations in the standby Name Node, the system is able to perform a failover from the active Name Node to the standby with no delay.</p>
<p><strong>Client Failover</strong></p>
<p>Since multiple distinct daemons are now capable of serving as the active Name Node for a single cluster, the HDFS client must be able to determine which Name Node to communicate with at any given time. The HA Name Node feature does not support an active-active configuration, and thus all client calls must go to the active Name Node in order to be served.</p>
<p>To implement this feature, the HDFS client was extended to support the configuration of multiple network addresses, one for each Name Node, which collectively represent the HA name service. The name service is identified by a single <em>logical URI</em>, which is mapped to the two network addresses of the HA Name Nodes via client-side configuration. The HDFS client tries these addresses in order. If a client makes a call to the standby Name Node, a special result is returned, indicating that the client should retry elsewhere; the client then moves on through the configured addresses until it finds the active Name Node.</p>
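<p>The retry loop can be sketched as follows. This is an illustrative model rather than the actual HDFS client code: <code>NameNodeEndpoint</code>, <code>FailoverClient</code>, <code>FakeNameNode</code> and the locally defined <code>StandbyException</code> are hypothetical names, and the list of endpoints stands in for whatever the client-side configuration resolves the logical URI to.</p>

```java
import java.util.List;

// Hypothetical signal meaning "I am the standby; retry elsewhere".
class StandbyException extends Exception {
    StandbyException(String message) { super(message); }
}

// Hypothetical view of a Name Node as seen by a client; not the real HDFS RPC interface.
interface NameNodeEndpoint {
    String getFileInfo(String path) throws StandbyException;
}

class FailoverClient {
    // Addresses the logical URI maps to, in configured order.
    private final List<NameNodeEndpoint> endpoints;

    FailoverClient(List<NameNodeEndpoint> endpoints) {
        this.endpoints = endpoints;
    }

    // Try each configured address in order until one answers as the active Name Node.
    String getFileInfo(String path) {
        for (NameNodeEndpoint endpoint : endpoints) {
            try {
                return endpoint.getFileInfo(path);
            } catch (StandbyException e) {
                // This node is the standby: move on to the next address.
            }
        }
        throw new IllegalStateException("No active Name Node among configured addresses");
    }
}

// Simple fake endpoint used to exercise the client.
class FakeNameNode implements NameNodeEndpoint {
    private final String address;
    boolean active;

    FakeNameNode(String address, boolean active) {
        this.address = address;
        this.active = active;
    }

    public String getFileInfo(String path) throws StandbyException {
        if (!active) {
            throw new StandbyException(address + " is in standby state");
        }
        return path + " served by " + address;
    }
}
```

<p>Because callers only ever see the logical name, swapping which Name Node is active requires no change on the calling side.</p>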
<p>In the event that the active Name Node crashes while in the middle of processing a request, the client will be unable to determine whether or not the request was processed. For many operations such as reads (or <a href="http://en.wikipedia.org/wiki/Idempotent">idempotent</a> writes such as setting permissions, setting modification time, etc), this is not a problem &#8212; the client may simply retry after the failover has completed. For others, the error must be bubbled up to the caller to be correctly handled. In the course of the HA project, we extended the Hadoop IPC system to be able to classify each operation’s idempotence using special annotations.</p>
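<p>One way to picture annotation-based classification is the sketch below. It is not the Hadoop IPC code: the <code>Idempotent</code> annotation is defined locally for illustration, <code>ClientOperations</code> is a made-up slice of a protocol, and treating <code>delete</code> as non-idempotent is an assumption chosen because a retried delete would report a different result from the first attempt.</p>

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical marker: the operation may safely be re-sent after a failover.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Idempotent {}

// Hypothetical slice of a client protocol; the method set is illustrative only.
interface ClientOperations {
    @Idempotent
    void setPermission(String path, short mode);

    @Idempotent
    void setTimes(String path, long mtime, long atime);

    // Not annotated (assumption): a retry after a successful first attempt would return false.
    boolean delete(String path);
}

class RetryDecider {
    // True when the named operation is annotated as safe to retry after a failover.
    static boolean canRetryAfterFailover(Class<?> protocol, String methodName) {
        for (Method method : protocol.getMethods()) {
            if (method.getName().equals(methodName)) {
                return method.isAnnotationPresent(Idempotent.class);
            }
        }
        throw new IllegalArgumentException("Unknown operation: " + methodName);
    }
}
```

<p>A retry layer could consult such a check after a failover: annotated calls are re-sent transparently, while everything else surfaces the error to the caller.</p>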
<h2 style="font-size: 14pt;">Current Status</h2>
<p>Active development work began on the HA Name Node in August 2011, in a branch off of Apache Hadoop trunk. Development was done under the umbrella JIRAs <a href="https://issues.apache.org/jira/browse/HDFS-1623">HDFS-1623</a> and <a href="https://issues.apache.org/jira/browse/HADOOP-7454">HADOOP-7454</a>. Last Friday, March 2nd, 2012, we merged this branch back into Apache Hadoop trunk. We closed over 170 individual JIRAs in the course of implementing this feature. The stated intention of the community is to merge this work from HDFS trunk into the 0.23 branch, where it will be released as an update of the Apache Hadoop 0.23 release line. Much of this work is already available as part of <a href="http://www.cloudera.com/blog/2012/02/introducing-cdh4/">CDH4 beta 1, released on February 13th, 2012.</a></p>
<p>Once a failover has been initiated, the actual process of stopping the active and starting the standby Name Node takes a matter of seconds or less. This speed allows for little or no detectable service disruption during a failover. I’ve personally run hundreds of MR jobs over a running HA cluster, doing failovers back and forth between two HA Name Nodes, without any job failures.</p>
<p>This first implementation of the HA Name Node supports only manual failover &#8212; that is, failure of one of the Name Nodes is not automatically detected by the system, but rather requires intervention by an operator to initiate a failover between the Name Nodes. Though this is an obvious limitation, this version should still be useful to eliminate the need for planned HDFS downtime in many cases, e.g. changing the configuration of the Name Node, scheduled hardware maintenance of a Name Node, or scheduled OS upgrade of a Name Node.</p>
<h2 style="font-size: 14pt;">Next Up</h2>
<p>The highest priority feature to add to the HA Name Node implementation is support for automatically detecting the failure of the Active Name Node and initiating a failover to the Standby when it is determined that the Active is no longer functional. <a href="https://issues.apache.org/jira/browse/HDFS-3042">HDFS-3042</a> and its sub-tasks are actively being worked on to provide this functionality.</p>
<p>The dependence on an HA filer for HDFS edit logs is a limitation that we’d like to address in the near to medium term as well. Several different options have been discussed to address this:</p>
<ul>
<li><strong>BookKeeper</strong> &#8211; <a href="http://zookeeper.apache.org/doc/r3.2.2/bookkeeperStarted.html">BookKeeper</a> is a highly available write-ahead logging system. Work has already been done to allow the HDFS Name Node to write its edit log to BookKeeper, though this has not yet been tested with the HA Name Node.</li>
<li><strong>Multiple, non-HA filers</strong> &#8211; the HA Name Node presently only supports logging to a single shared edits directory. Perhaps the easiest improvement from the current situation would be to allow the Name Node to log to several shared edits directories, and require that all edits be logged to a quorum of shared edits directories. This proposal is being tracked by <a href="https://issues.apache.org/jira/browse/HDFS-2782">HDFS-2782</a>.</li>
<li><strong>Stream edits to remote NNs</strong> &#8211; in addition to writing edits to a local file system, edit log entries could be sent directly to other Name Nodes over the network. The active Name Node would require a quorum of the involved Name Nodes to acknowledge receipt of the edits before responding with success to the client call.</li>
<li><strong>Store edit logs in HDFS itself</strong> &#8211; systems such as HBase already use HDFS to store a write-ahead log of all data mutations. If HDFS were extended to have a modicum of bootstrapping information, it is not inconceivable that HDFS edit logs could be stored in HDFS itself. This proposal is being discussed on <a href="https://issues.apache.org/jira/browse/HDFS-2601">HDFS-2601</a>.</li>
</ul>
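<p>Two of the options above rely on a quorum of acknowledgements. The core rule can be sketched as below, assuming a hypothetical <code>Journal</code> interface that stands in for either a shared edits directory or a remote Name Node; <code>MajorityEditLogger</code> is likewise an illustrative name, not part of any proposal.</p>

```java
import java.util.List;

// Hypothetical journal abstraction: one shared edits directory or one remote Name Node.
interface Journal {
    // Returns true if the edit was durably written, false on failure.
    boolean write(String edit);
}

class MajorityEditLogger {
    private final List<Journal> journals;

    MajorityEditLogger(List<Journal> journals) {
        this.journals = journals;
    }

    // Write to every journal; report success only if a strict majority acknowledged.
    boolean logEdit(String edit) {
        int acks = 0;
        for (Journal journal : journals) {
            if (journal.write(edit)) {
                acks++;
            }
        }
        return acks > journals.size() / 2;
    }
}
```

<p>With three journals, one can fail without blocking client calls. The sketch deliberately leaves out the harder part of any real design: bringing a journal that missed edits back in sync.</p>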
<p>In the next few weeks, we will be evaluating all of these options and selecting one to implement.</p>
<p>Currently, deploying HA Name Nodes is somewhat cumbersome, requiring the operator to <a href="https://ccp.cloudera.com/display/CDH4B1/HDFS+High+Availability+Initial+Deployment">manually synchronize the on-disk metadata</a> of the two Name Nodes. <a href="https://issues.apache.org/jira/browse/HDFS-2731">HDFS-2731</a> aims to improve the user experience of this deployment process by having the second Name Node automatically synchronize itself with the state of the first Name Node. This feature will make the process faster and less error-prone.</p>
<h2 style="font-size: 14pt;">Further Reading</h2>
<p>Take a look at the <a href="https://ccp.cloudera.com/display/CDH4B1/CDH4+Beta+1+High+Availability+Guide">CDH4 docs</a> for detailed information on configuring the HA Name Node in CDH4.</p>
<p>Be on the lookout for an upcoming blog post from my colleague Todd Lipcon, which will go into greater detail about some of the specific challenges encountered while implementing the HA Name Node feature, and how these issues were overcome.</p>
<h2 style="font-size: 14pt;">Acknowledgments</h2>
<p>This work has been a community effort from the start, and represents the work of many contributors. Both the architecture and implementation were the collaborative effort of many. In particular, this work would not have been possible without contributions from Todd Lipcon, Eli Collins, Uma Maheswara Rao G, Bikas Saha, Suresh Srinivas, Jitendra Nath Pandey, Hari Mankude, Brandon Li, Sanjay Radia, Mingjie Lai, and Gregory Chanan. Also thanks to Dhruba Borthakur and Konstantin Shvachko for helpful design discussions and recommendations on testing. Thanks also to Stephen Chu, Wing Yew Poon, and Patrick Ramsey for their help in testing the HA Name Node.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Hadoop for Archiving Email &#8211; Part 2</title>
		<link>http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/</link>
		<comments>http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 13:00:04 +0000</pubDate>
		<dc:creator>Sunil Sitaula</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[hadoop use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9944</guid>
		<description><![CDATA[Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let&#8217;s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/" target="_blank">Part 1</a> of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let&#8217;s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.</p>
<p>Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:</p>
<h2>Apache Lucene and Apache Solr</h2>
<p>Apache Lucene is a mature, high-performance, full-featured Java API for indexing and searching that has been around since the late nineties. It supports field-specific indexing and searching, sorting, highlighting, and wildcard searches, to name only a few features. Everything in Lucene boils down to creating a document from artifacts such as email messages, HTML, PDF, XML, Word, Excel, etc., the contents of which are parsed and added to Lucene documents as name/value pairs. There are a number of libraries available for extracting the actual content, depending on what the artifact is. When extracting content from .msg email files, for instance, Tika and POI are useful libraries.</p>
<p>Once you have added name/value pairs from the email content to the document, the index portion is taken care of. We can then use IndexSearcher to search through the indexed contents as illustrated below, in Figure 1:</p>
<p style="text-align: center;"><img src="https://www.cloudera.com/wp-content/uploads/2011/12/1.-Indexing-and-Searching-using-Lucene.jpg" alt="" /></p>
<p style="text-align: center;"><strong>Figure 1: Indexing and Searching using Lucene.</strong></p>
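<p>To make the name/value model concrete, here is a toy inverted index in plain Java. It is emphatically not Lucene: <code>TinyIndex</code> and its methods are hypothetical, and lowercasing plus splitting on non-alphanumeric characters is only a crude stand-in for a real analyzer. It shows why adding fields to a document is enough to make field-specific search possible.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: maps "field:term" to the IDs of documents containing that term.
class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Add a document given as field name/value pairs; returns its document ID.
    int addDocument(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Crude analysis step: lowercase, then split on non-alphanumeric characters.
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    // Field-specific search: IDs of documents whose field contains the term.
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.emptySet());
    }

    // Retrieve a stored field value, e.g. for display in search results.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }
}
```

<p>The split between the postings map and the stored values mirrors the distinction between analyzing a field for search and storing it for display.</p>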
<p><strong>Apache Solr</strong>, on the other hand, is a Lucene-based full text search server with XML, JSON, and HTTP APIs, which has a web admin interface and provides extensive caching, replication, search distribution, as well as the ability to add customized plugins. Solr already includes various parsing libraries, including Tika, POI, and TagSoup, among others.</p>
<p>Figure 2 below illustrates the Solr components and deployment architecture:</p>
<p style="text-align: center;"><img src="https://www.cloudera.com/wp-content/uploads/2011/12/2.-Solr-Components-and-Deployment-Architecture.jpg" alt="" /></p>
<p style="text-align: center;"><strong>Figure 2: Solr Components and Deployment Architecture.</strong></p>
<p>Next, let&#8217;s explore how you can use both Solr and Lucene within the Hadoop environment for indexing and searching massive amounts of data:</p>
<p>First, you need to get data into HDFS, as covered in <a href="http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/" target="_blank"><em>Hadoop for Archiving Email &#8211; Part 1</em></a>. Once the data is there, you can run MapReduce to create indexes in parallel, which can then be written to HDFS or to a Local File System. If an index is stored within a Local File System, simply serve it from there by pointing Solr to it. Solr can be run in single, replicated or distributed mode, depending on the size of the index to serve. However, if you need to make search available to only a small number of users, you can simply store data directly in HDFS and provide an interface for your users to access it directly.</p>
<p>One tool that I find very handy for providing such an interface is <a href="http://www.getopt.org/luke/" target="_blank">Luke</a>. Luke was built for development and diagnostic purposes, and can be used to search, display and browse the results of a Lucene index. With Luke, you can view documents, analyze results, copy and delete them, or optimize indexes that have already been built. The best part: you can easily operate on multipart indexes stored directly in HDFS, as illustrated in Figure 3:</p>
<p style="text-align: center;"><img src="https://www.cloudera.com/wp-content/uploads/2011/12/3.-Indexing-and-Searching-within-HDFS-or-Local-Filesystem.jpg" alt="" /></p>
<p style="text-align: center;"><strong>Figure 3: Indexing and Searching within HDFS or Local Filesystem.</strong></p>
<p>Having discussed design at a high level, let&#8217;s now dive deeper into the details of MapReduce for creating an index.</p>
<p>Here is how the configure portion of each mapper could look:</p>
<pre class="code">//initialize indexWriter..
Analyzer analyzer = <strong>new</strong> StandardAnalyzer(Version.<em>LUCENE_33</em>);
IndexWriterConfig iwc = <strong>new</strong> IndexWriterConfig(Version.<em>LUCENE_33</em>, analyzer);
//if we are writing to HDFS, then use RAMDirectory
<strong>if</strong> (toHDFS){
&nbsp; &nbsp; &nbsp; iwc.setOpenMode(OpenMode.<em>CREATE</em>);
&nbsp; &nbsp; &nbsp; idx =&#160; <strong>new</strong> RAMDirectory();
&nbsp; &nbsp; &nbsp; writer = <strong>new</strong> IndexWriter(idx, iwc);
} <strong>else</strong> {
&nbsp; &nbsp; &nbsp;//use CREATE_OR_APPEND so if index already exists it will simply be appended to
&nbsp; &nbsp; &nbsp;iwc.setOpenMode(OpenMode.<em>CREATE_OR_APPEND</em>);
&nbsp; &nbsp; &nbsp;idx = FSDirectory.<em>open</em>(<strong>new</strong> File(outputDir));
&nbsp; &nbsp; &nbsp;writer = <strong>new</strong> IndexWriter(idx, iwc);
}</pre>
<ol style="padding-top: 12px; padding-left: 20px;">
<li>Initialize the analyzer and index writer config.</li>
<li>If writing to HDFS, you can use RAMDirectory to hold the indexes created; and once complete, flush to HDFS.</li>
<li>If writing to a local file system, simply create FSDirectory with the location.</li>
</ol>
<p>Having configured the mapper, let&#8217;s look at the map method:</p>
<pre class="code"><strong>public</strong> <strong>void</strong> map(LongWritable key, Text value, OutputCollector&lt;Text, Text&gt; output,
Reporter reporter) <strong>throws</strong> IOException {
<strong>try</strong> {
&nbsp; &nbsp; &nbsp;Document doc = <strong>new</strong> Document();
&nbsp; &nbsp; &nbsp;//add email file path
&nbsp; &nbsp; &nbsp;String path = key.toString();
&nbsp; &nbsp; &nbsp;Fieldable field = <strong>new</strong> Field("path", path,
&nbsp; &nbsp; &nbsp;Field.Store.<em>YES</em>, Field.Index.<em>ANALYZED</em>);
&nbsp; &nbsp; &nbsp;doc.add(field);

&nbsp; &nbsp; &nbsp;//convert content into MapiMessage
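&nbsp; &nbsp; &nbsp;//Text.getBytes() returns the backing array; only the first getLength() bytes are valid
&nbsp; &nbsp; &nbsp;InputStream input = <strong>new</strong>
&nbsp; &nbsp; &nbsp;ByteArrayInputStream(value.getBytes(), 0, value.getLength());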
&nbsp; &nbsp; &nbsp;MAPIMessage msg = <strong>new</strong> MAPIMessage(input);
&nbsp; &nbsp; &nbsp;//add recipient as stored and analyzed field so we can
&nbsp; &nbsp; &nbsp;//search based on recipient and display recipient name in the results
&nbsp; &nbsp; &nbsp;String recipient = msg.getRecipientEmailAddress();
&nbsp; &nbsp; &nbsp;field = <strong>new</strong> Field("recipient",
&nbsp; &nbsp; &nbsp;recipient, Field.Store.<em>YES</em>, Field.Index.<em>ANALYZED</em> );
&nbsp; &nbsp; &nbsp;doc.add(field);

&nbsp; &nbsp; &nbsp;String subject = msg.getSubject();
&nbsp; &nbsp; &nbsp;field = <strong>new</strong> Field("subject", subject, Field.Store.<em>YES</em>,
&nbsp; &nbsp; &nbsp;Field.Index.<em>ANALYZED</em> );
&nbsp; &nbsp; &nbsp;doc.add(field);

&nbsp; &nbsp; &nbsp;String content = msg.getTextBody();
&nbsp; &nbsp; &nbsp;field = <strong>new</strong> Field("content", content, Field.Store.<em>YES</em>,
&nbsp; &nbsp; &nbsp;Field.Index.<em>ANALYZED</em> );
&nbsp; &nbsp; &nbsp;doc.add(field);

&nbsp; &nbsp; &nbsp;//add more fine grained fields based on search criteria needed
&nbsp; &nbsp; &nbsp;...

&nbsp; &nbsp; &nbsp;writer.addDocument(doc);
} <strong>catch</strong> (Exception e) {
&nbsp; &nbsp; &nbsp;e.printStackTrace();
}
}
</pre>
<p style="padding-top: 12px;">Once it&#8217;s configured within each map, the exercise boils down to parsing the content and adding it to the writer. The above code does the following:</p>
<ol style="padding-left: 20px;">
<li>Create a Lucene document.</li>
<li>Parse email content passed into MAPIMessage.</li>
<li>Extract the necessary fields and add them to the index. (In this example I only extract recipient, subject and content. You can add more fields as necessary and use HBase to store the content if needed.)</li>
</ol>
<p>At this point, the index has been created &#8211; either in memory or the Local File System. The final task is to close the index to make it searchable, which can be done within the close method of the mapper, as demonstrated below:</p>
<pre class="code"><strong>public</strong> <strong>void</strong> close() {
<strong>try</strong> {
writer.optimize();
writer.close();
<strong>if</strong> (toHDFS) {
&nbsp; &nbsp; &nbsp;String files[]= idx.listAll();
&nbsp; &nbsp; &nbsp;FileSystem dfs = FileSystem.<em>get</em>(localConf);
&nbsp; &nbsp; &nbsp;<strong>for</strong> (<strong>int</strong> i = 0; i &lt; files.length; i++){
&nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp;//Read data into byte array, create file in HDFS, write bytes to that file
&nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp; ....
&nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp;fs.write(array);
&nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp;fs.close();
&nbsp; &nbsp; &nbsp;}
}
} <strong>catch</strong> (CorruptIndexException e) {
&nbsp; &nbsp; &nbsp;e.printStackTrace();
} <strong>catch</strong> (IOException e) {
&nbsp; &nbsp; &nbsp;e.printStackTrace();
}
}
</pre>
<ol style="padding-left: 20px; padding-top: 12px;">
<li>Close the created index.</li>
<li>In the case of HDFS, walk through the index in memory and write it to HDFS.</li>
</ol>
<p>The index should now have been created, either in the Local File System of each of the DataNodes, or in HDFS directly. If it is in the Local File System, you can opt to make the directory part of the &#8220;www&#8221; directory and enable Solr to serve it from there. If it is in HDFS, one could load the index in RAMDirectory within each mapper and search, use a tool like Luke to provide a search interface, or put a mechanism in place to copy it to the Local File System to point Solr at it.</p>
<h2>Appending</h2>
<p>Appending to an existing index can be a bit tricky. If the index sits in a Local File System, this can be accomplished by setting the index writer to APPEND mode and adding new documents. This can get a bit more complicated, however, when the index is in HDFS. One option would be to write an index to a new directory in HDFS, then merge with the existing index.</p>
<h2>SolrCloud and Katta</h2>
<p>Since we are discussing fast search options, it also makes sense to touch on components like SolrCloud and Katta.</p>
<p>SolrCloud enables clusters of Solr instances to be created, with a central configuration, automatic load balancing, resizing, rebalancing and fail-over.</p>
<p>Katta serves indexes in a distributed manner similar to HDFS. It has built in replication for fail-over and performance, is easy to integrate with Hadoop clusters and has master fail-over. However, it does not provide real-time updates, nor is it an indexer &#8211; it is simply a serving tool for Lucene indexes.</p>
<p>In Part 3 of this series, I will cover ways to ingest such email messages and ways to put the steps involved into a workflow. In the meantime, drop us a line if you have any questions on storing email messages in Hadoop and indexing and searching them using Solr and Lucene.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011: A Glimpse into Development</title>
		<link>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/</link>
		<comments>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 13:00:42 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[careers]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[Cloudera's Service and Configuration Manager]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop event]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9240</guid>
		<description><![CDATA[The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hadoopworld.com/"><img style="float: left; padding-right: 20px;" title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" /></a></p>
<p>The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.</p>
<h2 style="font-size: 14pt; color: #344152;"><a href="http://www.hadoopworld.com/tracks/development-developers/" target="_blank">Preview of Development Track Sessions</a></h2>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Building Web Analytics Processing on Hadoop at CBS Interactive</span></a><br />
 <em>Michael Sun, CBS Interactive</em></p>
<p><strong>Abstract:</strong> CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from the hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack (the Extraction, Transformation and Loading framework we built based on Python and streaming, which is under review for open-source release), Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault-tolerance and scalability, and a significant reduction of processing time to reach SLA (over six hours reduction so far).</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Gateway: Cluster Virtualization Framework</span></a><br />
<em>Konstantin Shvachko, eBay</em></p>
<p><strong>Abstract:</strong> Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks &#8212; as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users&#8217; workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">SHERPASURFING &#8211; Open Source Cyber Security Solution</span></a><br />
<em>Wayne Wheeles, Novii Design</em></p>
<p><strong>Abstract:</strong> Every day billions of packets, both benign and some malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Cloudera&#8217;s Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission at scale and at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools</span></a><br />
<em>Arvind Prabhakar, Cloudera<br />
Guy Harrison, Quest Software</em></p>
<p><strong>Abstract:</strong> As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We&#8217;ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we&#8217;ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Next Generation Apache Hadoop MapReduce</span></a><br />
<em>Mahadev Konar, Hortonworks</em></p>
<p><strong>Abstract:</strong> The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of MapReduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1. </p>
<p><a href="http://www.hadoopworld.com/"><img title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/12/registernow.gif" alt="Register for Hadoop World" /></a></p>
<p>There are several <a href="http://www.hadoopworld.com/training/">training classes</a> and <a href="http://www.hadoopworld.com/training/">certification sessions</a> provided surrounding the Hadoop World conference. Don&#8217;t forget to register and become <a href="http://www.hadoopworld.com/training/">Cloudera Certified in Apache Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CDH3 Update 1 Released</title>
		<link>http://www.cloudera.com/blog/2011/07/cdh3u1-released/</link>
		<comments>http://www.cloudera.com/blog/2011/07/cdh3u1-released/#comments</comments>
		<pubDate>Fri, 22 Jul 2011 19:00:05 +0000</pubDate>
		<dc:creator>Charles Zedlewski</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[CDH update]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[Hue]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8347</guid>
		<description><![CDATA[Announcing an update to CDH3.]]></description>
			<content:encoded><![CDATA[<p>Continuing with our practice from Cloudera&#8217;s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution.&#160; For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.</p>
<p>For those of you who are recent Cloudera users, here is a refresh on our update policy:</p>
<ul>
<li>We will only include patches in updates that are non-compatibility breaking.</li>
<li>We will only include patches in updates that are non-disruptive.</li>
<li>You can skip updates without penalty &#8211; i.e., if you don&#8217;t find the contents of an update compelling, you can skip it and wait for a future update without having to do a delta upgrade.</li>
</ul>
<p>There is one new addition to our update policy going forward: when it&#8217;s possible to pull features from our CDH4 roadmap into CDH3 updates in a non-disruptive way, we&#8217;ll take advantage of that opportunity.</p>
<p>With all that said, there are a number of improvements coming to CDH3 with update 1. &#160;Among them are:</p>
<ol>
<li>New features &#8211; integrated Apache-compatible licensed fast compression throughout CDH, web shell for Hue, Flume / HBase integration, Fair Scheduler ACLs, improved datanode handling of hard drive failures, and email actions and date formatting for Oozie.</li>
<li>Improvements (stability and performance) &#8211; HBase bulk loading, Namenode stability, Fuse-DFS (mountable HDFS).</li>
<li>New component versions &#8211; Hive 0.7.1, Pig 0.8.1, HBase 0.90.3, Flume 0.9.4 and Sqoop 1.3.</li>
<li>Bug fixes &#8211; 80+ bug fixes. &#160;Per our standard practice, the enumerated fixes and their corresponding Apache project JIRAs are provided in the release notes. </li>
</ol>
<p>Update 1 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). &#160;Check out <a href="https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation">the installation docs</a> for instructions. If you&#8217;re running components from the Cloudera Management Suite they will not be impacted by moving to update 1.  The next update (update 2) for CDH3 is planned for mid-October.</p>
<p>Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/cdh3u1-released/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Hoop &#8211; Hadoop HDFS over HTTP</title>
		<link>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/</link>
		<comments>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/#comments</comments>
		<pubDate>Wed, 20 Jul 2011 20:44:21 +0000</pubDate>
		<dc:creator>Alejandro Abdelnur</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8306</guid>
		<description><![CDATA[What is Hoop? Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S. Hoop can be used to: Access HDFS using HTTP REST. Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues). Access data in a HDFS cluster behind a firewall. The Hoop server [...]]]></description>
			<content:encoded><![CDATA[<h2>What is Hoop?</h2>
<p>Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.</p>
<p>Hoop can be used to:</p>
<div style="margin-left: 20px">
<ul>
<li>Access HDFS using HTTP REST.</li>
<li>Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).</li>
<li>Access data in a HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system that is allowed to go through the firewall.</li>
</ul>
</div>
<p>Hoop has a Hoop client and a Hoop server component:</p>
<div style="margin-left: 20px">
<ul>
<li>The Hoop server component is a REST HTTP gateway to HDFS supporting all file system operations. It can be accessed using standard HTTP tools (e.g. curl and wget), HTTP libraries from different programming languages (e.g. Perl, JavaScript) as well as using the Hoop client. The Hoop server component is a standard Java web-application and it has been implemented using Jersey (JAX-RS).</li>
<li>The Hoop client component is an implementation of the Hadoop FileSystem client that allows using the familiar Hadoop filesystem API to access HDFS data through a Hoop server.</li>
</ul>
</div>
<h2>Hoop and Hadoop HDFS Proxy</h2>
<p>Hoop server is a full rewrite of <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html" target="_about">Hadoop HDFS Proxy</a>. Although it is similar to Hadoop HDFS Proxy (runs in a servlet-container, provides a REST API, pluggable authentication and authorization), Hoop server improves on many of Hadoop HDFS Proxy&#8217;s shortcomings by providing:</p>
<div style="margin-left: 20px">
<ul>
<li>Support for all HDFS operations (read, write, status).</li>
<li>Cleaner HTTP REST API.</li>
<li>JSON format for status data (files status, operations status, error messages).</li>
<li>Kerberos HTTP SPNEGO client/server authentication and pseudo authentication out of the box (using <a href="http://cloudera.github.com/alfredo/docs/latest/index.html">Alfredo</a>).</li>
<li>Hadoop proxy-user support.</li>
<li>Tools such as DistCP could run on either cluster.</li>
</ul>
</div>
<h2>Accessing HDFS files via Hoop using the Unix &#8216;curl&#8217; command</h2>
<p>Assuming Hoop is running on http://hoopbar:14000, the following examples show how the Unix &#8216;curl&#8217; command can be used to access data in HDFS via Hoop using pseudo authentication.</p>
<p>Getting the home directory:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000?op=homedir&amp;user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
{"homeDir":"http:\/\/hoopbar:14000\/user\/babu"}
$</pre>
<p style="padding-top: 8px">Reading a file:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000/user/babu/hello.txt?user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Hello World!
$</pre>
<p style="padding-top: 8px">Writing a file:</p>
<pre class="code" style="padding-left: 30px">$ curl -i -X POST "http://hoopbar:14000/user/babu/data.txt?op=create" --data-binary @mydata.txt --header "content-type: application/octet-stream"
HTTP/1.1 200 OK
Location: http://hoopbar:14000/user/babu/data.txt
Content-Type: application/json
Content-Length: 0
$</pre>
<p style="padding-top: 8px">Listing the contents of a directory:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000/user/babu?op=list&amp;user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked

[
  {
    "path" : "http:\/\/hoopbar:14000\/user\/babu\/data.txt",
    "isDir" : false,
    "len" : 966,
    "owner" : "babu",
    "group" : "supergroup",
    "permission" : "-rw-r--r--",
    "accessTime" : 1310671662423,
    "modificationTime" : 1310671662423,
    "blockSize" : 67108864,
    "replication" : 3
  }
]
$</pre>
<p style="padding-top: 8px">Click this link for more details about the <a href="http://cloudera.github.com/hoop/docs/latest/HttpRestApi.html" target="_about">Hoop HTTP REST API</a>.</p>
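<p style="padding-top: 8px">The request layout used in the curl examples can also be composed programmatically. The following is a minimal Java sketch and not part of Hoop itself: the class and method names are hypothetical, it assumes the path-then-query layout shown in the write example (HDFS path first, then the "op" and "user.name" parameters), and it does no URL encoding. The Hoop HTTP REST API documentation remains the authoritative source for the supported operations and parameters.</p>
<pre class="code" style="padding-left: 30px">// Hypothetical helper (not part of Hoop): composes request URLs in the
// layout assumed from the curl examples.
class HoopUrlSketch {

    // Returns base + hdfsPath, followed by "op" and "user.name" query
    // parameters when they are non-null. No URL encoding is performed.
    static String url(String base, String hdfsPath, String op, String user) {
        StringBuilder sb = new StringBuilder(base).append(hdfsPath);
        char sep = '?';
        if (op != null) {
            sb.append(sep).append("op=").append(op);
            sep = '&';
        }
        if (user != null) {
            sb.append(sep).append("user.name=").append(user);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String base = "http://hoopbar:14000";
        // Listing a directory and reading a file, as in the curl examples.
        System.out.println(url(base, "/user/babu", "list", "babu"));
        System.out.println(url(base, "/user/babu/hello.txt", null, "babu"));
    }
}
</pre>
<p style="padding-top: 8px">The resulting strings can be handed to any HTTP library. Keeping URL construction in one place makes it easier to add encoding or extra parameters later.</p>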
<h2>Getting Hoop</h2>
<p>Hoop is distributed with an Apache License 2.0.</p>
<p>The source code is available at <a href="http://github.com/cloudera/hoop" target="_about">http://github.com/cloudera/hoop</a>.</p>
<p>Instructions on how to build, install and configure Hoop server, along with the rest of the documentation, are available at&#160;<a href="http://cloudera.github.com/hoop" target="_about">http://cloudera.github.com/hoop</a>.</p>
<h2>Contributing Hoop to Apache Hadoop</h2>
<p>The goal is to contribute Hoop to Apache Hadoop as the next generation of Hadoop HDFS proxy. We are just waiting on the Mavenization of Hadoop Common and Hadoop HDFS which will make integration easier.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Hadoop Availability</title>
		<link>http://www.cloudera.com/blog/2011/02/hadoop-availability/</link>
		<comments>http://www.cloudera.com/blog/2011/02/hadoop-availability/#comments</comments>
		<pubDate>Thu, 10 Feb 2011 16:00:11 +0000</pubDate>
		<dc:creator>Eli Collins</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[hadoop availability]]></category>
		<category><![CDATA[hadoop overview]]></category>
		<category><![CDATA[hadoop progress]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6505</guid>
		<description><![CDATA[A common question on the Apache Hadoop mailing lists is what&#8217;s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed. Background When discussing Hadoop availability people often start with the NameNode since it is a [...]]]></description>
			<content:encoded><![CDATA[<p>A common question on the <a href="http://hadoop.apache.org/mailing_lists.html">Apache Hadoop mailing lists</a> is what&#8217;s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.</p>
<h3>Background</h3>
<p>When discussing Hadoop availability people often start with the NameNode since it is a <a id="j.bf" title="single point of failure" href="http://en.wikipedia.org/wiki/Single_point_of_failure">single point of failure</a> (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, HBase, Pig, Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it&#8217;s helpful to establish some context before diving in.</p>
<p>Availability is the proportion of time a system is functioning [1], which is commonly referred to as &#8220;uptime&#8221; (vs downtime, when the system is not functioning).</p>
<p>Note that availability is a stricter requirement than fault tolerance &#8211; the ability for a system to perform as designed and degrade gracefully in the presence of failures. A system that requires an hour to restart (eg for a configuration change or software upgrade) but has no single point of failure is fault tolerant but not highly available (HA). Adding redundancy to all SPOFs is a common way to improve fault tolerance, which helps improve Hadoop availability [2] but is only one part of doing so. Note also that fault tolerance is distinct from durability: even though the NameNode is a SPOF, no single failure results in data loss, because copies of the NameNode&#8217;s persistent state (the image and edit log) are replicated both within and across hosts.</p>
<p>Availability is also often conflated with reliability. Reliability in distributed systems is a more general issue than availability [3]. A truly reliable distributed system must be highly available, fault tolerant, secure, scalable, and perform predictably, etc.  I&#8217;ll limit this post to Hadoop availability.</p>
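<p>To make the definition of availability above concrete, here is a small Java sketch of the arithmetic. The class and method names are hypothetical, and the figures (one hour-long restart per month, a 99.9% target) are illustrative assumptions rather than measurements of any Hadoop cluster.</p>
<pre class="code">// Illustrative arithmetic only; names and numbers are assumptions.
class AvailabilitySketch {

    // Availability is the proportion of time the system is functioning:
    // uptime / (uptime + downtime).
    static double availability(double uptimeHours, double downtimeHours) {
        return uptimeHours / (uptimeHours + downtimeHours);
    }

    // Downtime (in hours) that a target availability allows over a period.
    static double downtimeBudgetHours(double targetAvailability, double periodHours) {
        return (1.0 - targetAvailability) * periodHours;
    }

    public static void main(String[] args) {
        double hoursPerYear = 365 * 24; // ignoring leap years

        // Assumed scenario: one hour-long restart per month, 12 hours a year.
        double a = availability(hoursPerYear - 12, 12);
        System.out.printf("availability with monthly restarts: %.4f%n", a);

        // Assumed target of 99.9% ("three nines").
        double budget = downtimeBudgetHours(0.999, hoursPerYear);
        System.out.printf("yearly downtime budget at 99.9%%: %.2f hours%n", budget);
    }
}
</pre>
<p>Under these assumptions, a fault tolerant system that needs an hour-long restart every month already spends more than a 99.9% yearly downtime budget on planned maintenance alone.</p>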
<h3>Reasons for downtime</h3>
<p>An important part of improving availability and articulating requirements is understanding the causes of downtime. There are many types of failures in distributed systems, ways to classify them, and analyses of how failures result in downtime. Rather than go into depth here, I&#8217;ll briefly summarize some general categories of issues that may cause downtime:</p>
<p><strong>1. Maintenance</strong> &#8211; Hardware and software may need to be upgraded, configuration changes may require a system restart, and operational tasks may be needed for dependent systems. Hadoop can handle most maintenance to <a href="http://hadoop.apache.org/common/docs/current/cluster_setup.html#Slaves">slave hosts</a> without downtime; however, maintenance to a master host normally requires a restart of the entire system.</p>
<p><strong>2. Hardware failures</strong> &#8211; Hosts and their connections may fail. Without redundant devices, or redundant components within devices, a single component failure may cause the entire device to fail. Hadoop can tolerate hardware failures (even silent failures like corruption) to slave hosts without downtime; however some hardware failures on the master host (or a failure in the connection between the master and the majority of the slaves) can cause system downtime [4].</p>
<p><strong>3. Software failures</strong> &#8211; Software bugs may cause a component in the system to stop functioning or require a restart. For example, a bug in upgrade code could result in downtime due to data corruption. A dependent software component may become unavailable (eg the Java garbage collector enters a stop-the-world phase). Hadoop can tolerate some software bugs without downtime; however components are generally designed to <a id="d_yo" title="fail-fast" href="http://en.wikipedia.org/wiki/Fail-fast">fail-fast</a> &#8211; to stop and notify other components of failure rather than attempt to continue a possibly-flawed process. Therefore a software bug in a master service will likely cause downtime.</p>
<p><strong>4. Operator errors</strong> &#8211; People make mistakes. From disconnecting the wrong cable, to mis-configured hosts, to typos in configuration files, operator errors can cause downtime. Hadoop attempts to limit operator error by simplifying administration, validating its configuration, and providing useful messages in logs and UI components; however operator mistakes may still cause downtime.</p>
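<p>The fail-fast behavior mentioned under software failures can be illustrated with a short Java sketch. This is not Hadoop code: the class, the method and the checksum scenario are hypothetical, chosen only to show a component stopping and reporting a problem instead of continuing with possibly-flawed input.</p>
<pre class="code">// Hypothetical illustration of fail-fast; not taken from Hadoop.
class FailFastSketch {

    // Refuses to proceed when the input cannot be trusted. Throwing
    // immediately stops the operation and notifies the caller.
    static void applyEdit(long expectedChecksum, long actualChecksum) {
        if (expectedChecksum != actualChecksum) {
            throw new IllegalStateException("checksum mismatch: expected "
                + expectedChecksum + " but got " + actualChecksum);
        }
        // ... the edit would be applied here, once the input is known good
    }

    public static void main(String[] args) {
        try {
            applyEdit(42L, 42L); // matching checksums: proceeds normally
            applyEdit(42L, 7L);  // mismatch: throws before any work is done
        } catch (IllegalStateException e) {
            // The caller decides what to do next, e.g. shut down or fail over.
            System.err.println("stopping: " + e.getMessage());
        }
    }
}
</pre>
<p>The trade-off is the one described above: stopping early protects the data, but when the component is a master service, stopping also means downtime unless a standby can take over.</p>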
<h3>Use cases</h3>
<p>In order for a system to be highly available, its design needs to anticipate these various failures. Removing single points of failure, enabling rolling upgrades, speeding up restarts, and making the system robust and user friendly, among other things, are all necessary to improve availability. Given that improving availability requires a multi-pronged approach, let&#8217;s take a look at the relevant use cases for limiting downtime.</p>
<p><strong>1. Host maintenance</strong> &#8211; If an operator needs to upgrade or replace the primary host hardware or upgrade its operating system, they should be able to manually fail over to a <a id="j_1t" title="hot standby" href="http://en.wikipedia.org/wiki/Hot_spare">hot standby</a>, perform the upgrade and optionally fail back to the primary. The fail-over should be transparent to clients accessing the system (eg active jobs continue to run). Host maintenance to slave hosts can be handled without downtime today by <a id="nmzh" title="de-commissioning" href="http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html">de-commissioning</a> the host.</p>
<p><strong>2. Configuration changes</strong> &#8211; Ideally, configuration changes to masters should not require a system restart: either the configuration can be updated in place, or fail-over to a hot standby with an updated configuration is supported. In cases when a restart is required, the operator should be able to restart the system with minimal impact to running workloads.</p>
<p><b>3. Software upgrades</b> &#8211; An operator should be able to upgrade Hadoop&#8217;s software in-place (a &#8220;rolling upgrade&#8221;) on slave nodes and via fail-over on the master hosts so there is little or no downtime. If a restart is required it should be accelerated by quickly re-constructing the system&#8217;s state.</p>
<p><strong>4. Host failures</strong> &#8211; If a non-redundant hardware component fails, the operating system crashes, a disk partition runs out of space, etc., the system should detect the failure and, depending on the service and failure, (a) recover, (b) de-commission itself, or (c) fail over to a hot standby. Hadoop currently tolerates slave host failures without downtime; however, master host failures often cause downtime. In practice, for a number of reasons, master hardware failures do not cause as much downtime as you might expect:</p>
<div style="margin-left:20px">
<ul>
<li>In large clusters it is statistically improbable that a hardware failure impacts a machine running master services, and operations teams are often good at keeping a small number of well-known hosts healthy.</li>
<li>Because there are few master hosts, redundant hardware components can be used to limit the probability of a host failure without dramatically increasing the price of the overall system.</li>
</ul>
</div>
<h3>Highly Available Hadoop</h3>
<p>A number of efforts are under way to improve Hadoop availability and to implement missing functionality required by the above use cases. Tasks related to HDFS availability are <a id="xk.7" title="tracked here" href="https://issues.apache.org/jira/browse/HDFS-1064">tracked here</a>; tasks related to MapReduce availability are <a id="ntev" title="tracked here" href="https://issues.apache.org/jira/browse/MAPREDUCE-2288">tracked here</a>.</p>
<p><strong>1.</strong> Improvements to Hadoop&#8217;s failure handling code. Hadoop&#8217;s native <a id="sd.5" title="fault injection frameworks" href="http://hadoop.apache.org/hdfs/docs/current/faultinject_framework.html">fault injection framework</a> and <a id="k-03" title="other frameworks" href="https://github.com/toddlipcon/gremlins#readme">other related frameworks</a> continue to make Hadoop more robust in the face of failures. Recent advances in failure testing have been successfully applied to Hadoop [5] to identify software bugs (eg <a id="nqtm" title="HDFS-1231" href="https://issues.apache.org/jira/browse/HDFS-1231">HDFS-1231</a>, <a id="b4m-" title="HDFS-1225" href="https://issues.apache.org/jira/browse/HDFS-1225">HDFS-1225</a>, and <a id="r8x:" title="HDFS-1228" href="https://issues.apache.org/jira/browse/HDFS-1228">HDFS-1228</a>).</p>
<p><strong>2.</strong> Work was recently started to allow <a id="f1mv" title="Hadoop configuration changes without restart" href="https://issues.apache.org/jira/browse/HADOOP-7001">Hadoop configuration changes without restart</a>. As Hadoop <a id="gxty" title="incorporates this change" href="https://issues.apache.org/jira/browse/HDFS-1477">incorporates this change</a>, configuration parameter changes will increasingly be possible without downtime.
 </p>
<p><strong>3.</strong> A number of changes (eg <a id="wld7" title="HDFS-1070" href="https://issues.apache.org/jira/browse/HDFS-1070">HDFS-1070</a>, <a id="hy1e" title="HDFS-1295" href="https://issues.apache.org/jira/browse/HDFS-1295">HDFS-1295</a>, and <a id="kjsd" title="HDFS-1391" href="https://issues.apache.org/jira/browse/HDFS-1391">HDFS-1391</a>) are underway to significantly improve the time it takes to restart HDFS.</p>
<p><strong>4.</strong> <a id="dz:d" title="Work has started" href="https://issues.apache.org/jira/browse/HADOOP-6904">Work has started</a> to allow Hadoop client and server software of different versions to co-exist, with the goal of enabling in-place Hadoop software upgrades.</p>
<p><strong>5.</strong> There have been efforts to make existing releases of HDFS <a id="vuot" title="more highly-available" href="http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration">more highly-available</a>, as well as several research prototypes (eg <a id="gyzv" title="UpRight-HDFS" href="http://code.google.com/p/upright/wiki/HDFSUpRightOverview">UpRight-HDFS</a>, <a id="zw1c" title="NameNode Cluster" href="http://gnawux.info/hadoop/2010/01/pratice-of-namenode-cluster-for-hdfs-ha">NameNode Cluster</a>, and <a id="a4b-" title="HDFS-dnn" href="http://code.google.com/p/hdfs-dnn">HDFS-dnn</a>) that examine HDFS availability. HDFS developers are currently working on a hot standby for the NameNode to improve on the existing <a id="cf-p" title="NameNode fail-over" href="http://wiki.apache.org/hadoop/NameNodeFailover">NameNode fail-over</a>. Like the <a id="tn6q" title="Google File System's" href="http://labs.google.com/papers/gfs.html">Google File System</a> which has &#8220;shadow masters&#8221;, this allows HDFS to fail the NameNode process from one host to another by actively replicating all NameNode state required to quickly restart the process. Integrating the <a id="k5pw" title="BackupNode" href="http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html#Backup+Node">BackupNode</a> (edits are streamed from the primary NameNode to one or more BackupNodes) or using <a id="yrwj" title="integrating BookKeeper" href="https://issues.apache.org/jira/browse/HDFS-234">BookKeeper</a> (a replicated service to reliably log streams of records that can be used for the edits log) with the <a id="tywb" title="AvatarNode" href="http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html">AvatarNode</a> (which replicates block reports across a primary and backup host) results in a standby NameNode that can be activated if the NameNode fails. 
Automatic hot fail-over can be achieved by integrating both clients and servers with <a id="yuh:" title="ZooKeeper" href="http://hadoop.apache.org/zookeeper">ZooKeeper</a>. A similar approach has been successfully used by Google to make GFS highly available [6]:</p>
<blockquote><p>Initially, GFS had no provision for automatic master failover. It was a manual process. Although it didn&#8217;t happen a lot, whenever it did, the cell might be down for an hour. Even our initial master-failover implementation required on the order of minutes. Over the past year, however, we&#8217;ve taken that down to something on the order of tens of seconds.</p></blockquote>
<p>This Active/Passive design enables both high availability and evolution towards being a better storage layer for systems like <a id="ez96" title="HBase" href="http://hbase.apache.org/">HBase</a>, which in turn could be used to store metadata for a new version of HDFS (similar to <a id="onl:" title="Google's Colossus" href="http://queue.acm.org/detail.cfm?id=1594206">Google&#8217;s GFS2</a>). Like <a id="l97v" title="HDFS federation" href="https://issues.apache.org/jira/browse/HDFS-1052">HDFS federation</a>, this provides an evolutionary path to high scalability without the complexity of modifying HDFS to use <a id="jf:c" title="multi-master replication" href="http://en.wikipedia.org/wiki/Multi-master_replication">multi-master replication</a>.</p>
<p><strong>6.</strong> The MapReduce master (JobTracker) state is <a id="ohnp" title="available in HDFS" href="https://issues.apache.org/jira/browse/MAPREDUCE-863">stored in HDFS</a> and is therefore limited by its availability. The JobTracker can be restarted; however, work needs to be done to integrate it with a service like ZooKeeper to handle fail-over to a separate host.</p>
<p>Hopefully this post has helped frame the various tasks behind Hadoop&#8217;s march towards high availability in a useful context. The development community understands this is one of the highest-priority issues for users, and is looking forward to providing a highly available Hadoop in upcoming releases. Similarly, Cloudera is committed to improving availability in CDH4 &#8211; it&#8217;s our primary focus for the release.</p>
<p><i>Thanks to Dhruba Borthakur, Doug Cutting, Sanjay Radia, and Konstantin Shvachko for reading drafts of this post.</i></p>
<hr />
<h3>Footnotes</h3>
<p>[1] A common way to define availability is the ratio of the expected value of the uptime of the system to the aggregate of the expected values of uptime and downtime. Common metrics used are:</p>
<ul style="margin-left:20px">
<li>Mean time between failures (MTBF) &#8211; the expected time between failures of a system during operation.</li>
<li>Mean time to recovery (MTTR) &#8211; the average time the system will take to recover from any failure.</li>
</ul>
<p>Using these metrics, availability can be defined as MTBF / (MTBF + MTTR).</p>
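<p>As a rough illustration of the formula above, the following Python sketch computes availability and the yearly downtime it implies. The function names and the sample MTBF and MTTR figures are hypothetical, chosen only for illustration.</p>

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def availability(mtbf_hours, mttr_hours):
    """Availability as MTBF / (MTBF + MTTR); both arguments in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(avail):
    """Expected hours per year that a system with this availability is down."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical master host: fails about every 999 hours, takes 1 hour to recover.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print("availability: %.4f" % a)
print("downtime: %.2f hours/year" % expected_downtime_hours_per_year(a))
```

<p>Because availability depends only on the ratio of MTTR to MTBF, halving recovery time improves it as much as doubling the time between failures.</p>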
<p>[2] I say &#8220;helps&#8221; because the most common reasons for downtime (misconfiguration, operator error, and software bugs) are all exacerbated by system complexity, and making systems more fault tolerant often increases their complexity.</p>
<p>[3] &#8220;Reliability and availability are different: Availability is doing the right thing within the specified response time. Reliability is not doing the wrong thing.&#8221; From <i>Why Do Computers Stop and What Can Be Done About It?</i> by Jim Gray.</p>
<p>[4] People often configure master hosts with redundant hardware components (NICs, disks, I/O controllers, and power units) so that an individual component failure does not cause the system to fail.</p>
<p>[5] <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html">Towards Automatically Checking Thousands of Failures with Micro-specifications</a>. H Gunawi, T Do, et al. UC Berkeley TR EECS-2010-98.</p>
<p>[6] <a href="http://queue.acm.org/detail.cfm?id=1594206">GFS: Evolution on Fast-Forward</a>. Marshall Kirk McKusick, Sean Quinlan in the ACM Queue.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/02/hadoop-availability/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>CDH2 Update 3 Now Available</title>
		<link>http://www.cloudera.com/blog/2011/01/cdh2-update-3-now-available/</link>
		<comments>http://www.cloudera.com/blog/2011/01/cdh2-update-3-now-available/#comments</comments>
		<pubDate>Fri, 28 Jan 2011 16:00:45 +0000</pubDate>
		<dc:creator>Eli Collins</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[cdh apache hadoop]]></category>
		<category><![CDATA[CDH update]]></category>
		<category><![CDATA[cdh2 update]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6244</guid>
		<description><![CDATA[Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what&#8217;s in this release. You can [...]]]></description>
			<content:encoded><![CDATA[<p>Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like <a href="https://issues.apache.org/jira/browse/HADOOP-5203">HADOOP-5203</a>, <a href="https://issues.apache.org/jira/browse/HDFS-1377">HDFS-1377</a>, <a href="https://issues.apache.org/jira/browse/MAPREDUCE-1699">MAPREDUCE-1699</a>, <a href="https://issues.apache.org/jira/browse/MAPREDUCE-1853">MAPREDUCE-1853</a>, and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-270">MAPREDUCE-270</a>. Check out the <a href="http://archive.cloudera.com/cdh/2/hadoop-0.20.1+169.127.releasenotes.html">release notes</a> and <a href="http://archive.cloudera.com/cdh/2/hadoop-0.20.1+169.127.CHANGES.txt">change log</a> for more details on what&#8217;s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our <a href="https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page">CDH documentation</a>.</p>
<p>We appreciate feedback! Get in touch with us on <a href="https://groups.google.com/a/cloudera.org/group/cdh-user">the CDH user list</a>, <a href="http://twitter.com/cloudera">Twitter</a>, or IRC (#cloudera on freenode.net) and let us know how the update is working for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/cdh2-update-3-now-available/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using Hadoop for Fraud Detection and Prevention</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/#comments</comments>
		<pubDate>Wed, 25 Aug 2010 05:27:20 +0000</pubDate>
		<dc:creator>Alex Kozlov</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[fraud]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4478</guid>
		<description><![CDATA[Learn about fraud and how to prevent it with Hadoop]]></description>
			<content:encoded><![CDATA[<p>Fraud has multiple meanings and the term can be easily abused.&#160; The definition of fraud has undergone multiple changes throughout the years and is as elusive as fraud itself.&#160; The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state or country.&#160; For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.&#160; A more general definition may contain up to <a href="http://en.wikipedia.org/wiki/Fraud#Elements_of_fraud">9 elements</a>.</p>
<p>
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.</p>
<p>
Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding citizens), is intentional (there is no &#8220;accidental&#8221; fraud), and implies significant damage to the defrauded party (otherwise why bother).&#160; Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, so it is difficult to build a predictive model, and (b) fraud assumes a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.</p>
<p><h3>Fraud and Rare Events</h3>
<p>By definition, fraud is an unexpected or rare event with significant financial or other damage.&#160; Fraud assumes that the fraudster has some prior information about how the current system works, including previous successful and unsuccessful fraud cases and possibly the fraud detection mechanisms.&#160; This breaks the standard statistical modeling assumption that observations are independent and identically distributed (i.i.d.), which makes building a reliable statistical model difficult.&#160; Often the fraudster is working in the same industry that the fraud detection is supposed to protect, is intimately familiar with the fraud detection methods, and is actively trying to avoid detection by masquerading.</p>
<p>
The rare event detection problem also applies to online advertising and marketing (particularly predicting &#8220;long tail&#8221; events) and to terrorism detection.</p>
<p>
One common example of fraud is associated with the <a href="http://en.wikipedia.org/wiki/Taleb_distribution" target="_blank">Taleb distribution</a>, where a high probability of a small gain overshadows a small probability of a large loss that more than outweighs the gains.&#160; Relatively long periods of slightly better than moderate gains are interrupted by a rare event of large losses.&#160; It is easy to defraud investors by presenting the results of a partial analysis that excludes the &#8220;rare events&#8221;.</p>
<p><h3>Fraud Prevention</h3>
<p>Since fraud is so hard to prove in court, most organizations and individuals try to prevent fraud from happening with blanket measures.&#160; These include limiting the amount of damage a fraudster can inflict on the organization as well as early detection of fraud patterns.&#160; For example, credit card companies can cut credit limits across the board in anticipation of a few fraud cases.&#160; Advertisers can block advertising campaigns with a low number of qualifying events.&#160; And anti-terrorism agencies can prevent people with bottles of pure water from boarding planes.&#160; These actions often conflict with a company&#8217;s efforts to attract more customers and result in general dissatisfaction.&#160; To the rescue come technologies like Hadoop, influence diagrams, and Bayesian networks; the latter two are computationally expensive (exact inference in them is NP-hard in general) but are more accurate and predictive.</p>
<p><h3>Why Hadoop?</h3>
<p>Hadoop is a distributed system for processing large amounts of data.&#160; At the recent Hadoop Summit 2010, Yahoo, Facebook, and other companies announced that they currently process a few TBs of data per day and that the volumes are <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_omalley.html" target="_blank">growing at exponential rates</a>.&#160; Hadoop can be vital for solving the fraud detection problem because:</p>
<ol>
<li>Sampling does not work for rare events since the chance of missing a positive fraud case leads to significant deterioration of model quality.</li>
<li>Hadoop can solve much harder problems by leveraging multiple cores across thousands of machines and searching through much larger problem domains.</li>
<li>Hadoop can be combined with other tools to manage moderate to low response latency requirements.</li>
</ol>
<p>
Let&#8217;s go through these reasons one by one.&#160; Sampling is a common technique for modeling rare events.&#160; One of the problems with sampling is that we cannot afford to throw away rare positive cases.&#160; Even in a stratified or proportional sampling scheme one has to retain all positive cases since the model accuracy heavily depends on them (one can usually discard some negative cases though).&#160; Given the above, the system still has to go through the whole dataset to sieve through the positive and negative cases.</p>
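<p>One way to keep every positive case while discarding some negatives is sketched below in Python. The function name, the <code>is_fraud</code> predicate, and the per-record weights are assumptions made for this illustration; the weighting is a common correction for downsampling, not something this post prescribes. Note that the loop still reads every record, which is the point made above.</p>

```python
import random


def downsample_negatives(records, is_fraud, negative_rate, seed=0):
    """Keep all positive (fraud) records and a random fraction of negatives.

    Returns (record, weight) pairs. Each kept negative carries a weight of
    1 / negative_rate so that weighted counts still approximate the full data.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    kept = []
    for record in records:  # the whole dataset is still scanned once
        if is_fraud(record):
            kept.append((record, 1.0))  # never discard a rare positive case
        elif negative_rate > rng.random():
            kept.append((record, 1.0 / negative_rate))
    return kept


# Hypothetical data: integers stand in for events, every 100th one is fraud.
events = range(1000)
sample = downsample_negatives(events, lambda e: e % 100 == 0, negative_rate=0.1)
print(len(sample), "records kept out of", len(events))
```

<p>On a cluster, the same filter could run independently on each input split, since the decision for one record does not depend on any other.</p>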
<p>
Hadoop is known for its raw processing power.&#160; Nothing can compare with the throughput of thousands of machines, each of which has multiple cores.&#160; As was reported recently at the Hadoop Summit 2010, the largest installations of Hadoop have 2,000 to 4,000 computers with 8 to 12 cores each, amounting to up to 48,000 active threads looking for a pattern at the same time.&#160; This allows either (a) looking through larger periods of time to incorporate events across a larger time frame or (b) taking more sources of information into account. &#160;It is quite common among social network companies to comb through Twitter posts in search of relevant data.</p>
<p>
Finally, one of the fraud prevention problems is latency.&#160; The agencies want to react to an event as soon as possible, often within a few minutes of the event.&#160; Yahoo recently reported that it can adjust its behavioral model in response to a user click event within 5-7 minutes across several hundred million customers and billions of events per day.&#160; Cloudera has developed a tool, Flume, that can load billions of events into HDFS within a few seconds for analysis using MapReduce.</p>
<p>
Often fraud detection is akin to &#8220;finding a needle in a haystack&#8221;.&#160; One has to go through mountains of relevant and seemingly irrelevant information, build dependency models, evaluate the impact, and thwart the fraudster&#8217;s actions.&#160; Hadoop helps with finding patterns by processing mountains of information on thousands of cores in a relatively short amount of time.</p>
<p><h3>Where to look next?</h3>
<p>Techniques for fraud detection are, as a rule, industry-specific, and they are often kept confidential since they represent valuable information for potential fraudsters.&#160; Moreover, fraud detection techniques are usually a moving target since fraudsters quickly adjust to new fraud detection mechanisms.</p>
<p>
One of the most publicized technical frauds is click fraud in on-line advertising.&#160; Advertisers are often charged on a per-click basis (so-called PPC campaigns), so a traffic provider such as a search web site has a clear incentive to inflate the number of clicks.&#160; (Advertisers can also be charged on a per-conversion basis, which we will cover shortly, but a different type of fraud emerges there, in which the advertiser tries to conceal the conversions.)&#160; Additionally, an advertiser&#8217;s competitor may be incentivized to inflate the number to skew the original advertiser&#8217;s margin.&#160; This can be achieved by a human or software agent that generates extra traffic and clicks on the competitor&#8217;s site.&#160; Fraud management companies like <a href="http://www.fraudwall.com/" target="_blank">Anchor Intelligence</a> and <a href="http://www.clickforensics.com/" target="_blank">Click Forensics</a> estimate that approximately 20% to 30% of all clicks are fraudulent.&#160; How do we know that a click is fraudulent?</p>
<p>
Decline in the number of conversions &#8212; first and most important, if your conversion rate is normally positive (that is, you are making a profit on your ad), and all of a sudden, conversion dives into negative numbers, this could be a sign of click fraud in action.&#160; Click fraud causes extra clicks on your ad with no actual purchases, and your conversion rate will fall accordingly.</p>
<p>
An abnormal number of clicks from the same IP address or a pattern in the access times &#8212; although this is the most obvious and easily identified form of click fraud, it is amazing how many fraudsters still use this method, particularly for quick attacks.&#160; They may choose to strike over a long weekend when they figure you may not be watching your log files carefully, clicking on your ad repeatedly so that when you return to work on Tuesday, your account is significantly depleted.&#160; Part of this activity might be unintentional, for example when a user tries to reload a page.</p>
<p>
Large &#8220;abandonment rate&#8221;, or numbers of visitors who leave your site quickly &#8212; another indication of click fraud can be a pattern of visitors clicking on your ad, spending the minimum amount of time on your site required by your PPC search engine to establish it as a valid click (usually 30 seconds or more), and then leaving without having left the landing page at all.</p>
<p>
A large number of impressions without follow-through clicks on your ad &#8212; if you notice many more impressions (views) than clicks, this could indicate impression fraud. Artificial inflation of your ad impressions may cause your clickthrough rates to drop below the Google minimum, and your ad will be disabled.&#160; Until you realize this, your competitors have free rein to use your keywords, sometimes at bargain prices.&#160; As well, your relevancy ratings for search engines may drop as they record numerous impressions, but no interest shown via visits to other parts of your website, which could lead to a shutdown of your campaign.</p>
<p>
Abnormally high clicks and impressions on affiliate websites &#8212; although affiliates themselves are sometimes involved in conducting click fraud schemes, they can be victims of click fraud themselves.&#160; If one of their competitors uses this same method of excessive clicks and impressions on an affiliate&#8217;s site, the PPC search engine will soon notice an abnormally high payment to a certain affiliate and perhaps go as far as canceling that affiliate&#8217;s account, even though he or she was not engaging in any form of click fraud.</p>
<p>
A large number of clicks coming from countries outside of your normal market area &#8212; using IP geo-location services, you can identify which country an IP address is probably coming from.</p>
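<p>To make the simplest of these signals concrete, the Python sketch below counts clicks per IP address and ad in the map-and-reduce style that a Hadoop job would use, then flags sources above a threshold. The log format, the function names, and the threshold are all hypothetical; a production job would run on the cluster rather than in a single process, and a real detector would combine several of the signals above.</p>

```python
from collections import defaultdict


def map_click(log_line):
    """Map step: emit ((ip, ad_id), 1) for one click record.

    Assumes a hypothetical log format of "timestamp ip ad_id" per line.
    """
    timestamp, ip, ad_id = log_line.split()
    yield (ip, ad_id), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (ip, ad_id) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals


def suspicious_sources(log_lines, max_clicks):
    """Return the (ip, ad_id) keys whose click count exceeds max_clicks."""
    pairs = (pair for line in log_lines for pair in map_click(line))
    totals = reduce_counts(pairs)
    return {key: n for key, n in totals.items() if n > max_clicks}


# Hypothetical click log: one address clicks the same ad three times.
log = [
    "1282713000 10.0.0.1 adA",
    "1282713005 10.0.0.1 adA",
    "1282713009 10.0.0.1 adA",
    "1282713020 10.0.0.2 adA",
]
print(suspicious_sources(log, max_clicks=2))
```

<p>A fixed threshold will also flag the innocent page reloads mentioned above, so in practice the limit would need to be tuned against known-good traffic.</p>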
<p>
In the case of performance-based advertising, the advertiser himself is interested in concealing some of the traffic, not inflating it.&#160; Since most performance-based measurement relies on beacons or pixels placed on the advertiser&#8217;s conversion page, the advertiser has an incentive to (temporarily) block the traffic from the beacon or to remove it from the web site completely.</p>
<p>
Fraud is prevalent in the telecom industry.&#160; One of the leading commercially available fraud detection products is the <a href="http://h20208.www2.hp.com/cms/solutions/ci-b/cv/frm.jsp" target="_blank">HP FMS system</a>, on which the author had the pleasure of working personally.&#160; The types of telecom fraud include:</p>
<p>
Subscription fraud &#8212; involves the acquisition of telecommunications services using stolen or false credentials and/or identity with no intention of paying. With subscription fraud, not only do service providers lose revenue, but also individual consumers are vulnerable to having their identity stolen and credit rating tarnished.</p>
<p>
Technical/network fraud &#8212; occurs when someone uses equipment or technology to gain access to a service without paying. Fraudulent calls are typically billed to the legitimate owner of the line or service.&#160; Wireless examples include cloning of cell phones or subscriber identity module (SIM) cards. Fixed line examples include clip on or line tapping, private branch exchange (PBX) hacking and calling card fraud. Prepaid services also have a large exposure to fraud with terminal tampering via magnetic strips or SIM chips, or recharging with stolen credit card numbers.</p>
<p>
Insider fraud &#8212; occurs when individuals inside the operator provide fraudulent access to networks or otherwise thwart the ability of the operator to be paid for services used.</p>
<p>
Handset abuse &#8212; is what takes place when stolen or lost handsets are used to consume telecommunications services that are in turn paid for by the service provider.&#160; This is an expensive liability for carriers who absorb the costs.</p>
<p>
Social engineering &#8212; is an effective fraud technique in which people unwittingly help perpetrators by providing sensitive data, illicit access or simply forwarding their calls without ever knowing they have done anything wrong.</p>
<p>
All these patterns can be detected with special MapReduce pattern detection techniques.  Flume offers low-latency stream processing capabilities.</p>
<p>
Needless to say, fraudsters also explore the potential market and invent new ways to commit fraud.&#160; One example is <a href="http://www.clickmonkeys.com/about" target="_blank">Click Monkeys</a>, which deploys a vessel with animals off the coast of California to generate seemingly random traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoop-for-fraud-detection-and-prevention/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hadoop Administrator Training Comes to London</title>
		<link>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/</link>
		<comments>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 15:00:25 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4417</guid>
		<description><![CDATA[Cloudera&#8217;s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We&#8217;ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator&#8217;s point of view. Take the certification exam [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Cloudera&#8217;s <a href="http://www.eventbrite.com/directory?q=cloudera&amp;loc=london&amp;page=1">Hadoop Training and Certification</a> for <a href="http://www.eventbrite.com/event/762684209">System Administrators</a> has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We&#8217;ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator&#8217;s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.</p>
<p style="text-align: justify">
<p style="text-align: justify">Enter the code &#8220;london_10pct&#8221; when&#160;<a href="http://www.eventbrite.com/event/762684209">registering</a> and receive a 10% discount!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/what-is-hadoop/"><img class="size-medium wp-image-4448 aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hadoop+elephant_rgb-300x107.png" alt="" width="370" height="130" /></a></p>
<p style="text-align: justify">Hadoop is a rapidly growing field. Prove your expertise by attaining certification from the world&#8217;s foremost Hadoop training and consulting company.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/london-hadoop-administrative-training-certificatio/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

