<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; community</title>
	<atom:link href="http://www.cloudera.com/blog/category/community/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Meet the Presenter: Todd Lipcon</title>
		<link>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/</link>
		<comments>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/#comments</comments>
		<pubDate>Mon, 14 May 2012 17:44:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[Hadoop Summit]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14877</guid>
		<description><![CDATA[Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit. Question: Tell us about your current role and how you interact with Apache Hadoop? Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open [...]]]></description>
			<content:encoded><![CDATA[<p>Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting <a href="http://hadoopsummit.org/program/#session32" target="_blank"><em>Optimizing MapReduce Job Performance</em></a> at Hadoop Summit.</p>
<h2>Question: Tell us about your current role and how you interact with Apache Hadoop?</h2>
<p><strong>Todd:</strong> I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.</p>
<h2>Question: Tell us about your Hadoop Summit presentation?</h2>
<p><strong>Todd:</strong> At this year’s summit, I will be presenting about the internals of MapReduce and how you can tune your MapReduce jobs for optimal performance. A lot of developers see MapReduce as a black box, but looking inside that box can help you understand where you might have bottlenecks or easy opportunities to improve performance by changing a few configuration parameters.</p>
<h2>Question: What do you expect will be the key takeaway for folks attending your session?</h2>
<p><strong>Todd:</strong> I hope attendees will walk away with a better understanding of each of the phases of MapReduce task execution, and a few key configuration parameters they can play with to get better performance without changing their code.</p>
<h2>Question: What other presentations are you most looking forward to attending?</h2>
<p><strong>Todd:</strong> I’m really looking forward to Josh Wills’ talk on BranchReduce: Distributed Branch-and-Bound on YARN. There are a lot of optimization problems which can be solved by branch-and-bound approaches, and it’s only recently with the introduction of YARN that these types of algorithms can be efficiently built on Hadoop. Not only is this a fresh topic, Josh is also an entertaining speaker!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache MRUnit 0.9.0-incubating has been released!</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/#comments</comments>
		<pubDate>Wed, 02 May 2012 04:38:51 +0000</pubDate>
		<dc:creator>Brock Noland</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[Hadoop Testing]]></category>
		<category><![CDATA[Map Reduce Testing]]></category>
		<category><![CDATA[MapReduce Testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14603</guid>
		<description><![CDATA[This post was originally posted on the Apache Software Foundation&#8217;s blog. We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was originally posted on the <a href="https://blogs.apache.org/mrunit/entry/apache_mrunit_0_9_0" target="_blank">Apache Software Foundation&#8217;s blog</a>.</em></p>
<p>We (the Apache <abbr title="MapReduce Unit">MRUnit</abbr> team) have just released Apache MRUnit 0.9.0-incubating (<a href="http://www.apache.org/dyn/closer.cgi/incubator/mrunit/" target="_blank">tarball</a>, <a href="https://repository.apache.org/index.html#nexus-search;gav~org.apache.mrunit~~~~" target="_blank">nexus</a>, <a href="http://incubator.apache.org/mrunit/documentation/javadocs/0.9.0-incubating/index.html" target="_blank">javadoc</a>). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they&#8217;re deployed to a production system.</p>
<p>The MRUnit project is quite active: 0.9.0 is our fourth release since entering the incubator, and we have added 4 new committers beyond the project&#8217;s initial charter! We are very interested in having new contributors and committers join the project! Please join our <a href="http://incubator.apache.org/mrunit/community/mailing_lists.html" target="_blank">mailing list</a> to find out how you can help!</p>
<p>The MRUnit build process has changed to produce mrunit-0.9.0-hadoop1.jar and mrunit-0.9.0-hadoop2.jar instead of mrunit-0.9.0-hadoop020.jar, mrunit-0.9.0-hadoop100.jar and mrunit-0.9.0-hadoop023.jar. The hadoop1 classifier is for all Apache Hadoop versions based off the 0.20.X line including 1.0.X. The hadoop2 classifier is for all Apache Hadoop versions based off the 0.23.X line including the unreleased 2.0.X.</p>
<p>This <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12311292&#038;version=12316360" target="_blank">release</a> contains 2 new features, 15 improvements and 6 bug fixes. I will highlight a few below:</p>
<ul>
<li>Support custom counter checking in <a href="https://issues.apache.org/jira/browse/MRUNIT-68" target="_blank">MRUNIT-68</a></li>
<li>runTest() should optionally ignore output order in <a href="https://issues.apache.org/jira/browse/MRUNIT-91" target="_blank">MRUNIT-91</a></li>
<li>Driver.runTest throws RuntimeException should it throw AssertionError in <a href="https://issues.apache.org/jira/browse/MRUNIT-54" target="_blank">MRUNIT-54</a></li>
<li>o.a.h.mrunit.mapreduce.MapReduceDriver should support a combiner in <a href="https://issues.apache.org/jira/browse/MRUNIT-67" target="_blank">MRUNIT-67</a></li>
<li>Better support for other serializations besides Writable:  <a href="https://issues.apache.org/jira/browse/MRUNIT-70" target="_blank">MRUNIT-70</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-86">MRUNIT-86</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-99" target="_blank">MRUNIT-99</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-77" target="_blank">MRUNIT-77</a></li>
<li>Better error messages from validate, null checking and forgetting to set mappers and reducers: <a href="https://issues.apache.org/jira/browse/MRUNIT-74" target="_blank">MRUNIT-74</a>, <a href="https://issues.apache.org/jira/browse/MRUNIT-66" target="_blank">MRUNIT-66</a>, <a href="https://issues.apache.org/jira/browse/MRUNIT-65" target="_blank">MRUNIT-65</a></li>
<li>add static convenience methods to PipelineMapReduceDriver class in <a href="https://issues.apache.org/jira/browse/MRUNIT-89" target="_blank">MRUNIT-89</a></li>
<li>Test and Deprecate Driver.{*OutputFromString,*InputFromString} Methods in <a href="https://issues.apache.org/jira/browse/MRUNIT-48" target="_blank">MRUNIT-48</a></li>
</ul>
<h2 style="font-size:14pt;color:#243543;">Support custom counter checking</h2>
<p>It has always been possible to check the counter values like so:</p>
<pre class="code">assertEquals(2, mapDriver.getCounters().findCounter(CustomMapper.CustomCounter.NAME).getValue());
</pre>
<p>but this is quite tedious. As such Jarek Jarcec Cecho (our second newest committer) added this feature directly to the drivers:</p>
<pre class="code">.withCounter(CustomMapper.CustomCounter.NAME, 2);
</pre>
<h2 style="font-size:14pt;padding-top:16px;color:#243543;">runTest() should optionally ignore output order</h2>
<p>Before this change, MRUnit required Mapper/Reducer classes to output key-value pairs in the order specified in the test. A well-defined output order is common, but not strictly universal. Dave Beech (our newest committer) contributed a patch so you can optionally turn this ordering requirement off by using:</p>
<pre class="code">.runTest(false)
</pre>
<p style="padding-top:12px">instead of</p>
<pre class="code">.runTest()
</pre>
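<p>As a rough illustration of what relaxing the ordering requirement means, the sketch below compares expected and actual output records either position by position or as unordered collections. It is plain Java and is not MRUnit&#8217;s implementation; the class name OutputOrderCheck, the method outputsMatch and the tab-separated record strings are hypothetical names chosen for this example.</p>

```java
import java.util.Arrays;

public class OutputOrderCheck {
    /**
     * Hypothetical helper: returns true when actual matches expected.
     * Position is only enforced when orderMatters is true.
     */
    static boolean outputsMatch(String[] expected, String[] actual, boolean orderMatters) {
        if (orderMatters) {
            // Strict mode: same records in the same positions.
            return Arrays.equals(expected, actual);
        }
        // Relaxed mode: sort copies (the caller's arrays stay untouched)
        // and compare, so duplicates still have to match in number.
        String[] sortedExpected = expected.clone();
        String[] sortedActual = actual.clone();
        Arrays.sort(sortedExpected);
        Arrays.sort(sortedActual);
        return Arrays.equals(sortedExpected, sortedActual);
    }

    public static void main(String[] args) {
        String[] expected = {"apple\t1", "banana\t2"};
        String[] actual = {"banana\t2", "apple\t1"};
        System.out.println(outputsMatch(expected, actual, true));   // prints false
        System.out.println(outputsMatch(expected, actual, false));  // prints true
    }
}
```

<p>Sorting copies rather than using a set keeps duplicate records significant, which matters when a mapper legitimately emits the same pair more than once.</p>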
<h2 style="font-size:14pt;line-height:1.3em;padding-top:16px;color:#243543;">Driver.runTest throws RuntimeException should it throw AssertionError</h2>
<p>Previous versions of MRUnit threw a RuntimeException when a test failed. This worked well, but it meant that testing frameworks saw the test as having erred, not failed. We have changed this to AssertionError so that testing frameworks see the tests as failed. The distinction is small but important.</p>
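<p>The difference between a failure and an error can be sketched with a tiny stand-in for a test runner. This is an assumption-laden illustration in plain Java, not the code of JUnit or MRUnit: the class FailureVersusError and its methods are hypothetical, and the runner is reduced to a single classify method that treats AssertionError as a failed check and any other RuntimeException as an error.</p>

```java
public class FailureVersusError {
    /** Hypothetical runner: classifies a test body by the kind of throwable it raises. */
    static String classify(Runnable testBody) {
        try {
            testBody.run();
            return "passed";
        } catch (AssertionError e) {
            // A check did not hold: the code under test produced the wrong result.
            return "failed";
        } catch (RuntimeException e) {
            // Something unexpected happened: the test could not complete.
            return "erred";
        }
    }

    // The same mismatch reported in two different ways.
    static void reportWithAssertionError() {
        throw new AssertionError("expected (a, 1) but received (a, 2)");
    }

    static void reportWithRuntimeException() {
        throw new RuntimeException("expected (a, 1) but received (a, 2)");
    }

    static void noMismatch() {
        // Nothing to report.
    }

    public static void main(String[] args) {
        System.out.println(classify(FailureVersusError::reportWithAssertionError));   // prints failed
        System.out.println(classify(FailureVersusError::reportWithRuntimeException)); // prints erred
        System.out.println(classify(FailureVersusError::noMismatch));                 // prints passed
    }
}
```

<p>With the same mismatch message, only the exception type decides which bucket the report lands in, which is why the choice of AssertionError affects how results are summarized.</p>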
<h2 style="font-size:14pt;color:#243543;">o.a.h.mrunit.mapreduce.MapReduceDriver should support a combiner</h2>
<p>Previously, MRUnit only supported a combiner in the mapred MapReduceDriver class, but now the mapreduce MapReduceDriver also supports a combiner via:</p>
<pre class="code">MapReduceDriver.newMapReduceDriver(mapper, reducer, combiner)</pre>
<p style="padding-top:12px">or</p>
<pre class="code">.withCombiner(combiner) or .setCombiner(combiner)</pre>
<h2 style="font-size:14pt;padding-top:16px;color:#243543;">Better support for other serializations besides Writable</h2>
<p>Previous versions of MRUnit did not support JavaSerialization, Avro or other serialization frameworks well. We improved alternative serialization support by not forcing K2 in MapReduceDriver to be Comparable and by supporting serializations that cannot clone into an object or that do not have default constructors.</p>
<h2 style="font-size:14pt;line-height:1.3em;color:#243543;">Better error messages from validate, null checking and forgetting to set mappers and reducers</h2>
<p>We have improved the checking of parameters passed to MRUnit and the error messages shown when parameters are invalid. This includes throwing a NullPointerException immediately when receiving a null value, and throwing an IllegalStateException instead of a NullPointerException when no mapper or reducer class is provided.</p>
<h2 style="font-size:14pt;color:#243543;">Add static convenience methods to PipelineMapReduceDriver class</h2>
<p>We added static convenience constructors similar to those in the other driver classes:</p>
<pre class="code">PipelineMapReduceDriver.newPipelineMapReduceDriver()</pre>
<p style="padding-top:12px">or</p>
<pre class="code">PipelineMapReduceDriver.newPipelineMapReduceDriver(list of Pair&lt;Mapper, Reducer&gt;)</pre>
<h2 style="font-size:14pt;line-height:1.3em;padding-top:16px;color:#243543;">Test and Deprecate Driver.{*OutputFromString,*InputFromString} Methods</h2>
<p>The OutputFromString and InputFromString methods are now deprecated because they required Text inputs or outputs with no way to enforce that the inputs or outputs from a mapper or reducer were actually Text. These methods also provided little convenience, as a user can just pass the intended string to new Text(string).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Operations Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 13:00:03 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14471</guid>
		<description><![CDATA[HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out. For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.</p>
<div style="float: right; padding-left: 12px; padding-top: 16px;"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low latency and high scalability. Presentations in the HBaseCon 2012 Operations track will explain the state of HBase today, how to mitigate HBase failures, and best practices in cluster deployment and cluster monitoring.</p>
<h2 style="font-size: 18pt;">Operations Track Presentations</h2>
<p style="padding-top: 8px;"><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Case Study of HBase Operations at Facebook</span></a><br /> <a href="http://www.hbasecon.com/speakers/ryan-thiessen/">Ryan Thiessen</a>, Facebook</p>
<p>At Facebook we have demanding HBase installations which are used for important and real-time user activity, so failure in an HBase cluster can be a serious issue requiring immediate attention. This session will discuss a variety of real-world scenarios where we have had failures in our HBase systems, how our Operations and Engineering teams have worked to mitigate many of these issues, and where HBase still needs to improve instead of relying on workarounds. The database should never go down. This talk is aimed at developers and other users of HBase (both current and potential) who are interested in an operational perspective on the state of HBase today.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Backup</span></a><br /> <a href="http://www.hbasecon.com/speakers/sunil-sitaula/">Sunil Sitaula</a>, Cloudera<br /><a href="http://www.hbasecon.com/speakers/madhuwanti-vaidya/">Madhuwanti Vaidya</a>, Facebook</p>
<p>Reliable backup and recovery is one of the main requirements for any enterprise-grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such they are looking for backup solutions that are reliable, easy to use, and can work with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve the performance of the backup and recovery process with minimal impact on the production cluster.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Security for the Enterprise</span></a><br /> <a href="http://www.hbasecon.com/speakers/andrew-purtell/">Andrew Purtell</a>, Trend Micro</p>
<p>Trend Micro developed the new security features in HBase 0.92 and has the first known deployment of secure HBase in production. We will share our motivations, use cases, experiences, and provide a 10 minute tutorial on how to set up a test secure HBase cluster and a walk through of a simple usage example. The tutorial will be carried out live on an on-demand EC2 cluster, with a video backup in case of network or EC2 unavailability.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Developing Real Time Analytics Applications Using HBase in the Cloud</span></a><br /> <a href="http://www.hbasecon.com/speakers/rick-tucker/">Rick Tucker</a>, Sproxil</p>
<p>As small companies are adapting to handle Big Data, the cloud and HBase enable developers to leverage that data to provide revenue generating real-time applications. When developing a real-time application for an existing system, one must balance incrementing counters in real-time with MapReduce jobs over the same data-set. When maintaining an analytics platform, ensuring data accuracy is essential. At Sproxil, SMS logs are ingested into HBase at a growing rate and we report metrics such as SMS throughput, unique user growth over time, and return SMS user activity in real time. Sproxil provides a versatile analytics application enabling customers to handpick statistics on demand to gain market insights enabling them to react quickly to trends. This talk will identify the most profitable metrics and demonstrate how to calculate them using Map Reduce while continually updating data as it arrives.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Unique Sets on HBase and Hadoop</span></a><br /> <a href="http://www.hbasecon.com/speakers/elliott-clark/">Elliott Clark</a>, StumbleUpon</p>
<p>Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Orchestrating Clusters with Ironfan and Chef</span></a><br /> <a href="http://www.hbasecon.com/speakers/robert-berger/">Robert Berger</a>, Runa</p>
<p>This session will discuss how you can represent your complete cluster with one config file and have it deployed to Cloud or Bare Metal. Infochimps’ Ironfan builds on Opscode Chef to allow you to specify and orchestrate all flavors of your cluster’s deployment, monitoring and growth. Not just the core HBase/HDFS/MapReduce/Hive/Flume, etc. but all the elements including web / app servers, mysql, redis, rabbitmq and whatever other servers are needed to implement your service. These same tools can manage variations for development, staging, R&amp;D as well as the target “rendering” to various Clouds, Bare Metal or even Vagrant VMs.</p>
<p><a href="http://hbaseconsf.eventbrite.com/" target="_blank"><img src="http://www.hbasecon.com/wp-content/uploads/2012/02/btn-register-small.png" alt="Register for HBaseCon 2012" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Development Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 22:46:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hbase community]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14331</guid>
		<description><![CDATA[HBaseCon 2012 is nearly a month away, and if the conference agenda and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss. Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/" target="_blank" title="HBaseCon 2012">HBaseCon 2012</a> is nearly a month away, and if the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">conference agenda</a> and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss.</p>
<p>Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the <a href="http://www.hbasecon.com/program-committee" target="_blank" title="HBaseCon 2012 Program Committee">HBaseCon Program Committee</a>, constructors of the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">HBaseCon 2012 agenda</a>, are all committers and PMC members of the Apache HBase project.</p>
<div style="float:right;padding-left:12px"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>Presentations in the HBaseCon 2012 Development track will explain how and why HBase is built the way it is and will also cover HBase schema design and HDFS, the file system on which HBase is most commonly deployed. Some of the presentations in this track are listed below.</p>
<h2 style="font-size:16pt">Development Track Presentations</h2>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Learning HBase Internals</span></a><br />
<a href="http://www.hbasecon.com/speakers/lars-hofhansl/">Lars Hofhansl</a>, Salesforce.com</p>
<p>The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users,” and give voice to some of the deep knowledge locked in the committers’ heads.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lessons learned from OpenTSDB</span></a><br />
<a href="http://www.hbasecon.com/speakers/benoit-sigoure/">Benoit Sigoure</a>, StumbleUpon</p>
<p>OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created, one that can store and serve billions of data points forever without the need for destructive downsampling, one that could scale to millions of metrics, and where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and what were some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics discussed will be around the schema, how it impacts performance and allows concurrent writes without need for coordination in a distributed cluster of OpenTSDB instances.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase Schema Design</span></a><br />
<a href="http://www.hbasecon.com/speakers/ian-varley/">Ian Varley</a>, Salesforce.com</p>
<p>Most developers are familiar with the topic of “database design.” In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase and HDFS: Past, Present, and Future</span></a><br />
<a href="http://www.hbasecon.com/speakers/todd-lipcon/">Todd Lipcon</a>, Cloudera</p>
<p>Apache HDFS, the file system on which HBase is most commonly deployed, was originally designed for high-latency high-throughput batch analytic systems like MapReduce. Over the past two to three years, the rising popularity of HBase has driven many enhancements in HDFS to improve its suitability for real-time systems, including durability support for write-ahead logs, high availability, and improved low-latency performance. This talk will give a brief history of some of the enhancements from Hadoop 0.20.2 through 0.23.0, discuss some of the most exciting work currently under way, and explore some of the future enhancements we expect to develop in the coming years. We will include both high-level overviews of the new features as well as practical tips and benchmark results from real deployments.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Relaxed Transactions for HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/francis-liu/">Francis Liu</a>, Yahoo!</p>
<p>For Map/Reduce programmers used to HDFS, the mutability of HBase tables poses new challenges: Data can change over the duration of a job, multiple jobs can write concurrently, writes are effective immediately, and it is not trivial to clean up partial writes. Revision Manager introduces atomic commits and point-in-time consistent snapshots over a table, guaranteeing repeatable reads and protection from partial writes. Revision Manager is optimized for a relatively small number of concurrent write jobs, which is typical within Hadoop clusters. This session will discuss the implementation of Revision Manager using ZooKeeper and coprocessors, with extra care taken to ensure security in multi-tenant clusters. Revision Manager is available as part of the HBase storage handler in HCatalog, but can easily be used stand-alone with little coding effort.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Living Data: Applying Adaptable Schemas to HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/aaron-kimball/">Aaron Kimball</a>, WibiData</p>
<p>HBase application developers face a number of challenges: schema management is performed at the application level, decoupled components of a system can break one another in unexpected ways, less-technical users cannot easily access data, and evolving data collection and analysis needs are difficult to plan for. In this talk, we describe a schema management methodology based on Apache Avro that enables users and applications to share data in HBase in a scalable, evolvable fashion. By adopting these practices, engineers independently using the same data have guarantees on how their applications interact. As data collection needs change, applications are resilient to drift in the underlying data representation. This methodology results in a data dictionary that allows less-technical users to understand what data is available to them for analysis and inspect data using general-purpose tools (for example, export it via Sqoop to an RDBMS). And because of Avro’s cross-language capabilities, HBase’s power can reach new domains, like web apps built in Ruby.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Applications Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 13:00:32 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBaseCon]]></category>
		<category><![CDATA[Hbasecon sessions]]></category>
		<category><![CDATA[hbasecon talks]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14074</guid>
		<description><![CDATA[HBaseCon 2012 is coming to San Francisco on May 22, less than 2 months away! The conference agenda continues to grow daily with exciting presentation content, which means it’s time to share a few sessions that have been added to the HBaseCon 2012 Applications Track. Apache HBase is primarily used for real-time random read/write access [...]]]></description>
			<content:encoded><![CDATA[<div style="float:left;padding-right:20px"><a href="http://www.hbasecon.com"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p style="padding-left:220px"><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is coming to San Francisco on May 22, less than 2 months away! The <a href="http://www.hbasecon.com/agenda">conference agenda</a> continues to grow daily with exciting presentation content, which means it’s time to share a few sessions that have been added to the HBaseCon 2012 Applications Track.</p>
<p>Apache HBase is primarily used for real-time random read/write access to Big Data as part of the Hadoop ecosystem. Applications on Apache HBase are typically built to query Big Data with extremely low latency. Sessions in the HBaseCon 2012 Applications Track will include explanations of real-world HBase use cases, where HBase fits in an organization’s entire Big Data stack, and when HBase is the “right” solution for an organization.</p>
<h2 style="font-size:16pt">Applications Track Presentations</h2>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Building a Large Search Platform on a Shoestring Budget</span></a><br />
<a href="http://www.hbasecon.com/speakers/jacques-nadeau/">Jacques Nadeau</a>, CTO at YapMap</p>
<p>YapMap is a new kind of search platform that does multi-quanta search to better understand threaded discussions. This talk will cover how HBase made it possible for two self-funded guys to build a new kind of search platform. The presentation will discuss the YapMap data model and how YapMap uses row-based atomicity to manage parallel data integration problems. Also learn where YapMap does not use HBase and instead uses a traditional SQL-based infrastructure; the benefits of using MapReduce and HBase for index generation; the YapMap migration of tasks from a message-based queue to the Coprocessor framework; and YapMap’s future Coprocessor use cases. Lastly, learn about YapMap’s operational experience with HBase, hardware choices and the challenges YapMap has faced.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Low Latency “OLAP” with HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/cosmin-lehene/">Cosmin Lehene</a>, Computer Scientist at Adobe Systems</p>
<p>Adobe Systems uses “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in real-time. The goal was to process new data in real-time (currently minutes) and have it ready for a large number of concurrent queries that execute in milliseconds. This set Adobe’s problem apart from what is traditionally solved with Hive or Pig. This talk will describe the design and the strategies (and hacks) used to achieve low latency and scalability, from theoretical model to the entire process of ETL to warehousing and queries.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Growing Your Inbox, HBase at Tumblr</span></a><br />
<a href="http://www.hbasecon.com/speakers/blake-matheny/">Blake Matheny</a>, Director of Platform Engineering at Tumblr</p>
<p>This talk goes into detail about Tumblr’s experience developing Motherboy, an eventually consistent inbox-style storage system built around HBase. The SLA, write concurrency, data volume, and failure modes for this application created a number of challenges in developing a solution. The user homing scheme introduced additional complexity that made capacity planning tricky as Tumblr tried to trade off availability and cost. Performance testing of Tumblr’s workload, and automation to support that testing, also provided a number of valuable lessons. This talk will be most useful to people considering HBase for their application, but will have enough detail to be useful to current HBase users as well.</p>
<p><a href="http://hbaseconsf.eventbrite.com/" target="_blank"><img src="http://www.hbasecon.com/wp-content/uploads/2012/02/btn-register-small.png" alt="Register for HBaseCon 2012" /></a></p>
<p>Be sure to check the <a href="http://www.hbasecon.com/agenda">agenda</a> in the coming weeks, as we will be adding more sessions soon. Remember that the Early Bird registration price expires this Friday, April 6, so register soon to take advantage of the discount.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache Bigtop 0.3.0 (incubating) has been released</title>
		<link>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/</link>
		<comments>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/#comments</comments>
		<pubDate>Tue, 03 Apr 2012 17:58:56 +0000</pubDate>
		<dc:creator>Roman Shaposhnik</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[apache bigtop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14011</guid>
		<description><![CDATA[Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested: Apache Hadoop 1.0.1 [...]]]></description>
			<content:encoded><![CDATA[<p>Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:</p>
<ul style="padding-left:20px">
<li>Apache Hadoop 1.0.1</li>
<li>Apache Zookeeper 3.4.3</li>
<li>Apache HBase 0.92.0</li>
<li>Apache Hive 0.8.1</li>
<li>Apache Pig 0.9.2</li>
<li>Apache Mahout 0.6.1</li>
<li>Apache Oozie 3.1.3</li>
<li>Apache Sqoop 1.4.1</li>
<li>Apache Flume 1.0.0</li>
<li>Apache Whirr 0.7.0</li>
</ul>
<p>The list of supported Linux platforms has expanded to:</p>
<ul style="padding-left:20px">
<li>Fedora 15 and 16</li>
<li>CentOS and Red Hat Enterprise Linux 5 and 6</li>
<li>SuSE Linux Enterprise 11</li>
<li>Ubuntu 10.04 LTS</li>
<li>Mageia 1</li>
</ul>
<p>This, we hope, will make our user community&#8217;s experience running Apache Hadoop the most seamless Bigtop experience to date: just follow our <a title="Installation Guide" href="https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop" target="_blank">Installation Guide</a> and you will have your first pseudo-distributed Hadoop PI or Hive query running in no time.</p>
<p>If you&#8217;re thinking about deploying Bigtop to a fully-distributed cluster, you might find our improved <a title="Puppet" href="http://puppetlabs.com/" target="_blank">Puppet</a> code to be of assistance. There is some <a title="brief documentation" href="https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.3/bigtop-deploy/puppet/README.md">brief documentation</a> on how to run our Puppet recipes in a masterless Puppet configuration, but they should work just fine in a typical Puppet master setup as well.</p>
<p>Whatever you do, don&#8217;t forget to check us out at <a title="Apache" href="http://incubator.apache.org/bigtop/" target="_blank">Apache</a> and consider getting involved. Bigtop is a community-driven effort and we need your help. Of course, above all we need you to use Bigtop and give us your feedback.</p>
<p>Happy Big Data discoveries,<br />Your faithful and tireless Bigtop development team!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>High Availability for the Hadoop Distributed File System (HDFS)</title>
		<link>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/</link>
		<comments>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 13:00:59 +0000</pubDate>
		<dc:creator>Aaron Myers</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hadoop high availability]]></category>
		<category><![CDATA[hdfs high availability]]></category>
		<category><![CDATA[hdfs name node]]></category>
		<category><![CDATA[high availability]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13415</guid>
		<description><![CDATA[Background Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS. HDFS has long been considered [...]]]></description>
			<content:encoded><![CDATA[<h2 style="font-size: 14pt;">Background</h2>
<p>Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.</p>
<p>HDFS has long been considered a highly <em>reliable</em> file system.  An empirical <a href="http://www.youtube.com/watch?v=zbycDpVWhp0">study done at Yahoo!</a> concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.</p>
<p>Despite this very high level of reliability, HDFS has always had a well-known single point of failure which impacts HDFS’s <em>availability</em>: the system relies on a single Name Node to coordinate access to the file system data. In clusters which are used exclusively for ETL or batch-processing workflows, a brief HDFS outage may not have immediate business impact on an organization; however, in the past few years we have seen HDFS begin to be used for more interactive workloads or, in the case of HBase, used to directly serve customer requests in real time. In cases such as this, an HDFS outage will immediately impact the productivity of internal users, and perhaps result in downtime visible to external users. For these reasons, adding high availability (HA) to the HDFS Name Node became one of the top priorities for the HDFS community.</p>
<p>The remainder of this post discusses the implementation of a new feature for HDFS, called the “HA Name Node.” For a detailed discussion of other issues surrounding the availability of Hadoop as a whole, take a look at this <a href="http://www.cloudera.com/blog/2011/02/hadoop-availability/">excellent blog post</a> by my colleague Eli Collins.</p>
<h2 style="font-size: 14pt;">High-level Architecture</h2>
<p>The goal of the HA Name Node project is to add support for deploying two Name Nodes in an active/passive configuration. This is a common configuration for highly-available distributed systems, and HDFS’s architecture lends itself well to this design. Even in a non-HA configuration, HDFS already requires both a Name Node and another node with similar hardware specs which performs checkpointing operations for the Name Node. The design of the HA Name Node is such that the passive Name Node is capable of performing this checkpointing role, thus requiring no additional Hadoop server machines beyond what HDFS already requires.<img src="http://www.cloudera.com/wp-content/uploads/2012/03/HANNdiagram-2.png" alt="Hadoop Distributed File System High Available Name Node" /></p>
<p>The HDFS Name Node is primarily responsible for serving two types of file system metadata: file system namespace information and block locations. Because of the architecture of HDFS, these must be handled separately.</p>
<p><strong>Namespace Information</strong></p>
<p>All mutations to the file system namespace, such as file renames, permission changes, file creations, block allocations, etc., are written to a persistent write-ahead log by the Name Node before returning success to a client call. In addition to this edit log, periodic checkpoints of the file system, called the fsimage, are also created and stored on disk on the Name Node. Block locations, on the other hand, are stored only in memory. The locations of all blocks are received via “block reports” sent from the Data Nodes when the Name Node is started.</p>
<p>The goal of the HA Name Node is to provide a <em>hot standby</em> Name Node that can take over serving the role of the active Name Node with no downtime. To provide this capability, it is critical that the standby Name Node has the most complete and up-to-date file system state possible in memory. Empirically, starting a Name Node from cold state can take tens of minutes to load the namespace information (fsimage and edit log) from disk, and up to an hour to receive the necessary block reports from all Data Nodes in a large cluster.</p>
<p>The Name Node has long supported the ability to write its edit logs to multiple, redundant local directories. To address the issue of sharing state between the active and standby Name Nodes, the HA Name Node feature allows for the configuration of a special shared edits directory. This directory should be available via a network file system, and should be read/write accessible from both Name Nodes. This directory is treated as being <em>required</em> by the active Name Node, meaning that success will not be returned to a client call unless the file system change has been written to the edit log in this directory. The standby Name Node polls the shared edits directory frequently, looking for new edits written by the active Name Node, and reads these edits into its own in-memory view of the file system state.</p>
<p>Note that requiring a single shared edits directory does not necessarily imply a new single point of failure. It does, however, mean that the filer providing this shared directory must itself be HA, and that multiple network routes should be configured between the Name Nodes and the service providing this shared directory. Plans to improve this situation are discussed further below.</p>
<p><strong>Block Locations</strong></p>
<p>The other part of keeping the standby Name Node hot is making sure that it has up-to-date block location information. Since block locations aren’t written to the Name Node edit log, reading from the shared edits directory is not sufficient to share this file system metadata between the two Name Nodes. To address this issue, when HA is enabled, all Data Nodes in the cluster are configured with the network addresses of both Name Nodes. Data Nodes send all block reports, block location updates, and heartbeats to both Name Nodes, but Data Nodes will only act on block commands issued by the currently-active Name Node.</p>
<p>With both up-to-date namespace information and block locations in the standby Name Node, the system is able to perform a failover from the active Name Node to the standby with no delay.</p>
<p><strong>Client Failover</strong></p>
<p>Since multiple distinct daemons are now capable of serving as the active Name Node for a single cluster, the HDFS client must be able to determine which Name Node to communicate with at any given time. The HA Name Node feature does not support an active-active configuration, and thus all client calls must go to the active Name Node in order to be served.</p>
<p>To implement this feature, the HDFS client was extended to support the configuration of multiple network addresses, one for each Name Node, which collectively represent the HA name service. The name service is identified by a single <em>logical URI</em>, which is mapped to the two network addresses of the HA Name Nodes via client-side configuration. If a client makes a call to the standby Name Node, a special result is returned to the client, indicating that it should retry elsewhere. The client tries the configured addresses in order until it finds an active Name Node.</p>
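<p>The retry behavior described above can be illustrated with a small simulation. This is only a sketch, not the HDFS client itself: <code>FakeNameNode</code>, <code>StandbyError</code>, <code>call_with_failover</code>, the logical URI and the host names are all hypothetical names made up for this example.</p>

```python
class StandbyError(Exception):
    """Hypothetical signal: the contacted Name Node is in standby, retry elsewhere."""


class FakeNameNode:
    """Hypothetical stand-in for a Name Node RPC endpoint (not a real HDFS class)."""

    def __init__(self, address, active):
        self.address = address
        self.active = active

    def call(self, operation):
        # A standby Name Node does not serve the call; it tells the client to retry.
        if not self.active:
            raise StandbyError(self.address)
        return f"{operation} served by {self.address}"


def call_with_failover(name_nodes, operation):
    """Try each configured address in order until an active Name Node answers."""
    for name_node in name_nodes:
        try:
            return name_node.call(operation)
        except StandbyError:
            continue  # move on to the next configured address
    raise RuntimeError("no active Name Node found")


# Client-side configuration: one logical URI mapped to two physical addresses.
# The URI and host names are assumptions for illustration only.
name_service = {
    "hdfs://mycluster": [
        FakeNameNode("nn1.example.com:8020", active=False),
        FakeNameNode("nn2.example.com:8020", active=True),
    ]
}

result = call_with_failover(name_service["hdfs://mycluster"], "getFileInfo")
print(result)
```

<p>In this sketch the first address is the standby, so the call is answered by the second one. Callers only ever refer to the logical URI, which is what lets the active role move between machines without changes to application code.</p>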
<p>In the event that the active Name Node crashes while in the middle of processing a request, the client will be unable to determine whether or not the request was processed. For many operations such as reads (or <a href="http://en.wikipedia.org/wiki/Idempotent">idempotent</a> writes such as setting permissions, setting modification time, etc.), this is not a problem &#8212; the client may simply retry after the failover has completed. For others, the error must be bubbled up to the caller to be correctly handled. In the course of the HA project, we extended the Hadoop IPC system to be able to classify each operation’s idempotence using special annotations.</p>
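<p>As a rough illustration of why the idempotence classification matters, the sketch below retries only operations marked as idempotent and lets every other failure propagate to the caller. It does not reproduce the Hadoop IPC annotations; the operation names, <code>ConnectionLost</code>, <code>invoke</code> and <code>make_flaky_send</code> are hypothetical and exist only for this example.</p>

```python
class ConnectionLost(Exception):
    """Hypothetical error: the Name Node went away before a reply arrived."""


# Illustrative classification only; these labels are assumptions for the sketch.
IDEMPOTENT_OPERATIONS = {"read_block_locations", "set_permission", "set_modification_time"}


def invoke(operation, send, max_attempts=3):
    """Retry idempotent operations after a lost connection; surface the rest."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return send(operation)
        except ConnectionLost:
            # Repeating a non-idempotent call could apply it twice, so give up at once.
            if operation not in IDEMPOTENT_OPERATIONS or attempts == max_attempts:
                raise


def make_flaky_send(failures):
    """Build a fake transport that fails the given number of times, then succeeds."""
    remaining = [failures]

    def send(operation):
        if remaining[0] != 0:
            remaining[0] -= 1
            raise ConnectionLost(operation)
        return f"{operation}: ok"

    return send


print(invoke("set_permission", make_flaky_send(1)))
```

<p>Here the permission change succeeds on the second attempt, whereas an operation missing from the idempotent set would raise <code>ConnectionLost</code> after its first failure and leave the decision to the caller.</p>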
<h2 style="font-size: 14pt;">Current Status</h2>
<p>Active development work began on the HA Name Node in August 2011, in a branch off of Apache Hadoop trunk. Development was done under the umbrella JIRAs <a href="https://issues.apache.org/jira/browse/HDFS-1623">HDFS-1623</a> and <a href="https://issues.apache.org/jira/browse/HADOOP-7454">HADOOP-7454</a>. Last Friday, March 2nd, 2012, we merged this branch back into Apache Hadoop trunk. We closed over 170 individual JIRAs in the course of implementing this feature. The stated intention of the community is to merge this work from HDFS trunk into the 0.23 branch, where it will be released as an update of the Apache Hadoop 0.23 release line. Much of this work is already available as part of <a href="http://www.cloudera.com/blog/2012/02/introducing-cdh4/">CDH4 beta 1, released on February 13th, 2012.</a></p>
<p>Once a failover has been initiated, the actual process of stopping the active and starting the standby Name Node takes a matter of seconds or less. This speed allows for little or no detectable service disruption during a failover. I’ve personally run hundreds of MR jobs over a running HA cluster, doing failovers back and forth between two HA Name Nodes, without any job failures.</p>
<p>This first implementation of the HA Name Node supports only manual failover &#8212; that is, failure of one of the Name Nodes is not automatically detected by the system, but rather requires intervention by an operator to initiate a failover between the Name Nodes. Though this is an obvious limitation, this version should still be useful to eliminate the need for planned HDFS downtime in many cases, e.g. changing the configuration of the Name Node, scheduled hardware maintenance of a Name Node, or scheduled OS upgrade of a Name Node.</p>
<h2 style="font-size: 14pt;">Next Up</h2>
<p>The highest priority feature to add to the HA Name Node implementation is support for automatically detecting the failure of the Active Name Node and initiating a failover to the Standby when it is determined that the Active is no longer functional. <a href="https://issues.apache.org/jira/browse/HDFS-3042">HDFS-3042</a> and its sub-tasks are actively being worked on to provide this functionality.</p>
<p>The dependence on an HA filer for HDFS edit logs is a limitation that we’d like to address in the near to medium term as well. Several different options have been discussed to address this:</p>
<ul>
<li><strong>BookKeeper</strong> &#8211; <a href="http://zookeeper.apache.org/doc/r3.2.2/bookkeeperStarted.html">BookKeeper</a> is a highly available write-ahead logging system. Work has already been done to allow the HDFS Name Node to write its edit log to BookKeeper, though this has not yet been tested with the HA Name Node.</li>
<li><strong>Multiple, non-HA filers</strong> &#8211; the HA Name Node presently only supports logging to a single shared edits directory. Perhaps the easiest improvement from the current situation would be to allow the Name Node to log to several shared edits directories, and require that all edits be logged to a quorum of shared edits directories. This proposal is being tracked by <a href="https://issues.apache.org/jira/browse/HDFS-2782">HDFS-2782</a>.</li>
<li><strong>Stream edits to remote NNs</strong> &#8211; in addition to writing edits to a local file system, edit log entries could be sent directly to other Name Nodes over the network. The active Name Node would require a quorum of the involved Name Nodes to acknowledge receipt of the edits before responding with success to the client call.</li>
<li><strong>Store edit logs in HDFS itself</strong> &#8211; systems such as HBase already use HDFS to store a write-ahead log of all data mutations. If HDFS were extended to have a modicum of bootstrapping information, it is not inconceivable that HDFS edit logs could be stored in HDFS itself. This proposal is being discussed on <a href="https://issues.apache.org/jira/browse/HDFS-2601">HDFS-2601</a>.</li>
</ul>
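<p>Two of the options above rely on a quorum acknowledging each edit before the client sees success. A minimal sketch of that idea follows, assuming a simple majority quorum; the <code>Journal</code> class and <code>log_edit</code> function are hypothetical and do not correspond to any of the proposals&#8217; actual designs.</p>

```python
class Journal:
    """Hypothetical stand-in for one edits destination (a directory or a remote node)."""

    def __init__(self, available=True):
        self.available = available
        self.entries = []

    def append(self, edit):
        if not self.available:
            raise OSError("journal unreachable")
        self.entries.append(edit)


def log_edit(edit, journals):
    """Return True only if a majority of journals durably accepted the edit."""
    quorum = len(journals) // 2 + 1  # simple majority, an assumption of this sketch
    acks = 0
    for journal in journals:
        try:
            journal.append(edit)
            acks += 1
        except OSError:
            pass  # one failed journal is tolerated as long as a quorum remains
    return acks >= quorum


journals = [Journal(), Journal(available=False), Journal()]
print(log_edit("rename /a /b", journals))
```

<p>With three journals, the edit is still acknowledged when one of them is unreachable, which is what removes the need for any single destination to be highly available on its own.</p>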
<p>In the next few weeks, we will be evaluating all of these options and selecting one to implement.<br /><br />Currently, deploying HA Name Nodes is somewhat cumbersome, requiring the operator to <a href="https://ccp.cloudera.com/display/CDH4B1/HDFS+High+Availability+Initial+Deployment">manually synchronize the on-disk metadata</a> of the two Name Nodes. <a href="https://issues.apache.org/jira/browse/HDFS-2731">HDFS-2731</a> aims to improve the user experience of this deployment process by having the second Name Node automatically synchronize itself with the state of the first Name Node. This feature will make the process faster and less error prone.</p>
<h2 style="font-size: 14pt;">Further Reading</h2>
<p>Take a look at the <a href="https://ccp.cloudera.com/display/CDH4B1/CDH4+Beta+1+High+Availability+Guide">CDH4 docs</a> for detailed information on configuring the HA Name Node in CDH4.</p>
<p>Be on the lookout for an upcoming blog post from my colleague Todd Lipcon, which will go into greater detail about some of the specific challenges encountered while implementing the HA Name Node feature, and how these issues were overcome.</p>
<h2 style="font-size: 14pt;">Acknowledgments</h2>
<p>This work has been a community effort from the start, and represents the work of many contributors. Both the architecture and implementation were the collaborative effort of many. In particular, this work would not have been possible without contributions from Todd Lipcon, Eli Collins, Uma Maheswara Rao G, Bikas Saha, Suresh Srinivas, Jitendra Nath Pandey, Hari Mankude, Brandon Li, Sanjay Radia, Mingjie Lai, and Gregory Chanan. Also thanks to Dhruba Borthakur and Konstantin Shvachko for helpful design discussions and recommendations on testing. Thanks also to Stephen Chu, Wing Yew Poon, and Patrick Ramsey for their help in testing the HA Name Node.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>January 2012 Bay Area HBase User Group meetup summary + HBaseCon announcement</title>
		<link>http://www.cloudera.com/blog/2012/01/january-2012-bay-area-hbase-user-group-meetup-summary/</link>
		<comments>http://www.cloudera.com/blog/2012/01/january-2012-bay-area-hbase-user-group-meetup-summary/#comments</comments>
		<pubDate>Wed, 25 Jan 2012 20:30:52 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase Meetup]]></category>
		<category><![CDATA[HBase User Group]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10748</guid>
		<description><![CDATA[More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California.  Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments. One special item of note: [...]]]></description>
			<content:encoded><![CDATA[<p>More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California.  Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about <a title="Apache HBase home page" href="http://hbase.apache.org" target="_blank">Apache HBase</a> operations and optimizations, gleaned from their experience running HBase in production environments.</p>
<p>One special item of note:<strong> Michael Stack</strong> announced <a title="HBaseCon home page" href="http://www.hbasecon.com" target="_blank">HBaseCon 2012</a>, taking place this spring in the Bay Area.  This inaugural conference will focus on the growth and education of the HBase community.  While details of the event are not yet published, <strong>the call for speakers is currently open</strong>.  Submit your abstract <a title="HBaseCon call for speakers" href="http://www.hbasecon.com/call-for-speakers/" target="_blank">here</a>.</p>
<p>Many of the talks focused on HBase operations.  Here&#8217;s a summary of those presentations:</p>
<p><strong>Aravind Gottipati</strong> discussed the HBase deployments at StumbleUpon, reflecting on hardware, requirements, configuration, and monitoring tools. Aravind also pointed out some operational challenges StumbleUpon has faced, and suggested some improvements for future HBase versions.  [<a title="Ops @SU" href="http://files.meetup.com/1350427/ebay_pres.zip" target="_blank">slides</a>]</p>
<p>Next,<strong> Paul Tuckfield</strong> presented on HBase operations at Facebook. He shared interesting facts about their deployment, such as how their clusters span multiple racks to avoid network uplinks as a single point of failure, and how their clusters are as slow as their slowest region server.  [<a title="HBase Operations at Facebook" href="http://files.meetup.com/1350427/hbase-ebay-preso-2.pptx" target="_blank">slides</a>]</p>
<p>eBay&#8217;s <strong>Swati Agarwal</strong> and <strong>Thomas Pan</strong> gave a talk on eBay&#8217;s HBase deployments, sharing many statistics about their pre-production deployment, and discussed their need for well-distributed keys and the impact on their rowkey schema. They also talked about their HBase-related challenges, including a need for more stability and how upgrades incur significant downtime.  [<a title="HBase Operations" href="http://files.meetup.com/1350427/EBAY-HBase-Ops.pptx" target="_blank">slides</a>]</p>
<p>By now, the meeting was running a bit behind schedule, so <strong>J.D. Cryans</strong> gave a quick presentation about some experiments he did at StumbleUpon involving different caching configurations and datasets. He showed his numbers in a couple of different runs based on a snapshot of the upcoming CDH3u3 release from Cloudera, which is currently in production at StumbleUpon.  The runs were with no block cache, short-circuited reads, and 100% block cache. The main takeaway was that it is very important to have a good understanding of how much data needs to be read for your specific use case, and how this data fits into HBase.  [<a title="Practical Caching" href="http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf" target="_blank">slides</a>]</p>
<p>In addition to the above talks, <strong>Tomer Shiran</strong> from MapR gave an overview of MapR&#8217;s product, and <strong>Mikhail Bautin</strong> from Facebook concluded the meetup with some slides about the various optimizations that Facebook has contributed back to the HBase community in the area of scanner performance.</p>
<p>Slides for all presentations are available <a title="January 19, 2012 Bay Area HBase User Group meetup slides" href="http://www.meetup.com/hbaseusergroup/files/" target="_blank">here</a>, and the link to the meetup web page is <a title="January 2012 Bay Area HBase User Group meetup" href="http://www.meetup.com/hbaseusergroup/events/46702842/" target="_blank">here</a>.</p>
<p>Thanks to eBay for inviting the HBase User Group to their building, and providing the free pizza and beer.  See you at <a title="HBaseCon home page" href="http://www.hbasecon.com" target="_blank">HBaseCon 2012</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/01/january-2012-bay-area-hbase-user-group-meetup-summary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011 Videos and Slides Available</title>
		<link>http://www.cloudera.com/blog/2012/01/hadoop-world-2011-videos-and-slides-available/</link>
		<comments>http://www.cloudera.com/blog/2012/01/hadoop-world-2011-videos-and-slides-available/#comments</comments>
		<pubDate>Wed, 18 Jan 2012 13:00:31 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop slides]]></category>
		<category><![CDATA[hadoop videos]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10644</guid>
		<description><![CDATA[Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO, Mike Olson, summarizes Hadoop World 2011 in these final remarks. [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies, took place last November in New York City. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO Mike Olson summarizes Hadoop World 2011 in these <a href="http://www.cloudera.com/blog/2011/11/hadoop-world-2011-final-remarks/">final remarks</a>.</p>
<p>Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the <a href="http://www.hadoopworld.com/agenda/" target="_blank">Hadoop World presentation slides and videos</a> for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.</p>
<p>I’d like to take this opportunity to again thank all who participated in making Hadoop World 2011 a success: sponsors, attendees, the Sheraton New York Hotel &amp; Towers and the Hadoop World production team.</p>
<h2 style="font-size:14pt;color:#243543">Hadoop World Resource Links</h2>
<p><a href="http://www.hadoopworld.com/agenda/" target="_blank"><span style="color:#505050;font-weight:bold">Hadoop World Agenda</span></a>: <a href="http://www.hadoopworld.com/agenda/" target="_blank">http://www.hadoopworld.com/agenda/</a></p>
<ul style="padding-left:20px">
<li>The Hadoop World Agenda has both <a href="http://www.hadoopworld.com/agenda/">Video</a> and <a href="http://www.hadoopworld.com/agenda/">Slide</a> (PPT) links for each presentation, located in the presentation’s designated time slot.</li>
<li>Place your cursor over a time slot to learn more about the presentation’s content.</li>
</ul>
<p><a href="http://www.cloudera.com/resources/Hadoop+World/"><span style="color:#505050;font-weight:bold">Hadoop World Resources on Cloudera.com</span></a>: <a href="http://www.cloudera.com/resources/Hadoop+World/">http://www.cloudera.com/resources/Hadoop+World/</a></p>
<ul style="padding-left:20px">
<li>Each Hadoop World presentation video and slide deck is listed in the Resources section of the Cloudera web site.</li>
<li>Scroll the list to find a presentation of interest to you.</li>
</ul>
<h2 style="font-size:14pt;color:#243543">Notes</h2>
<ul style="padding-left:20px">
<li>Larry Feinsmith’s keynote presentation will not be listed, in accordance with JPMorgan Chase’s wishes.</li>
<li>The session “Life in Hadoop Ops – Tales From the Trenches” does not have a corresponding slide deck, as this session was a free-flowing panel discussion.</li>
<li>Hadoop World 2012 will take place in the fall. Specific location and dates will be disclosed in the future.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/01/hadoop-world-2011-videos-and-slides-available/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache Sqoop: Highlights of Sqoop 2</title>
		<link>http://www.cloudera.com/blog/2012/01/apache-sqoop-highlights-of-sqoop-2/</link>
		<comments>http://www.cloudera.com/blog/2012/01/apache-sqoop-highlights-of-sqoop-2/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 13:00:53 +0000</pubDate>
		<dc:creator>Kathleen Ting</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10319</guid>
		<description><![CDATA[This blog post was originally published on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project [...]]]></description>
			<content:encoded><![CDATA[<p><em>This blog post was originally published on the Apache Blog: <a href="https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop" target="_blank">https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop</a></em></p>
<p><a href="http://incubator.apache.org/sqoop/">Apache Sqoop (incubating)</a> was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at <a href="http://incubator.apache.org/sqoop" target="_blank">http://incubator.apache.org/sqoop</a>.</p>
<p>The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate.</p>
<h2 style="font-size: 14pt; color: #243543;">What is Sqoop?</h2>
<p>As <a href="https://blogs.apache.org/sqoop/entry/apache_sqoop_overview" target="_blank">described in a previous blog post</a>, Sqoop is a bulk data transfer tool that allows easy import/export of data from structured datastores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision the data from an external system into HDFS, as well as populate tables in Hive and HBase. Similarly, Sqoop integrates with the workflow coordinator Apache Oozie (incubating), allowing you to schedule and automate import/export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to additional external systems.</p>
<p style="align:middle"><div id="attachment_10373" class="wp-caption middle" style="width: 405px"><a target="_blank" href="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop1arch.jpg"><img align="middle" class="size-full wp-image-10373" title="Sqoop 1.4.0-incubating Architecture" src="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop1arch.jpg" alt="Sqoop 1.4.0-incubating Architecture" width="395" height="456" /></a><p class="wp-caption-text">Figure 1: Sqoop 1.4.0-incubating Architecture</p></div></p>
<h2 style="font-size: 14pt; color: #243543;">Sqoop&#8217;s Challenges</h2>
<p>Sqoop has enjoyed enterprise adoption, and our experiences have exposed some recurring ease-of-use challenges, extensibility limitations, and security concerns that are difficult to address in the original design:</p>
<ul>
<li>Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors</li>
<li>Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don&#8217;t (e.g. direct MySQL connector can&#8217;t support sequence files)</li>
<li>There are security concerns with openly shared credentials</li>
<li>Because they require root privileges, local configuration and installation are not easy to manage</li>
<li>Debugging the map job is limited to turning on the verbose flag</li>
<li>Connectors are forced to follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable</li>
</ul>
<p>These challenges have motivated the design of Sqoop 2, which is the subject of this post. That said, Sqoop 2 is a work in progress whose design is subject to change.</p>
<p>Sqoop 2 will continue its strong support for command line interaction, while adding a web-based GUI that exposes a simple user interface. Using this interface, a user can walk through an import/export setup via UI cues that eliminate redundant options. Connectors are added to the application in one place, so users are not tasked with installing or configuring connectors in their own sandbox. These connectors expose their necessary options to the Sqoop framework, which then translates them to the UI. The UI is built on top of a REST API that can also be used by a command line client exposing similar functionality. The introduction of Admin and Operator roles in Sqoop 2 will restrict &#8216;create&#8217; access for Connections to Admins and &#8216;execute&#8217; access to Operators. This model will allow integration with platform security and restrict the end user view to only operations applicable to end users.</p>
<div id="attachment_10503" class="wp-caption alignnone" style="width: 582px"><a href="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop2archMeta3.jpg"><img class="size-full wp-image-10503" title="Sqoop 2 Architecture" src="https://www.cloudera.com/wp-content/uploads/2012/01/sqoop2archMeta3.jpg" alt="Sqoop 2 Architecture" width="572" height="436" /></a><p class="wp-caption-text">Figure 2: Sqoop 2 Architecture</p></div>
<h2 style="font-size: 14pt; color: #243543;">Ease of Use</h2>
<p>Whereas Sqoop requires client-side installation and configuration, Sqoop 2 will be installed and configured server-side. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a Command Line Interface (CLI) and browser and back-ended by a metadata repository. Moreover, Sqoop 2&#8217;s service-level integration with Hive and HBase will be on the server side. Oozie will manage Sqoop tasks through the REST API. This decouples Sqoop internals from Oozie, i.e. if you install a new Sqoop connector, you won&#8217;t need to install it in Oozie as well.</p>
<h2 style="font-size: 14pt; color: #243543;">Ease of Extension</h2>
<p>In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own vocabulary, e.g. Couchbase no longer needs to specify a table name, only to overload it as a backfill or dump operation.</p>
<p>Common functionality will be abstracted out of connectors, holding them responsible only for data transport. The reduce phase will implement common functionality, ensuring that connectors benefit from future development of functionality.</p>
<p>Sqoop 2&#8217;s interactive web-based UI will walk users through import/export setup, eliminating redundant steps and omitting incorrect options. Connectors will be added in one place, with the connectors exposing necessary options to the Sqoop framework. Thus, users will only need to provide information relevant to their use-case.</p>
<p>Because the user makes an explicit connector choice in Sqoop 2, connector selection will be less error-prone and more predictable. In the same way, the user will not need to be aware of the functionality of all connectors. As a result, connectors no longer need to provide downstream functionality, transformations, and integration with other systems. Hence, the connector developer no longer has the burden of understanding all the features that Sqoop supports.</p>
<h2 style="font-size: 14pt; color: #243543;">Security</h2>
<p>Currently, Sqoop operates as the user that runs the &#8216;sqoop&#8217; command. The security principal used by a Sqoop job is determined by what credentials the users have when they launch Sqoop. Going forward, Sqoop 2 will operate as a server-based application with support for securing access to external systems via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code generation, require direct access to Hive and HBase, or open up access to all clients to execute jobs.</p>
<p>Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass credentials, will be created once and then used many times for various import/export jobs. Connections will be created by the Admin and used by the Operator, thus preventing credential abuse by the end user. Furthermore, Connections can be restricted based on operation (import/export). Resources can be managed by limiting the total number of physical Connections open at one time and by providing an option to disable Connections.</p>
<h2 style="font-size: 14pt; color: #243543;">Summary</h2>
<p>As <a href="https://cwiki.apache.org/confluence/download/attachments/27361435/Sqoop2_wnotes.pdf?version=1&amp;modificationDate=1326152997641" target="_blank">detailed in this presentation</a>, Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having a web application run Sqoop, which allows Sqoop to be installed once and used from anywhere. In addition, having a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie. Also, introducing a reduce phase allows connectors to focus only on connectivity and ensures that Sqoop functionality is uniformly available for all connectors. This makes connectors easier to develop.</p>
<p>We encourage you to participate in and contribute to Sqoop 2&#8217;s <a href="https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2" target="_blank">Design</a> and Development <a href="https://issues.apache.org/jira/browse/SQOOP-365" target="_blank">(SQOOP-365)</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/01/apache-sqoop-highlights-of-sqoop-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

