<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; Chad Metcalf</title>
	<atom:link href="http://www.cloudera.com/blog/author/chad/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>What is in our Kitchen?</title>
		<link>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/</link>
		<comments>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/#comments</comments>
		<pubDate>Tue, 21 Sep 2010 06:22:09 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[kitchen]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4810</guid>
		<description><![CDATA[If there is one thing that chefs are proud of, it&#8217;s their kitchens. Whether cavernous top-of-the-line affairs or cramped New York apartments, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens.&#160; [...]]]></description>
			<content:encoded><![CDATA[<p>If there is one thing that chefs are proud of, it&#8217;s their kitchens. Whether <a href="http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/09/10/FDIP1F26JG.DTL">cavernous top-of-the-line affairs</a> or <a href="http://well.blogs.nytimes.com/2008/11/20/mark-bittmans-bad-kitchen/">cramped New York apartments</a>, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens.&#160;<span id="more-4810"></span> In an <a rel="nofollow" href="http://news.cnet.com/8301-30684_3-10309375-265.html">interview</a> with CNET, Google&#8217;s Hal Varian attributed Google&#8217;s success to the &#8220;kitchen&#8221; in which their products are developed:<a href="/wp-content/uploads/2010/09/StockPots_0673.jpg"><img class="size-medium wp-image-4827" align="right" src="/wp-content/uploads/2010/09/StockPots_0673-300x198.jpg" alt="" style="margin:15px;margin-right:0" width="300" height="198" /></a></p>
<blockquote><p>&#8220;I also think we have a better kitchen. We&#8217;ve put a lot of effort into  building a really powerful infrastructure at Google, the development  environment at Google is very good.&#8221;</p></blockquote>
<p>The goal of the Kitchen team at Cloudera is to create a powerful  infrastructure for developing, building, testing,  shipping, and supporting our software. Kitchen contributes its expertise to every product Cloudera builds, while also building  out new infrastructure and tools to facilitate future development. Everyone on the Kitchen team writes software. </p>
<p>While the Kitchen team&#8217;s culture was initially inspired by Google&#8217;s infrastructure, we agree with Piaw Na who recently <a href="http://piaw.blogspot.com/2010/04/infrastructure.html">provided some words of caution</a> for companies looking to follow this example:</p>
<blockquote><p> &#8220;In short, I think startups have to be very careful about building generic infrastructure just because that&#8217;s the way Google did things.&#8221;</p></blockquote>
<p>The Kitchen team builds the infrastructure that is needed to solve our company&#8217;s problems. For example, our build system must be capable of coalescing many disparate open source projects into a unified platform. If there is an existing open source tool or framework that meets our needs, we use it, improve it, and contribute our changes back to the project rather than &#8220;rolling our own&#8221;.</p>
<p>We use many of the open source tools you might expect, such as <a href="http://hudson-ci.org/">Hudson</a> for continuous integration. Our Hudson instance manages tens of hosts running over seventy projects:</p>
<ul>
<li>Unit tests running on every commit, across multiple platforms and flavors of Java or Python</li>
<li>Hadoop clusters running on EC2 using <a href="http://incubator.apache.org/projects/whirr.html">Apache Whirr</a></li>
<li>Various code improvement tools such as <a href="http://www.jcarder.org/">jcarder</a>, <a href="http://cobertura.sourceforge.net/">Cobertura</a>, <a href="http://www.atlassian.com/software/clover/">Clover</a>, <a href="http://findbugs.sourceforge.net/">FindBugs</a>, <a href="http://checkstyle.sourceforge.net/">CheckStyle</a> and others</li>
</ul>
<p>If a tool does not exist, the Kitchen team tries to leverage existing frameworks to build what is required. For example, our automated build and release system, which is at the heart of the <a href="../blog/2010/08/cdh3b2-release-recap/">Cloudera Distribution for Hadoop (CDH)</a> platform, is built on top of <a href="http://code.google.com/p/boto/">boto</a>. From a single git repository, we use <a href="http://github.com/cloudera/crepo">crepo</a> (another Kitchen project) to check out the latest source of each project within CDH. Then we build source artifacts for all of the projects, which get uploaded to S3. We then spin up an EC2 cluster to build everything for all supported CentOS, Ubuntu, and Debian releases, on both 32-bit and 64-bit architectures. The resulting packages are stored back in S3, and then staged to a fresh EC2 instance of <a href="http://archive.cloudera.com">archive.cloudera.com</a> for testing. Additional EC2 instances follow and run end-to-end package tests for each package that was built. We turn the crank nightly, not just for each release.</p>
<p>The Kitchen team is in the process of building a <a href="http://culturedcode.com/status/">status</a>, <a href="http://markcipolla.com/hudson-global-dashboard/">dashboard</a>, <a href="http://twitpic.com/fhjlw">radiator</a>, <a href="http://www.panic.com/blog/2010/03/the-panic-status-board/">single-pane-of-glass</a> to <a href="http://www.samsung.com/us/consumer/professional-displays/professional-displays/lcd/LH46MRTLBC/ZA/index.idx?pagetype=prd_detail">prominently display</a> Hudson&#8217;s status, nightly builds, JIRA stats, CDH download statistics, and many other metrics we use daily.</p>
<p>No software company is complete without a cluster or two. Kitchen maintains a development cluster, a long-lived CDH cluster, a security-enabled CDH cluster, and a &#8220;dog-food&#8221; cluster. We&#8217;re currently building out a <a title="Eucalyptus" href="http://www.eucalyptus.com/">Eucalyptus</a> cluster so we can also run our build and test infrastructure in house. We have a large scale cluster in the works and we are busy building out our infrastructure to accommodate it.&#160; We use <a href="https://fedorahosted.org/cobbler/">Cobbler</a>, run <a href="http://ganglia.sourceforge.net/">Ganglia</a> (bias alert, we employ one of the original authors), and debate <a href="http://www.opscode.com/chef">Chef</a> and <a href="http://www.puppetlabs.com/">Puppet</a>.</p>
<p>Our Kitchen team is growing. If this sounds like a team you would like to be a part of, get in touch with me on <a title="Cloudera Twitter" href="http://twitter.com/metcalfc">twitter</a> or IRC (#cloudera on freenode.net) or <a href="http://www.cloudera.com/company/careers/">apply directly</a>. Stay tuned for more blog posts about what&#8217;s cooking in our Kitchen.</p>
<p><em>Image courtesy of Chef Olive at <a href="http://kitchenonfire.com/">Kitchen On Fire</a> </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CDH2 is released</title>
		<link>http://www.cloudera.com/blog/2010/03/cdh2-is-released/</link>
		<comments>http://www.cloudera.com/blog/2010/03/cdh2-is-released/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 14:59:22 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=2889</guid>
		<description><![CDATA[We&#8217;re proud to announce that Cloudera&#8217;s Distribution for Hadoop Version 2 (CDH2) is officially released. We&#8217;ve come a long way to get to a production quality release. At the beginning of September we announced the first beta of CDH2. After 6 months of additional testing we announced a release candidate. The release candidate spent over [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re proud to announce that <a href="http://archive.cloudera.com/docs/cdh.html">Cloudera&#8217;s Distribution for Hadoop Version 2</a> (CDH2) is officially released.</p>
<p>We&#8217;ve come a long way to get to a production quality release. At the beginning of September we announced <a href="http://www.cloudera.com/blog/2009/09/cdh2-clouderas-distribution-for-hadoop-2/">the first beta</a> of CDH2. After 6 months of additional testing we <a href="../blog/2010/02/cdh2-testing-heading-towards-stable/">announced a release candidate</a>. The release candidate spent over a month hardening in Cloudera&#8217;s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use &#8211; we are pleased to recommend it to all our production users.</p>
<p>CDH2 is based on Apache Hadoop 0.20 &#8211; a release that has been available for almost a year. During this time, the Apache Hadoop community has produced hundreds of bug fixes, improvements and features. Cloudera is proud to have contributed many of these and&#160;incorporated them into CDH2. &#160;For more information, please review the following resources:</p>
<ul>
<li><a href="http://archive.cloudera.com/cdh/2/hadoop-0.20.1+169.68.releasenotes.html">The release notes</a> for CDH2. All bug fixes and improvements are covered in detail.</li>
<li>For new features you&#8217;ll want to check out CDH3, <a href="../blog/2010/03/cdh3-beta1-now-available/">which is now in beta</a>.</li>
<li>For how to get started, please have a look at <a href="http://archive.cloudera.com/docs/_choosing_a_version.html">CDH documentation</a>, which includes a helpful bit on determining which version is right for you.</li>
</ul>
<p>Hadoop is a community effort. We&#8217;d like to thank everyone who contributes to Hadoop, especially the substantial contribution made by the big team at Yahoo! and all the other users who have contributed to this release. We appreciate the feedback on <a href="http://getsatisfaction.com/cloudera/products/cloudera_cloudera_s_distribution_for_hadoop">Get Satisfaction</a>, <a href="http://twitter.com/cloudera">twitter</a> and <a href="http://webchat.freenode.net/?channels=cloudera">IRC</a> (#cloudera on freenode.net). Keep it coming, and thanks for using Cloudera&#8217;s Distribution for Hadoop!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/03/cdh2-is-released/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>CDH2: &#8220;Testing&#8221; Heading Towards &#8220;Stable&#8221;</title>
		<link>http://www.cloudera.com/blog/2010/02/cdh2-testing-heading-towards-stable/</link>
		<comments>http://www.cloudera.com/blog/2010/02/cdh2-testing-heading-towards-stable/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 01:19:39 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=2601</guid>
		<description><![CDATA[In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same &#8220;soak time&#8221; as our [...]]]></description>
			<content:encoded><![CDATA[<p>In September 2009, we announced <a href="http://www.cloudera.com/blog/2009/09/cdh2-clouderas-distribution-for-hadoop-2/">the first release of CDH2</a>, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same &#8220;soak time&#8221; as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It&#8217;s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I&#8217;m pleased to say we&#8217;ve put a lot of polish into this release.<br />
<span id="more-2601"></span>CDH2 has reached the point where we are preparing to promote it to stable. One might even call this a &#8220;release candidate&#8221;. Cloudera engineers have been hard at work getting patches into CDH2 to make it the best 0.20 release available. Here are some of the highlights:</p>
<ul>
<li>Hadoop 0.20.1 &#8211; <a href="http://archive.cloudera.com/cdh/testing/hadoop-0.20.1+169.56.CHANGES.txt">73 more patches of extra Hadoop&#8217;y goodness (that is 225 total patches over vanilla 0.20.1)</a></li>
<li>Lots of libhdfs and fusefs love resulting in stability and usability improvements currently in use at scale</li>
<li>HDFS fixes that improve the write pipeline</li>
<li>Lots of general stability fixes for Hadoop</li>
<li>Pig 0.5.0 release &#8211; <a href="http://archive.cloudera.com/cdh/testing/pig-0.5.0+11.1.CHANGES.txt">Working out of the box with our Hadoop 0.18 and 0.20 builds</a></li>
<li>Hive 0.4.1 release &#8211; <a href="http://archive.cloudera.com/cdh/testing/hive-0.4.1+14.4.CHANGES.txt">Works with both of our Hadoop 0.18 and 0.20</a></li>
<li>HBase 0.20.3 &#8211; We worked with the HBase team to bring the latest rpms to a <a href="http://archive.cloudera.com/redhat/cdh/cloudera-contrib.repo">yum repo</a> near you</li>
</ul>
<p>We are excited about our CDH2 release. It&#8217;s running at scale at some really great companies. We are looking forward to promoting it to stable shortly and moving on to the next big thing, CDH3. I&#8217;ll let you know as soon as this happens. When CDH2 becomes stable, it also means that CDH3 is ready to start its journey through testing. Stay tuned for more details as to what CDH3 will encompass; I&#8217;ll just say that I&#8217;m pretty excited about it.</p>
<p>You can subscribe to our CDH mailing list (<a href="mailto:cdh-announce-subscribe@cloudera.com">cdh-announce-subscribe@cloudera.com</a>) to get information about new releases as we push them out. Check out the new release, and remember to let us know what you think!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/02/cdh2-testing-heading-towards-stable/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CDH2: Testing Release now with Pig, Hive, and HBase</title>
		<link>http://www.cloudera.com/blog/2009/09/cdh2-testing-release-now-with-pig-hive-and-hbase/</link>
		<comments>http://www.cloudera.com/blog/2009/09/cdh2-testing-release-now-with-pig-hive-and-hbase/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 14:10:37 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=1346</guid>
		<description><![CDATA[At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same &#8220;soak time&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>At the beginning of September, we announced the <a href="http://www.cloudera.com/blog/2009/09/10/cdh2-clouderas-distribution-for-hadoop-2/">first release of CDH2</a>, our current <tt>testing</tt> repository. Packages in our <tt>testing</tt> repository are recommended for people who want more features and are willing to upgrade as bugs are worked out.  Our <tt>testing</tt> packages pass unit and functional tests but will not have the same &#8220;soak time&#8221; as our <tt>stable</tt> packages.  A <tt>testing</tt> release represents a work in progress that will eventually be promoted to <tt>stable</tt>.</p>
<p>We plan on pushing new packages into the <tt>testing</tt> repository every 3 to 6 weeks.&#160; And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:</p>
<ul>
<li><strong>Hadoop 0.20.1</strong> &#8211; Bumps the hadoop package up to the <a href="http://hadoop.apache.org/common/docs/r0.20.1/changes.html">0.20.1 release</a> and adds 133 patches worth of <a href="http://archive.cloudera.com/cdh/testing/hadoop-0.20.1+120.CHANGES.txt">extra goodness</a></li>
<li><strong>Alternatives for Hadoop</strong> &#8211; Now you can have both 0.18 and 0.20 installed and use the alternatives system to pick a default</li>
<li><strong>Pig 0.5.0 pre-release</strong> &#8211; We included some magic to get things working out of the box with both 0.18 and 0.20</li>
<li><strong>Hive 0.4.0 pre-release</strong> &#8211; Integrated with the alternatives setup and works out of the box with 0.18 and 0.20</li>
<li><strong>HBase 0.20</strong> &#8211; We worked with the HBase team to bring rpms to a <a href="http://archive.cloudera.com/redhat/cdh/cloudera-contrib.repo">yum repo</a> near you</li>
</ul>
<p>A project as large as Hadoop is a communal effort. Cloudera is proud to be part of that community and hopes that our products and services make Hadoop even more accessible to a wider audience. We&#8217;d like to thank everyone who contributes to Hadoop, especially the Yahoo! team for all of their hard work on getting 0.20.1 released, the developers at Facebook, and those working on Pig, Hive and HBase.</p>
<p>We are just getting the ball rolling here. You can subscribe to our CDH mailing list (<a href="mailto:cdh-announce-subscribe@cloudera.com" target="_blank">cdh-announce-subscribe@cloudera.com</a>) to get information about new releases as we push them out. Check out the new release, and remember to let us know what you think!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2009/09/cdh2-testing-release-now-with-pig-hive-and-hbase/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>HBase Available in CDH2</title>
		<link>http://www.cloudera.com/blog/2009/09/hbase-available-in-cdh2/</link>
		<comments>http://www.cloudera.com/blog/2009/09/hbase-available-in-cdh2/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 14:00:16 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=1396</guid>
		<description><![CDATA[One of the more common requests we receive from the community is to package HBase with Cloudera&#8217;s Distribution for Hadoop. Lately, I&#8217;ve been doing a lot of work on making Cloudera&#8217;s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We&#8217;re pretty excited about this, [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><em>One of the more common requests we receive from the community is to package HBase with Cloudera&#8217;s Distribution for Hadoop. Lately, I&#8217;ve been doing a lot of work on making Cloudera&#8217;s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We&#8217;re pretty excited about this, and we&#8217;re looking forward to your feedback. A big thanks to <a href="mailto:apurtell@apache.org">Andrew Purtell</a>, a Senior Architect at TrendMicro and HBase Contributor, for leading this packaging project and providing this guest blog post. -Chad Metcalf</em></p></blockquote>
<p><strong>What is HBase?</strong><br />
HBase is an open-source, distributed, column-oriented store modeled after Google&#8217;s Bigtable large scale structured data storage system. You can read Google&#8217;s <a href="http://labs.google.com/papers/bigtable.html">Bigtable paper here</a>.</p>
<blockquote><p>&#8220;Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from back end bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.&#8221;</p></blockquote>
<p>HBase extends the publicly shared aspects of the Bigtable architecture and design as described in the Bigtable OSDI&#8217;06 paper with community developed improvements and enhancements:</p>
<ul>
<li>Convenient base classes for backing Hadoop MapReduce jobs with HBase tables</li>
<li>Query predicate push down via server side scan and get filters</li>
<li>Optimizations for real time queries</li>
<li>A high performance Thrift gateway</li>
<li>A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options</li>
<li><a href="http://cascading.org/">Cascading</a> source and sink modules</li>
<li>A JRuby-based shell</li>
<li>Support for exporting metrics via the Hadoop metrics subsystem to files or <a href="http://ganglia.info">Ganglia</a>; or via JMX</li>
</ul>
<p>This most recent version of HBase, 0.20.0, has greatly improved on its predecessors:</p>
<ul>
<li>No HBase single point of failure</li>
<li>Rolling restart for configuration changes and minor upgrades</li>
<li>Generally, one order of magnitude performance improvement for every class of operation</li>
<li>Random access performance on par with open source relational databases such as MySQL</li>
</ul>
<p>We use <a href="http://hadoop.apache.org/zookeeper/">ZooKeeper</a> as a substitute for Google&#8217;s &#8220;Chubby&#8221; to enable hot failover should a Master node fail. We do other interesting things with ZooKeeper as well, and our roadmap includes increasingly distributed function via emergent behaviors with no central point of control.</p>
<p>Unfortunately the HDFS NameNode is still a single point of failure in Hadoop 0.20, and HBase depends on HDFS. For more information on mitigating this risk, see this <a href="http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/">Cloudera blog post on NameNode High Availability</a>.</p>
<p>For more detail, please visit the HBase wiki. Links and references to additional information appear below.</p>
<p><strong>Why Would You Need HBase?</strong></p>
<p>Use HBase when you need fault-tolerant, random, real time read/write access to data stored in HDFS. Use HBase when you need strong data consistency. HBase provides Bigtable-like capabilities on top of Hadoop. HBase&#8217;s goal is the hosting of very large tables &#8212; billions of rows times millions of columns &#8212; atop clusters of commodity hardware.</p>
<p>HBase is an answer for effectively managing terabytes of mutating structured storage on the Hadoop platform at reasonable cost. HBase manages structured data on top of HDFS for you, efficiently using the underlying replicated storage as backing store to gain the benefits of its fault tolerance and data availability and locality. HBase hides the gory details of how one would provide random real time read/write access on top of a filesystem tuned for MapReduce jobs that process terabytes of data, where file block sizes are huge, and where a file can be open for reading or for writing, but not both.</p>
<p>At large scale, traditional relational databases (RDBMSes) fall down. We are considering here big queries, typically range or table scans; and big tables, typically terabytes or petabytes. Such workloads generally exceed the ability of these systems to process them in a timely, cost-effective manner. Managing very large storage with them alone is an expensive proposition. Processing that data incurs other costs &#8212; in time, in productivity. Waits and deadlocks rise nonlinearly with transaction size and concurrency: with the square of concurrency and the third power of the transaction size. In contrast, HBase table scans run in linear time, and row lookup or update times are logarithmic with respect to the size of the table.</p>
<p>Features of the relational model get in the way as data volumes scale up and analytics get more complex and interesting. Expensive commercial RDBMS systems can deliver large storage capacities and they can execute some queries over all that data in reasonable time, but only at high dollar cost. Using open source RDBMSes at scale simply requires giving up all relational features (e.g. secondary indexes) for performance. Sharding is a brittle and complex non-solution to these scalability problems. It is something done when there are no better alternatives.</p>
<p>What if we trade relational features for performance since using RDBMSes at scale often requires giving them up anyway? The first casualty of sharding is the normalized schema. We can avoid waits and deadlocks by restricting transaction scope to groups of row mutations only. What if we generalize the data model? Then we can provide transparent horizontal scalability without architectural limits &#8212; generic &#8220;self-sharding&#8221;. We can also provide fault tolerance and data availability by way of the same mechanisms which allow this scalability.</p>
<p>Bigtable and HBase are able to avoid the scalability issues that trouble RDBMSes by eschewing the relational data model. HBase provides something else. It is like a large distributed map. It is a row indexed list of tags and data, values of variable length, bounded by a configuration setting.&#160; It is a column based data store. Keys are arbitrary byte data and are multidimensional: row, column, optional column qualifier, and timestamp. Columns in HBase are multiversioned. You can store more than one version of a value in a particular row and column, and the timestamp provides an extra dimension of indexability.&#160; This can be a particularly useful feature: Multiversioning and timestamps avoid edit conflicts caused by concurrent decoupled processes. Rows are stored in byte lexicographic sorted order.</p>
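<p>The data model described above can be sketched as a small in-memory toy. This is only an illustration under simplifying assumptions, not HBase code: the class name <tt>TinyTable</tt>, its methods, and the <tt>max_versions</tt> parameter are hypothetical, and the column family and qualifier are folded into a single column name.</p>

```python
import bisect


class TinyTable:
    """Toy model of a sorted, multiversioned map (hypothetical, not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._keys = []   # (row, column) pairs kept in byte lexicographic order
        self._cells = {}  # (row, column) -> list of (timestamp, value), newest first

    def put(self, row, column, timestamp, value):
        key = (row, column)
        if key not in self._cells:
            bisect.insort(self._keys, key)
            self._cells[key] = []
        versions = self._cells[key]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self._cells.get((row, column), [])[:versions]

    def scan(self, start_row, stop_row):
        """Yield (row, column, newest cell) for start_row <= row < stop_row."""
        lo = bisect.bisect_left(self._keys, (start_row, b""))
        for row, column in self._keys[lo:]:
            if row >= stop_row:
                break
            yield row, column, self._cells[(row, column)][0]


table = TinyTable(max_versions=2)
table.put(b"row-b", b"info:name", 1, b"old")
table.put(b"row-b", b"info:name", 2, b"new")
table.put(b"row-a", b"info:name", 1, b"first")
table.put(b"row-c", b"info:name", 1, b"last")

newest_two = table.get(b"row-b", b"info:name", versions=2)
scanned_rows = [row for row, _, _ in table.scan(b"row-a", b"row-c")]
```

<p>Because the keys are kept sorted, a range scan is a binary search for the start row followed by a sequential walk, which is the property the real system relies on.</p>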
<p>Lexicographically similar values are packed adjacent to one another into blocks in the column stores and are retrieved efficiently together. Column stores may optionally be compressed on disk. Tables are dynamically split into regions. Regions are hosted on a number of region servers. Adding capacity to an HBase cluster is a simple and transparent process: Provision another region server, configure it, and start it. Typically, HBase region servers are co-deployed with Hadoop HDFS DataNodes, so the underlying storage capacity grows as well. As regions grow, they are split and distributed evenly among the storage cluster to level load. Splits are almost instantaneous. A cluster master process manages region assignment for fast recovery and fine grained load balancing. The Master role fails over to spares as necessary for fault tolerance. The Master rapidly redeploys regions from failed nodes to others. Because the stores are in HDFS, all region servers in the cluster have immediate access to the replicated table data.</p>
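<p>The idea of splitting regions and spreading them across servers can be illustrated with a rough sketch. The halving rule and round-robin assignment below are assumptions chosen for simplicity; they are not the policy HBase actually uses, and the function names are hypothetical.</p>

```python
def split_region(rows, max_rows):
    """Recursively halve a sorted list of row keys until each piece has at most max_rows keys."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    return split_region(rows[:mid], max_rows) + split_region(rows[mid:], max_rows)


def assign_round_robin(regions, servers):
    """Map each server name to the (first key, last key) ranges of the regions it hosts."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append((region[0], region[-1]))
    return assignment


rows = [b"row-%03d" % i for i in range(10)]
regions = split_region(rows, max_rows=3)
assignment = assign_round_robin(regions, ["rs1", "rs2"])
```

<p>Each region covers a contiguous key range, so a client only needs the region boundaries to find the server responsible for a given row.</p>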
<p><strong>When Would You Not Want To Use HBase?</strong></p>
<p>When your data access patterns are largely sequential over immutable data. Use plain MapReduce.</p>
<p>When your data is not large.</p>
<p>When the large overhead of the extract-transform-load (ETL) of your data into alternatives such as Hive is not an issue because you are purely operating on the data in a batching manner and can afford to wait, and some feature of the alternative is simply a must-have.</p>
<p>If you need to make a different trade off between consistency and availability. HBase is a strongly consistent system. HBase regions can be temporarily unavailable during fault recovery. The HBase client API will suspend pending reads and writes until the regions come back online. Perhaps for your use case blocking of any kind for any length of time is intolerable.</p>
<p>If you just can&#8217;t live without SQL.</p>
<p>When you really do require normalized schemas or a relational query engine.</p>
<p>This last point deserves some additional detail. HBase supports random, real time read/write access to your data by way of a single index, but secondary indexes can be emulated by managing additional index tables at the application level. To achieve fast query response times under real world conditions, &#8220;Web 2.0&#8221; applications often denormalize and replicate and synchronize values in multiple tables anyway. Bigtable&#8217;s simpler data model is sufficient for many such use cases and furthermore does not support constructs that can get you into trouble. What you do get is:</p>
<ul>
<li>Fast (logarithmic time) lookup using row key, optional column and column qualifiers for result set filtering and optional timestamp;</li>
<li>Full table scans;</li>
<li>Range scans, with optional timestamp;</li>
<li>Queries for most recent version or N versions;</li>
<li>Partial key lookups: When combined with compound keys, these have the same properties as leading left edge indexes with the benefit of a distributed index;</li>
<li>Server side filters, a form of query push down.</li>
</ul>
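<p>Two of the ideas above, partial key lookups over compound keys and secondary indexes emulated with an extra index table, can be sketched with plain sorted lists. This is a hedged illustration only: the key layout (<tt>user/date</tt>), the sample data, and the helper <tt>prefix_scan</tt> are all made up for the example and are not part of HBase.</p>

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix from a byte-sorted list of keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches


# Primary table: compound row key "<user>/<date>" -> event (illustrative data).
primary = {
    b"alice/2009-09-01": b"login",
    b"alice/2009-09-03": b"upload",
    b"bob/2009-09-02": b"login",
}
primary_keys = sorted(primary)

# Application-managed index table: row key "<event>/<primary row key>".
index_keys = sorted(event + b"/" + key for key, event in primary.items())

# Partial key lookup on the leading part of the compound key.
alice_rows = prefix_scan(primary_keys, b"alice/")

# Secondary index lookup: find primary row keys by event type.
login_rows = [k.split(b"/", 1)[1] for k in prefix_scan(index_keys, b"login/")]
```

<p>The application is responsible for keeping the index table in step with the primary table, which is the trade-off for not having built-in secondary indexes.</p>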
<p>And, while HBase does not support joining data from multiple tables, you can implement your data workflows using <a href="http://cascading.org/">Cascading</a> or a similar higher level construct on top of HBase to recover some relational algebraic operators; or you can simply do &#8220;insert time joins&#8221; &#8212; denormalization, view materialization, and so on.</p>
<p><strong>How Do You Try Out HBase?</strong></p>
<p>Installing and configuring HBase on CDH2 is fast and easy.</p>
<p>Before we begin, note that HBase requires an available ZooKeeper ensemble.&#160; The CDH2 packages for HBase include a ZooKeeper package. You can install and configure it and then point HBase to it, or you can let HBase create and manage a private ZooKeeper ensemble using the bundled ZooKeeper jar. For new users who do not have ZooKeeper already set up, it is easiest to just let HBase take care of it. The instructions below assume this is the case.</p>
<p>Also, let&#8217;s consider what is a reasonable test deployment.</p>
<p>Google aims for ~100 regions per region server, and each region is kept to around 200 MB. Large RAM per node and reasonable region counts and sizing mean many tables can be cached and served entirely out of RAM. Bigtable is big because there are hundreds if not thousands of nodes participating. The performance numbers in the Bigtable paper are impressive for these reasons. It is cheap (for Google) because they build their own hardware and buy components in bulk.</p>
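<p>A quick back-of-the-envelope calculation shows what those figures imply per server. The two inputs come from the paragraph above; rounding 1000 MB to 1 GB is a simplification.</p>

```shell
#!/usr/bin/env bash
regions_per_server=100   # target region count per region server (from the text)
region_size_mb=200       # approximate size of one region in MB (from the text)

# Data served by a single region server.
data_per_server_mb=$((regions_per_server * region_size_mb))
data_per_server_gb=$((data_per_server_mb / 1000))

echo "about ${data_per_server_mb} MB (~${data_per_server_gb} GB) per region server"
```

<p>At roughly 20 GB of data per server, a node with a lot of RAM can keep a large share of its hot data in memory.</p>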
<p>HBase operates in a different world. Many evaluators or new users expect a lot more for a lot less. They don&#8217;t build their own hardware (but could, and maybe should) and don&#8217;t make bulk purchases. Rather, test deployments of 3 or 4 standard servers are common, and often the hardware is underpowered for the attempted load. Sometimes even smaller deployments or virtual machines are considered, but those do not make sense except as programmer tools. While a &#8220;pseudo-distributed&#8221; configuration for HBase is included in the distribution, a single server deployment is suitable only for very limited testing. We recommend that three servers be considered a minimum test deployment. These can be Amazon EC2 instances, but if so, use c1.xlarge instances.</p>
<p>A reasonable physical server configuration could be:</p>
<ul>
<li>Dual quad core CPU</li>
<li>8 GB RAM or more (4 GB is passable, but constrain MapReduce to only 1 concurrent mapper and reducer per node)</li>
<li>4 x 250 GB data disks attached as JBOD (for the DataNode process)</li>
</ul>
<p>The reason for the resource demand is simple: The typical deployment combines HDFS, MapReduce, and HBase over all servers uniformly. A good rule of thumb here is that each Hadoop and HBase daemon requires 1 available CPU core and 1 GB of heap. Each mapper or reducer task requires 1 CPU core and 200 MB of heap by default, more if asked for. For HBase to achieve best performance, the region servers must be given sufficient heap to buffer writes and cache blocks for repeated reads. The higher the write load or the larger the working set, the more useful a larger heap allocation becomes. Configuring a 2 GB or 4 GB heap for HBase region servers is not uncommon.</p>
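<p>The rule of thumb can be turned into a rough per-node budget. The sketch below assumes the 8-core, 8 GB server described above and four daemons per node (DataNode, TaskTracker, region server, and ZooKeeper); that daemon list is an assumption, and the calculation ignores operating system overhead and any larger region server heap.</p>

```shell
#!/usr/bin/env bash
cores=8              # dual quad core CPU
ram_mb=8192          # 8 GB RAM
daemons=4            # assumed: DataNode, TaskTracker, region server, ZooKeeper
daemon_heap_mb=1024  # rule of thumb: 1 GB heap and 1 core per daemon
task_heap_mb=200     # default heap per mapper or reducer task

# What is left after the daemons take their share.
free_cores=$((cores - daemons))
free_mb=$((ram_mb - daemons * daemon_heap_mb))

# Task slots are limited by whichever resource runs out first.
slots_by_mem=$((free_mb / task_heap_mb))
if [ "$free_cores" -lt "$slots_by_mem" ]; then
  task_slots=$free_cores
else
  task_slots=$slots_by_mem
fi

echo "free cores: $free_cores, free memory: $free_mb MB, task slots: $task_slots"
```

<p>With these assumptions CPU cores, not memory, are the limit. Raising the region server heap to 2 GB or 4 GB shifts the balance, so it is worth rerunning the numbers for your own hardware.</p>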
<p>On to a quick install:</p>
<p>1) On each server, install the core HBase RPMs: hbase, hbase-native, hbase-master, hbase-regionserver, hbase-zookeeper, hbase-conf-pseudo, hbase-docs.</p>
<p>2) On each server, create the cluster configuration and use &#8216;alternatives&#8217; to enable it.</p>
<p>Create the configuration:</p>
<p><code>% mkdir /etc/hbase-0.20/conf.my_cluster<br />
% cp /etc/hbase-0.20/conf.pseudo/* /etc/hbase-0.20/conf.my_cluster<br />
% vi /etc/hbase-0.20/conf.my_cluster/hbase-site.xml</code></p>
<p>Set up the ZooKeeper quorum:</p>
<p><code>&lt;property&gt;<br />
&lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;<br />
&lt;value&gt;host1,host2,host3&lt;/value&gt;<br />
&lt;/property&gt;</code></p>
<p>Point HBase root to a folder to be created in HDFS:</p>
<p><code>&lt;property&gt;<br />
&lt;name&gt;hbase.rootdir&lt;/name&gt;<br />
&lt;value&gt;hdfs://namenode:nnport/hbase&lt;/value&gt;<br />
&lt;/property&gt;</code></p>
<p>Note: Do not create this folder yourself.</p>
<p>Enable distributed operation:</p>
<p><code>&lt;property&gt;<br />
&lt;name&gt;hbase.cluster.distributed&lt;/name&gt;<br />
&lt;value&gt;true&lt;/value&gt;<br />
&lt;/property&gt;</code></p>
<p>On small clusters reduce DFS replication to speed writes:</p>
<p><code>&lt;property&gt;<br />
&lt;name&gt;dfs.replication&lt;/name&gt;<br />
&lt;value&gt;2&lt;/value&gt;<br />
&lt;/property&gt;</code></p>
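<p>Put together, the four properties above could be written out in one go. This sketch writes to a temporary directory rather than /etc/hbase-0.20/conf.my_cluster so it can be tried safely; the host names and the namenode:nnport address are the same placeholders used above and must be replaced with your own values.</p>

```shell
#!/usr/bin/env bash
# Temporary stand-in for /etc/hbase-0.20/conf.my_cluster.
conf_dir=$(mktemp -d)
site="$conf_dir/hbase-site.xml"

# Write a complete hbase-site.xml containing the properties shown above.
cat > "$site" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>host1,host2,host3</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:nnport/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

# Quick sanity check: count the properties that were written.
property_count=$(grep -c '<property>' "$site")
echo "wrote $property_count properties to $site"
```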
<p>Use the new configuration:</p>
<p><code>% alternatives --install /etc/hbase-0.20/conf hbase-0.20-conf \<br />
/etc/hbase-0.20/conf.my_cluster 50</code></p>
<p>3) Bring Hadoop HDFS up as you would normally.</p>
<p>4) On all cluster nodes, start ZooKeeper:</p>
<p><code>% service hbase-zookeeper start</code></p>
<p>5) On the designated master, start the master process:</p>
<p><code>% service hbase-master start</code></p>
<p>6) On the designated backup master, start another master process:</p>
<p><code>% service hbase-master start</code></p>
<p>7) On the designated slaves, start the region server processes:</p>
<p><code>% service hbase-regionserver start</code></p>
<p>8) Anywhere on the cluster, launch the HBase shell and create a table:</p>
<p><code>% su - hadoop<br />
% hbase shell</code></p>
<p>HBase Shell; enter &#8216;<code>help&lt;RETURN&gt;</code>&#8217; for list of supported commands. Version: 0.20.0~1-1.cloudera</p>
<p><code>hbase(main):001:0&gt; create 'TestTable', {NAME=&gt;'test'}<br />
0 row(s) in 3.4460 seconds<br />
hbase(main):002:0&gt;</code></p>
<p>9) Somewhere else on the cluster, launch the HBase shell and describe your new table:</p>
<p><code>% su - hadoop<br />
% hbase shell</code></p>
<p>HBase Shell; enter &#8216;<code>help&lt;RETURN&gt;</code>&#8216; for list of supported commands. Version: 0.20.0~1-1.cloudera</p>
<p><code>hbase(main):001:0&gt; describe 'TestTable'</p>
<p>DESCRIPTION&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ENABLED<br />
{NAME =&gt; 'TestTable', FAMILIES =&gt; [{NAME =&gt; 'test',&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; true<br />
COMPRESSION =&gt; 'NONE', VERSIONS =&gt; '3', TTL =&gt; '2147483647',<br />
BLOCKSIZE =&gt; '65536', IN_MEMORY =&gt; 'false', BLOCKCACHE =&gt;<br />
'true'}]}<br />
1 row(s) in 1.6820 seconds<br />
</code></p>
<p>You are now ready to try out your new HBase installation!</p>
<p><strong>For More Information</strong></p>
<p>Visit the HBase <a href="http://hbase.org/">Website</a> and <a href="http://wiki.apache.org/hadoop/Hbase">Wiki</a>.</p>
<p><a href="http://wiki.apache.org/hadoop/Hbase/PoweredBy">Users</a> and <a href="http://wiki.apache.org/hadoop/SupportingProject">supporting projects</a>.</p>
<p><a href="http://wiki.apache.org/hadoop/HBase/RoadMaps">HBase Roadmap</a></p>
<p><a href="http://hadoop.apache.org/hbase/mailing_lists.html">HBase Mailing List</a></p>
<p>IRC Channel: #hbase on Freenode</p>
<p>Committers and core contributors are here on a regular basis. More active than the Hadoop forums!</p>
<p>Follow us on Twitter: @hbase</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2009/09/hbase-available-in-cdh2/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

