<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; HBase</title>
	<atom:link href="http://www.cloudera.com/blog/category/hbase/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Apache HBase 0.94 is now released</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/#comments</comments>
		<pubDate>Wed, 16 May 2012 16:58:52 +0000</pubDate>
		<dc:creator>Himanshu Vashishtha</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase features]]></category>
		<category><![CDATA[HBase release]]></category>
		<category><![CDATA[HBase Update]]></category>
		<category><![CDATA[Real-time Hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14484</guid>
		<description><![CDATA[Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focuses were on performance enhancements and the addition of new features (Also, several major bug fixes). Performance Related JIRAs Below are a few of the important performance related JIRAs: [...]]]></description>
			<content:encoded><![CDATA[<p>Apache HBase 0.94.0 has been released! This is the first major release since HBase 0.92 on January 22nd. The 0.94.0 release focuses on performance enhancements and new features, and it also includes several major bug fixes.</p>
<h2>Performance Related JIRAs</h2>
<p>Below are a few of the important performance related JIRAs:</p>
<ul>
<li title="HBASE-5074"><strong>Read caching improvements:</strong> HDFS stores data in one block file and the corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one for the data file and one for the checksum file. <a title="HBASE-5074" href="https://issues.apache.org/jira/browse/HBASE-5074">HBASE-5074</a>: &#8220;Support checksums in HBase block cache&#8221; adds a block-level checksum to the HFile itself, which avoids one disk op and boosts read performance. This feature is <em>enabled</em> by default.</li>
<li><strong>Seek optimizations:</strong> Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each of those files and merge the results, even if the row/column being looked up was in the most recent file. <a title="HBase-4465" href="https://issues.apache.org/jira/browse/HBASE-4465" target="_blank">HBASE-4465</a>: &#8220;Lazy Seek optimization of StoreFile Scanners&#8221; optimizes scanner reads to read the <em>most recent</em> StoreFile first by <em>lazily seeking</em> the StoreFiles. This is achieved by introducing a fake KeyValue whose timestamp equals the maximum timestamp present in the particular StoreFile. A disk seek is thus avoided until the KeyValueScanner for a StoreFile bubbles up to the top of the heap, at which point a real read is needed. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is <em>enabled</em> by default.</li>
<li><strong>Write to WAL optimizations: </strong>HBase write throughput is bounded above by the write rate of the WAL, because the log is replicated to a number of datanodes determined by the replication factor. <a title="HBase-4608" href="https://issues.apache.org/jira/browse/HBASE-4608" target="_blank">HBASE-4608</a>: &#8220;HLog Compression&#8221; adds a custom dictionary-based compression of HLogs for faster replication on HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is <em>off</em> by default.</li>
</ul>
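<p>To make the lazy seek idea more concrete, here is a minimal Python sketch of it. This is not HBase&#8217;s actual implementation: the <code>StoreFile</code> class, the <code>get_latest</code> function and the in-memory cell maps are hypothetical stand-ins, and only the idea of a fake key carrying the file&#8217;s maximum timestamp is taken from the description above.</p>

```python
import heapq


class StoreFile:
    """Toy stand-in for a StoreFile (hypothetical, not the HBase class).

    cells maps (row, column) to {timestamp: value} and is assumed non-empty.
    """

    def __init__(self, name, cells):
        self.name = name
        self.cells = cells
        # Kept as cheap metadata, so reading it costs no disk seek.
        self.max_ts = max(ts for versions in cells.values() for ts in versions)
        self.real_seeks = 0

    def seek(self, row, col):
        """Simulated disk seek: newest (timestamp, value) for the cell, or None."""
        self.real_seeks += 1
        versions = self.cells.get((row, col))
        if not versions:
            return None
        ts = max(versions)
        return ts, versions[ts]


def get_latest(store_files, row, col):
    """Return the newest (timestamp, value) for a cell, seeking files lazily."""
    heap = []
    for index, store_file in enumerate(store_files):
        # Fake key: the file's maximum timestamp is an upper bound on anything
        # the file can return. Fake entries sort ahead of real ones on ties.
        heapq.heappush(heap, (-store_file.max_ts, False, index, store_file, None))
    while heap:
        neg_ts, is_real, index, store_file, value = heapq.heappop(heap)
        if is_real:
            # Every remaining entry has a timestamp no newer than this one.
            return -neg_ts, value
        found = store_file.seek(row, col)  # the real seek happens only here
        if found is not None:
            ts, value = found
            heapq.heappush(heap, (-ts, True, index, store_file, value))
    return None
```

<p>In this sketch, a lookup whose answer sits in the newest file performs one real seek in total; the older files are never touched because their fake keys never reach the top of the heap.</p>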
<h2>New Feature Related JIRAs</h2>
<p>Here is a list of some of the important JIRAs related to adding new features:</p>
<ul>
<li><strong>More powerful first aid box:</strong> The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. <a href="https://issues.apache.org/jira/browse/HBASE-5128" target="_blank">HBASE-5128: &#8220;Uber hbck&#8221;</a> adds these missing features to the first aid box.</li>
<li><strong>Simplified Region Sizing:</strong> Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, and workload. <a title="HBase-4365" href="https://issues.apache.org/jira/browse/HBASE-4365" target="_blank">HBASE-4365</a>: &#8220;Heuristic for Region size&#8221; adds a heuristic that increases the split size threshold of a table&#8217;s regions as the data grows, thus limiting the number of region splits.</li>
<li><strong>Smarter transaction semantics: </strong>Though HBase supports single-row transactions, a series of updates (Puts/Deletes) to an individual row locks the row once for each of these operations. <a title="HBase-3584" href="https://issues.apache.org/jira/browse/HBASE-3584" target="_blank">HBASE-3584</a>: &#8220;Atomic Put &amp; Delete in a single transaction&#8221; enhances the HBase single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is <em>on</em> by default.</li>
</ul>
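<p>As a rough illustration of the region sizing heuristic, the Python sketch below shows one plausible shape for a split threshold that grows with the data. The function name, the squared growth in the number of regions, and the sizes used are assumptions made for illustration; they are not taken from this post, so consult HBASE-4365 for the actual policy.</p>

```python
MB = 1024 * 1024


def split_threshold(regions_of_table_on_server, flush_size_bytes, max_file_size_bytes):
    """Hypothetical growing split threshold, capped at a configured maximum.

    With few regions the threshold is small, so a new table splits early and
    spreads across the cluster. As regions accumulate, the threshold grows
    until it reaches the cap, which limits further splits.
    """
    grown = flush_size_bytes * regions_of_table_on_server ** 2
    return min(grown, max_file_size_bytes)


if __name__ == "__main__":
    # Assumed example sizes: 128 MB flush size, 10 GB maximum region size.
    for regions in (1, 2, 3, 10):
        size = split_threshold(regions, 128 * MB, 10 * 1024 * MB)
        print(regions, "regions: split at", size // MB, "MB")
```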
<p>This major release has a number of new features and bug fixes: a total of <a title="397 resolved jiras" href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;jqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.94.0%22+AND+resolution+%3D+Fixed+ORDER+BY+priority+DESC&amp;mode=hide" target="_blank">397 resolved JIRAs</a>, including 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up an opportunity to backport some of these features to CDH4, which is based on the 0.92 branch.</p>
<h2>Acknowledgements</h2>
<p>Thanks to everyone who contributed to this release and a hat tip to Lars Hofhansl of Salesforce for being the release manager.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How Treato Analyzes Health-related Social Media Big Data with Hadoop and HBase</title>
		<link>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/#comments</comments>
		<pubDate>Thu, 03 May 2012 13:00:51 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[Cloudera Case Study]]></category>
		<category><![CDATA[Hadoop Case Study]]></category>
		<category><![CDATA[Hadoop in Healthcare]]></category>
		<category><![CDATA[hadoop use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14627</guid>
		<description><![CDATA[This is a guest post by Assaf Yardeni, Head of R&#38;D for Treato, an online social healthcare solution, headquartered in Israel. Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post by Assaf Yardeni, Head of R&amp;D for Treato, an online social healthcare solution, headquartered in Israel. </em></p>
<p>Three years ago I joined <a href="http://treato.com/" target="_blank">Treato</a>, a social healthcare analysis firm, to help <a href="http://www.treato.com/" target="_blank">treato.com</a> scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data. Treato is Big Data par excellence, and my job has been to bring Treato to this stage.</p>
<h2 style="font-size: 14pt; color: #243543;">Before the Hadoop era</h2>
<p>When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:</p>
<ul>
<li>Crawl the Web and fetch raw HTML sources,</li>
<li>Extract the user-generated content (i.e., users’ posts) out of the raw sources,</li>
<li>Extract concepts from the posts and index them,</li>
<li>Execute semantic analysis on the posts using natural language processing (NLP) algorithms,</li>
<li>And calculate statistics.</li>
</ul>
<p>The prototype was able to prove the initial hypothesis that relevant medical insights can be found in social media; you just have to know how to analyze it. We collected tens of millions of individual social media posts from dozens of websites. We had a handful of text analysis algorithms and could only process a couple million posts per day, but the results were impressive. We found that we were able to identify side effects through social media long before the FDA or pharmaceutical companies issued initial warnings about them. For example, when we looked at the discussions about Singulair (an asthma medication), we found that almost half of the user generated content discussed mental disorders. When we looked back through the historical data, we learned that this would have been identifiable in our data four years before the official warning.</p>
<p>In order to gain even more health-related insights, we knew we needed a solution that could crawl and process a larger quantity of data – larger by an order of magnitude. That was the point at which Web scale joined the game. In order to collect massive amounts of posts, we needed to add thousands of data sources. And, of course, all the data we collected would need to be analyzed.</p>
<p>Dealing with a few dozen websites was difficult and costly. But we were able to scale up our Microsoft code to handle collection from several hundred sites, and could process around 250 million posts. We were running a few old IBM boxes that did the collection work and had developed a job manager that administered crawling and fetching tasks. Different servers ran the indexing and the stats calculations, and we had developed a distributed job manager to direct task executions. Different servers were used for serving the data. We didn&#8217;t have any storage solution, and all of the boxes worked with local drives.</p>
<p>Besides the fact that administering the process was hell, it was expensive in terms of CPU, network and input/output (I/O); e.g., after each stage, the data needed to be moved to a different server for the next stage. In addition, our job manager didn’t deal with failures; every time a task failed we needed to handle it manually. Needless to say, supporting collection and analysis of thousands of websites would have been impossible using this approach.</p>
<h2 style="font-size: 14pt; color: #243543;">Looking at scale</h2>
<p>In the beginning of 2010, we started searching for solutions that could support the capabilities we wanted. The requirements included:</p>
<ol>
<li>Reliable and scalable storage.</li>
<li>Reliable and scalable processing infrastructure.</li>
<li>Search engine capabilities (for retrieving posts) with high availability (HA).</li>
<li>Scalable real-time store for retrieving stats, with HA.</li>
</ol>
<p>We wanted the ability to periodically reprocess the data in a timely manner, so new algorithms or other analysis improvements would take effect on all historical data.</p>
<p>We wanted to know how much it costs to deal with X number of posts, and to be able to scale according to this formula.</p>
<p>We wanted a technology and architecture that would scale with the business.</p>
<p>We searched for answers to questions such as: &#8220;How does Google do it?” and it didn&#8217;t take too long to find Google&#8217;s papers, documentation on Hadoop and MapReduce, and so on.</p>
<p>We started digging deeper in these areas. After a short investigation, it was clear that the Hadoop Distributed File System (HDFS) would support our storage demands, and MapReduce would be a good fit for the processing infrastructure.</p>
<h2 style="font-size: 14pt; color: #243543;">First Hadoop cluster in the lab</h2>
<p>While looking for Hadoop distributions, I encountered <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a> (CDH). However, I decided to start with a manual installation, since this usually helps me better understand how things work. We started a pilot by setting up a 2-node cluster on Linux boxes. As mentioned, the first installation was done entirely by hand, using the binaries downloaded from Apache and carefully configuring the system. This process was ugly: I needed to download all sorts of binaries from different sources, deal with networking issues, exchange SSH keys between the nodes, format the file system and apply all sorts of OS tweaks.</p>
<p>We started testing the behavior of the new technology, first with some simple WordCount and pi calculations, and then we quickly wrote MapReduce (Java) code that did parts of our processing and tested it on real HTML sources. The little cluster just worked: I was able to submit jobs &amp; monitor them; I tested recovery from task failures, crash of a node, etc.</p>
<p>Next, I wanted to see how this Hadoop solution scaled. To do this, I installed an additional box and added it to our little Hadoop cluster. It was awesome: after adding the new slave to the cluster, everything was transparent. Suddenly we had more capacity on the file system and more horsepower for processing. The job submission was the same as before; the job submitter (Hadoop client) didn&#8217;t even know that the cluster had changed, it simply got the results quicker. We were able to crunch some numbers and got a dollar-per-post cost.</p>
<p>So, the evaluation was great, but still there was the awful installation and maintenance process. That’s when we started to test <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a>; I think it was version 2 of CDH back then. We re-installed our little cluster from scratch using this Hadoop distribution. The installation process was much easier, and the documentation helped. The setup took only a couple of hours. (CDH3 takes less than an hour.)</p>
<p>After we found a good package, we wanted to set up a bigger cluster for prototyping, and deeper tests and evaluations. Amazon seemed to be the perfect place for that. Using CDH, we set up a 10-node cluster (small instances) on EC2. This was used for performance evaluation, and the processing rate was about 10M-20M posts per day, approximately 6 times higher than the performance of our pre-Hadoop solution.</p>
<p>We decided to go with Hadoop. This was a dramatic decision, as we took a company with a Microsoft-oriented development team, ported all the code into Java, all the while adopting a new and very complicated technology stack. This actually meant starting implementation from the beginning, opening a new integrated development environment (IDE) and starting to code from scratch. </p>
<p>In order to reduce risks and avoid critical mistakes, we searched for someone who had &#8220;been there, done that&#8221; so we could learn from them and validate our planned new architecture. Cloudera was our first choice; it made sense to go with a company that has multiple setups behind it, some of which are very large clusters. Cloudera sent Solutions Architect Lars George to our offices for two days, and we gave him our suggested design in advance. We felt lucky to have Lars, an HBase committer and author of <a href="http://shop.oreilly.com/product/0636920014348.do" target="_blank"><em>HBase: The Definitive Guide</em></a>, as our consultant, since HBase was one of the core technologies we were using.</p>
<p>For the first implementation phase, we decided to go with HDFS, MapReduce &amp; HBase. Our in-house-developed crawlers were using HBase as the store for the list of URLs to be fetched. This table needs to scale to billions of rows. The fetcher (the component in charge of fetching the raw HTML sources) gets the URL queues out of HBase, runs HTTP requests, and stores the raw HTML sources in large files on top of HDFS (a few gigabytes per file). Neither the crawler nor the fetcher uses a relational database or any other kind of store except HDFS &amp; HBase. These two components are network and I/O intensive, but CPU is not much of an issue.</p>
<p>Next comes the processing. Each line in the HDFS files contains an HTML source and metadata related to this source. For each directory of files in HDFS, the following processing jobs need to be executed:</p>
<ol>
<li>Turn the unstructured HTML into a list of post entities (content, timestamp, etc.).</li>
<li>Process each post as follows:
<ul>
<li>Index key terms: extract medical concepts out of the post content, using Treato&#8217;s extensive knowledge base.</li>
<li>Execute text analysis algorithms.</li>
</ul>
</li>
<li>Calculate all statistics and update the HBase stats tables.</li>
<li>Post all documents (users’ posts) into our search engine (Solr).</li>
</ol>
<p>During this process, many database queries and updates are needed. For example, each post retrieved may potentially already exist in our system, and of course we don&#8217;t want to add a duplicate post to our system, nor invest processing power on documents we already have. In order to accomplish this, we need to calculate a hash for each post, and then check it against a database containing all of the existing hashes. For this purpose HBase works perfectly in terms of both latency and load.</p>
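<p>The duplicate check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>SeenHashes</code> class is a hypothetical in-memory stand-in for the table of existing hashes, and the choice of SHA-1 and the whitespace and case normalization are assumptions rather than a description of our production code.</p>

```python
import hashlib


def post_hash(content):
    """Hash a post's text; the normalization step is an assumed choice."""
    normalized = " ".join(content.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenHashes:
    """Hypothetical in-memory stand-in for the store of existing hashes."""

    def __init__(self):
        self._rows = set()

    def check_and_add(self, row_key):
        """Return True if the key was new (and record it), False if already present."""
        if row_key in self._rows:
            return False
        self._rows.add(row_key)
        return True


def filter_new_posts(posts, seen):
    """Keep only posts whose hash has not been seen, so duplicates skip processing."""
    return [post for post in posts if seen.check_and_add(post_hash(post))]
```

<p>With a real store, the check and the insert would have to happen as one atomic step per row, otherwise two workers could both accept the same post.</p>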
<p>After the design phase, we started implementation. All R&amp;D teams worked on porting their code into Java, and our Ops team worked on planning the data center (we decided on co-location data center setup).</p>
<p>For the initial setup, we had 11 boxes that comprised our Hadoop cluster, two of which were NameNodes in an active/passive mode (one was on standby for manual failover in case the active NameNode failed). Nine nodes were slaves hosting DataNode, TaskTracker and RegionServer daemons. In addition, we had three boxes running ZooKeeper services.</p>
<p>The new system was capable of analyzing 50M posts per day. This was a significant performance improvement. In addition, it was reasonably easy to maintain, was reliable, and worked quite smoothly. Of course, there were bumps in the road, but in the end we managed to overcome them all.</p>
<p>We have continued to improve and expand the solution, and today we can process 150 – 200 million user posts per day. In subsequent blog posts, I will share in greater detail our system design, use of HBase, and cluster architecture.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Operations Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 13:00:03 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14471</guid>
		<description><![CDATA[HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out. For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.</p>
<div style="float: right; padding-left: 12px; padding-top: 16px;"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low latency and high scalability. Presentations in the HBaseCon 2012 Operations track will explain the state of HBase today, how to mitigate HBase failures, and best practices in cluster deployment and cluster monitoring.</p>
<h2 style="font-size: 18pt;">Operations Track Presentations</h2>
<p style="padding-top: 8px;"><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Case Study of HBase Operations at Facebook</span></a><br /> <a href="http://www.hbasecon.com/speakers/ryan-thiessen/">Ryan Thiessen</a>, Facebook</p>
<p>At Facebook we have demanding HBase installations which are used for important and real-time user activity, so failure in an HBase cluster can be a serious issue requiring immediate attention. This session will discuss a variety of real-world scenarios where we have had failures in our HBase systems, how our Operations and Engineering teams have worked to mitigate many of these issues, and where HBase still needs to improve instead of relying on workarounds. The database should never go down. This talk is aimed at developers and other users of HBase (both current and potential) who are interested in an operational perspective on the state of HBase today.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Backup</span></a><br /> <a href="http://www.hbasecon.com/speakers/sunil-sitaula/">Sunil Sitaula</a>, Cloudera<br /><a href="http://www.hbasecon.com/speakers/madhuwanti-vaidya/">Madhuwanti Vaidya</a>, Facebook</p>
<p>Reliable backup and recovery is one of the main requirements for any enterprise-grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such they are looking for backup solutions that are reliable, easy to use, and can work with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve the performance of the backup and recovery process with minimal impact on the production cluster.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Security for the Enterprise</span></a><br /> <a href="http://www.hbasecon.com/speakers/andrew-purtell/">Andrew Purtell</a>, Trend Micro</p>
<p>Trend Micro developed the new security features in HBase 0.92 and has the first known deployment of secure HBase in production. We will share our motivations, use cases, experiences, and provide a 10 minute tutorial on how to set up a test secure HBase cluster and a walk through of a simple usage example. The tutorial will be carried out live on an on-demand EC2 cluster, with a video backup in case of network or EC2 unavailability.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Developing Real Time Analytics Applications Using HBase in the Cloud</span></a><br /> <a href="http://www.hbasecon.com/speakers/rick-tucker/">Rick Tucker</a>, Sproxil</p>
<p>As small companies are adapting to handle Big Data, the cloud and HBase enable developers to leverage that data to provide revenue generating real-time applications. When developing a real-time application for an existing system, one must balance incrementing counters in real-time with MapReduce jobs over the same data set. When maintaining an analytics platform, ensuring data accuracy is essential. At Sproxil, SMS logs are ingested into HBase at a growing rate and we report metrics such as SMS throughput, unique user growth over time, and return SMS user activity in real time. Sproxil provides a versatile analytics application enabling customers to handpick statistics on demand to gain market insights enabling them to react quickly to trends. This talk will identify the most profitable metrics and demonstrate how to calculate them using MapReduce while continually updating data as it arrives.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Unique Sets on HBase and Hadoop</span></a><br /> <a href="http://www.hbasecon.com/speakers/elliott-clark/">Elliott Clark</a>, StumbleUpon</p>
<p>Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Orchestrating Clusters with Ironfan and Chef</span></a><br /> <a href="http://www.hbasecon.com/speakers/robert-berger/">Robert Berger</a>, Runa</p>
<p>This session will discuss how you can represent your complete cluster with one config file and have it deployed to Cloud or Bare Metal. Infochimps’ Ironfan builds on Opscode Chef to allow you to specify and orchestrate all flavors of your cluster’s deployment, monitoring and growth. Not just the core HBase/HDFS/MapReduce/Hive/Flume, etc. but all the elements including web / app servers, mysql, redis, rabbitmq and whatever other servers are needed to implement your service. These same tools can manage variations for development, staging, R&amp;D as well as the target “rendering” to various Clouds, Bare Metal or even Vagrant VMs.</p>
<p><a href="http://hbaseconsf.eventbrite.com/" target="_blank"><img src="http://www.hbasecon.com/wp-content/uploads/2012/02/btn-register-small.png" alt="Register for HBaseCon 2012" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Development Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 22:46:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hbase community]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14331</guid>
		<description><![CDATA[HBaseCon 2012 is nearly a month away, and if the conference agenda and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss. Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/" target="_blank" title="HBaseCon 2012">HBaseCon 2012</a> is nearly a month away, and if the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">conference agenda</a> and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss.</p>
<p>Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the <a href="http://www.hbasecon.com/program-committee" target="_blank" title="HBaseCon 2012 Program Committee">HBaseCon Program Committee</a>, constructors of the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">HBaseCon 2012 agenda</a>, are all committers and PMC members of the Apache HBase project.</p>
<div style="float:right;padding-left:12px"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>Presentations in the HBaseCon 2012 Development track will explain how and why HBase is built the way it is and will also cover HBase schema design and HDFS, the file system on which HBase is most commonly deployed. Some of the presentations in this track are listed below.</p>
<h2 style="font-size:16pt">Development Track Presentations</h2>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Learning HBase Internals</span></a><br />
<a href="http://www.hbasecon.com/speakers/lars-hofhansl/">Lars Hofhansl</a>, Salesforce.com</p>
<p>The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users,” and give voice to some of the deep knowledge locked in the committers’ heads.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lessons learned from OpenTSDB</span></a><br />
<a href="http://www.hbasecon.com/speakers/benoit-sigoure/">Benoit Sigoure</a>, StumbleUpon</p>
<p>OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created, one that can store and serve billions of data points forever without the need for destructive downsampling, one that could scale to millions of metrics, and where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and what were some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics discussed will be around the schema, how it impacts performance and allows concurrent writes without need for coordination in a distributed cluster of OpenTSDB instances.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase Schema Design</span></a><br />
<a href="http://www.hbasecon.com/speakers/ian-varley/">Ian Varley</a>, Salesforce.com</p>
<p>Most developers are familiar with the topic of “database design.” In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase and HDFS: Past, Present, and Future</span></a><br />
<a href="http://www.hbasecon.com/speakers/todd-lipcon/">Todd Lipcon</a>, Cloudera</p>
<p>Apache HDFS, the file system on which HBase is most commonly deployed, was originally designed for high-latency high-throughput batch analytic systems like MapReduce. Over the past two to three years, the rising popularity of HBase has driven many enhancements in HDFS to improve its suitability for real-time systems, including durability support for write-ahead logs, high availability, and improved low-latency performance. This talk will give a brief history of some of the enhancements from Hadoop 0.20.2 through 0.23.0, discuss some of the most exciting work currently under way, and explore some of the future enhancements we expect to develop in the coming years. We will include both high-level overviews of the new features as well as practical tips and benchmark results from real deployments.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Relaxed Transactions for HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/francis-liu/">Francis Liu</a>, Yahoo!</p>
<p>For Map/Reduce programmers used to HDFS, the mutability of HBase tables poses new challenges: Data can change over the duration of a job, multiple jobs can write concurrently, writes are effective immediately, and it is not trivial to clean up partial writes. Revision Manager introduces atomic commits and point-in-time consistent snapshots over a table, guaranteeing repeatable reads and protection from partial writes. Revision Manager is optimized for a relatively small number of concurrent write jobs, which is typical within Hadoop clusters. This session will discuss the implementation of Revision Manager using ZooKeeper and coprocessors, with extra care taken to ensure security in multi-tenant clusters. Revision Manager is available as part of the HBase storage handler in HCatalog, but can easily be used stand-alone with little coding effort.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Living Data: Applying Adaptable Schemas to HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/aaron-kimball/">Aaron Kimball</a>, WibiData</p>
<p>HBase application developers face a number of challenges: schema management is performed at the application level, decoupled components of a system can break one another in unexpected ways, less-technical users cannot easily access data, and evolving data collection and analysis needs are difficult to plan for. In this talk, we describe a schema management methodology based on Apache Avro that enables users and applications to share data in HBase in a scalable, evolvable fashion. By adopting these practices, engineers independently using the same data have guarantees on how their applications interact. As data collection needs change, applications are resilient to drift in the underlying data representation. This methodology results in a data dictionary that allows less-technical users to understand what data is available to them for analysis and inspect data using general-purpose tools (for example, export it via Sqoop to an RDBMS). And because of Avro’s cross-language capabilities, HBase’s power can reach new domains, like web apps built in Ruby.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HBase Hackathon at Cloudera</title>
		<link>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/#comments</comments>
		<pubDate>Fri, 06 Apr 2012 23:32:08 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Apache HBase]]></category>
		<category><![CDATA[HBase Hackathon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14127</guid>
		<description><![CDATA[Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012.  The overall theme of the event will be 0.96 stabilization.  If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon.  This is a great [...]]]></description>
			<content:encoded><![CDATA[<p>Cloudera will be hosting an Apache HBase <a title="HBase hackathon Meetup page " href="http://www.meetup.com/hackathon/events/58953522/" target="_blank">hackathon</a> on May 23rd, 2012, the day after <a title="HBaseCon 2012" href="http://hbasecon.com" target="_blank">HBaseCon 2012</a>.  The overall theme of the event will be 0.96 stabilization.  If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon.  This is a great opportunity to contribute some code towards the project and hang out with other HBasers.</p>
<p>More details are on the hackathon&#8217;s <a title="HBase hackathon Meetup page" href="http://www.meetup.com/hackathon/events/58953522/" target="_blank">Meetup</a> page.  Please RSVP so we can better plan lunch, room size, and other logistics for the event.  See you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Applications Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 13:00:32 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBaseCon]]></category>
		<category><![CDATA[Hbasecon sessions]]></category>
		<category><![CDATA[hbasecon talks]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14074</guid>
		<description><![CDATA[HBaseCon 2012 is coming to San Francisco on May 22, less than 2 months away! The conference agenda continues to grow daily with exciting presentation content, which means it’s time to share a few sessions that have been added to the HBaseCon 2012 Applications Track. Apache HBase is primarily used for real-time random read/write access [...]]]></description>
			<content:encoded><![CDATA[<div style="float:left;padding-right:20px"><a href="http://www.hbasecon.com"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p style="padding-left:220px"><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is coming to San Francisco on May 22, less than 2 months away! The <a href="http://www.hbasecon.com/agenda">conference agenda</a> continues to grow daily with exciting presentation content, which means it’s time to share a few sessions that have been added to the HBaseCon 2012 Applications Track.</p>
<p>Apache HBase is primarily used for real-time random read/write access to Big Data as part of the Hadoop ecosystem. Applications on Apache HBase are typically built to query Big Data with extremely low latency. Sessions in the HBaseCon 2012 Applications Track will include explanations of real-world HBase use cases, where HBase fits in an organization’s entire Big Data stack, and when HBase is the “right” solution for an organization.</p>
<h2 style="font-size:16pt">Applications Track Presentations</h2>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Building a Large Search Platform on a Shoestring Budget</span></a><br />
<a href="http://www.hbasecon.com/speakers/jacques-nadeau/">Jacques Nadeau</a>, CTO at YapMap</p>
<p>YapMap is a new kind of search platform that does multi-quanta search to better understand threaded discussions. This talk will cover how HBase made it possible for two self-funded guys to build a new kind of search platform. The presentation will discuss the YapMap data model and how YapMap uses row-based atomicity to manage parallel data integration problems. Also learn where YapMap does not use HBase and instead uses a traditional SQL-based infrastructure; the benefits of using MapReduce and HBase for index generation; the YapMap migration of tasks from a message-based queue to the Coprocessor framework; and YapMap’s future Coprocessor use cases. Lastly, learn about YapMap’s operational experience with HBase, hardware choices and the challenges YapMap has faced.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Low Latency “OLAP” with HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/cosmin-lehene/">Cosmin Lehene</a>, Computer Scientist at Adobe Systems</p>
<p>Adobe Systems uses “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in real-time. The goal was to process new data in real-time (currently minutes) and have it ready for a large number of concurrent queries that execute in milliseconds. This set Adobe’s problem apart from what is traditionally solved with Hive or Pig. This talk will describe the design and the strategies (and hacks) used to achieve low latency and scalability, from theoretical model to the entire process of ETL to warehousing and queries.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Growing Your Inbox, HBase at Tumblr</span></a><br />
<a href="http://www.hbasecon.com/speakers/blake-matheny/">Blake Matheny</a>, Director of Platform Engineering at Tumblr</p>
<p>This talk goes into detail about Tumblr’s experience developing Motherboy, an eventually consistent, inbox-style storage system built around HBase. The SLA, write concurrency, data volume, and failure modes for this application created a number of challenges in developing a solution. The user homing scheme introduced additional complexity that made capacity planning tricky as Tumblr tried to trade off availability and cost. Performance testing of the workload, and automation to support that testing, also provided a number of valuable lessons. This talk will be most useful to people considering HBase for their application, but will have enough detail to be useful to current HBase users as well.</p>
<p><a href="http://hbaseconsf.eventbrite.com/" target="_blank"><img src="http://www.hbasecon.com/wp-content/uploads/2012/02/btn-register-small.png" alt="Register for HBaseCon 2012" /></a></p>
<p>Be sure to check the <a href="http://www.hbasecon.com/agenda">agenda</a> in the coming weeks, as we will be adding more sessions soon. Remember that the Early Bird registration price expires this Friday, April 6, so register soon to take advantage of the discount.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-applications-track/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>March 2012 Bay Area HBase User Group meetup summary</title>
		<link>http://www.cloudera.com/blog/2012/03/march-2012-bay-area-hbase-user-group-meetup-summary/</link>
		<comments>http://www.cloudera.com/blog/2012/03/march-2012-bay-area-hbase-user-group-meetup-summary/#comments</comments>
		<pubDate>Fri, 30 Mar 2012 16:36:07 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Apache Meetup]]></category>
		<category><![CDATA[HBase Meetup]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13898</guid>
		<description><![CDATA[The Bay Area HBase User Group March 2012 meetup was held at the StumbleUpon offices in San Francisco, California. 80 interested HBasers were in attendance to mingle and listen to the scheduled presentations. Michael Stack started the meetup by reminding folks to register for HBaseCon 2012 in San Francisco on May 22nd.  Nick Dimiduk and Cloudera&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>The <a title="Bay Area HBase User Group homepage" href="http://www.meetup.com/hbaseusergroup/" target="_blank">Bay Area HBase User Group</a> March 2012 <a title="Bay Area HBase User Group March 2012 meetup homepage" href="http://www.meetup.com/hbaseusergroup/events/56021562/" target="_blank">meetup</a> was held at the StumbleUpon offices in San Francisco, California. <strong>80</strong> interested HBasers were in attendance to mingle and listen to the scheduled presentations.</p>
<p><strong>Michael Stack</strong> started the meetup by reminding folks to register for <a title="HBaseCon 2012" href="http://hbasecon.com" target="_blank">HBaseCon 2012</a> in San Francisco on May 22nd.  <strong>Nick Dimiduk</strong> and Cloudera&#8217;s <strong>Amandeep Khurana</strong> then announced an early access program for their upcoming book, <a title="HBase In Action homepage, including early access program details" href="http://www.manning.com/dimidukkhurana" target="_blank">HBase In Action</a>.  Interested folks can get a discount for the program by using the code &#8220;hbase38.&#8221;</p>
<p>St.Ack then discussed various recent releases (<a title="Bay Area HBase User Group March 2012 introductory slides" href="http://files.meetup.com/1350427/20120327hbase_meetup.pdf" target="_blank">link to slides</a>):</p>
<ul>
<li>0.90.6 (<a href="http://www.cloudera.com/blog/2012/03/apache-hbase-0-90-6-is-now-available/">link to previous blog</a>) and 0.92.1 (<a href="http://www.cloudera.com/blog/2012/03/apache-hbase-0-92-1-now-available/">link to previous blog</a>) were officially released in mid-late March.</li>
<li>0.94.0 is a performance-oriented release currently in its first RC. Many improvements from Facebook and others made it into the release, which should support rolling restart and be backwards compatible with 0.92.</li>
<li>Trunk is now 0.96. It will contain changes to enable wire compatibility, and will <em>not</em> be backwards-compatible with previous releases; hence its nickname of the &#8220;singularity&#8221;. However, releases after 0.96 should be compatible within one major release.</li>
</ul>
<p>Folks then commenced their presentations:</p>
<h2>Moving HBase RPC to protobufs</h2>
<p><strong>Jimmy Xiang</strong> &amp; <strong>Gregory Chanan</strong> from Cloudera talked about the background, motivation, and goals for the ongoing Apache HBase wire compatibility effort.  They also focused on the requirements, compatibility matrix, proposal and work breakdown. <a title="Bay Area HBase User Group March 2012 wire compatibility presentation" href="http://files.meetup.com/1350427/wire-compat%20%281%29.pptx" target="_blank">Slides</a> are available.</p>
<h2>Comparing the native HBase client and asynchbase</h2>
<p>StumbleUpon&#8217;s <strong>Benoît Sigoure</strong> gave an overview of <a title="asynchbase github homepage" href="https://github.com/stumbleupon/asynchbase" target="_blank">asynchbase</a>, a fully non-blocking HBase client. He touched upon its features and how it is used at StumbleUpon, and presented some encouraging performance results.</p>
<h2>Using Apache Hive with HBase: Recent improvements</h2>
<p>Finally, <strong>Enis Soztutar</strong> from Hortonworks gave a presentation about Hive and HBase integration. He covered architecture and some use cases, along with some discussion of schemas and data type mappings.</p>
<p>Afterwards, <strong>Shaneal Manek</strong> from Cloudera and <strong>Jesse Yates</strong> from Salesforce held a huddle about backups and snapshots in HBase. A summary is presented in <a title="HBase backup/snapshot huddle summary" href="https://issues.apache.org/jira/browse/HBASE-50?focusedCommentId=13240157&amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13240157" target="_blank">HBASE-50</a>.</p>
<p>Please note that an Apache HBase PMC meeting took place right before the meetup; comprehensive <a title="Apache HBase PMC meeting minutes" href="https://blogs.apache.org/hbase/entry/hbase_project_management_committee_meeting" target="_blank">minutes</a> are available.</p>
<p>Thanks to StumbleUpon for hosting the meetup and providing the food and drink!</p>
<p style="text-align:center;padding-top:12px"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/march-2012-bay-area-hbase-user-group-meetup-summary/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache HBase 0.92.1 now available</title>
		<link>http://www.cloudera.com/blog/2012/03/apache-hbase-0-92-1-now-available/</link>
		<comments>http://www.cloudera.com/blog/2012/03/apache-hbase-0-92-1-now-available/#comments</comments>
		<pubDate>Fri, 23 Mar 2012 23:26:23 +0000</pubDate>
		<dc:creator>Shaneal Manek</dc:creator>
				<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13770</guid>
		<description><![CDATA[What&#8217;s new? Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It&#8217;s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228. Apache HBase 0.92.1 is a bug fix release covering 61 issues &#8211; [...]]]></description>
			<content:encoded><![CDATA[<h2>What&#8217;s new?</h2>
<p><a href="http://hbase.apache.org/">Apache HBase</a> 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It&#8217;s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in <a href="https://issues.apache.org/jira/browse/HBASE-5228">HBASE-5228</a>.</p>
<p>Apache HBase 0.92.1 is a bug fix release covering <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&amp;requestId=12319405">61 issues</a> &#8211; including 6 blockers and 6 critical issues, such as:</p>
<ul>
<li><a href="https://issues.apache.org/jira/browse/HBASE-5121">HBASE-5121</a> which ensures scans work properly while major compactions are occurring</li>
<li>Several fixes that prevent crashes in edge-case situations (<a href="https://issues.apache.org/jira/browse/HBASE-5279">HBASE-5279</a>, <a href="https://issues.apache.org/jira/browse/HBASE-4890">HBASE-4890</a>, and <a href="https://issues.apache.org/jira/browse/HBASE-5415">HBASE-5415</a>)</li>
<li>Fixing the build system so the release includes <a href="https://issues.apache.org/jira/browse/HBASE-5288">security sources</a> and <a href="https://issues.apache.org/jira/browse/HBASE-5294">Javadocs</a></li>
<li><a href="https://issues.apache.org/jira/browse/HBASE-5267">HBASE-5267</a> which fixes some configuration problems with the <a href="http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/">slab cache</a></li>
</ul>
<div> </div>
<h2>Acknowledgements:</h2>
<p>A big thanks to our tireless release manager, Michael Stack, and everyone who contributed to the release (reporting issues, fixing bugs, reviewing changes, writing documentation, etc.).</p>
<hr />
<p style="float: left; padding-right: 12px; padding-top: 12px;"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" /></a></p>
<p style="padding-bottom: 70px;">Cloudera is hosting the first ever <a title="HBaseCon 2012" href="http://www.hbasecon.com">HBase conference</a> at the InterContinental hotel in San Francisco on May 22, 2012. Register by April 6 to catch the early bird price!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/apache-hbase-0-92-1-now-available/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache HBase 0.90.6 is now available</title>
		<link>http://www.cloudera.com/blog/2012/03/apache-hbase-0-90-6-is-now-available/</link>
		<comments>http://www.cloudera.com/blog/2012/03/apache-hbase-0-90-6-is-now-available/#comments</comments>
		<pubDate>Mon, 19 Mar 2012 22:50:45 +0000</pubDate>
		<dc:creator>Jimmy Xiang</dc:creator>
				<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10764</guid>
		<description><![CDATA[Apache HBase 0.90.6 is now available. It is a bug fix release covering 31 bugs and 5 improvements.  Among them, 3 are blockers and 3 are critical, such as: HBASE-5008: HBase can not provide services to a region when it can&#8217;t flush the region, but considers it stuck in flushing, HBASE-4773: HBaseAdmin may leak ZooKeeper [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hbase.apache.org/">Apache HBase</a> 0.90.6 is now available. It is a bug fix release covering <a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;jqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.90.6%22+AND+resolution+%3D+Fixed+ORDER+BY+priority+DESC&amp;mode=hide" target="_blank">31 bugs and 5 improvements</a>.  Among them, 3 are blockers and 3 are critical, such as:</p>
<ul>
<li><a title="HBASE-5008" href="https://issues.apache.org/jira/browse/HBASE-5008" target="_blank">HBASE-5008</a>: HBase cannot provide services to a region when it can&#8217;t flush the region, but considers it stuck in flushing,</li>
<li><a title="HBASE-4773" href="https://issues.apache.org/jira/browse/HBASE-4773" target="_blank">HBASE-4773</a>: HBaseAdmin may leak ZooKeeper connections,</li>
<li><a title="HBASE-5060" href="https://issues.apache.org/jira/browse/HBASE-5060" target="_blank">HBASE-5060</a>: HBase client may be blocked forever when there is a temporary network failure.</li>
</ul>
<p>This release has improved system robustness and availability by fixing bugs that cause potential data loss, system unavailability, possible deadlocks, read inconsistencies and resource leakage.</p>
<p>The 0.90.6 release is backward compatible with 0.90.5. The fixes in this release will be included in CDH3u4.</p>
<h2 style="font-size:12pt">Acknowledgements:</h2>
<p>A special thanks to everyone who contributed to the release (reporting issues, fixing bugs, reviewing changes, writing documentation, etc.).</p>
<hr />
<p style="float:left;padding-right:12px;padding-top:12px"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" /></a></p>
<p style="padding-bottom:70px">Cloudera is hosting the first ever <a href="http://www.hbasecon.com" title="HBaseCon 2012">HBase conference</a> at the InterContinental hotel in San Francisco on May 22, 2012. Register by April 6 to catch the early bird price! </p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/apache-hbase-0-90-6-is-now-available/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>HBase + Hadoop + Xceivers</title>
		<link>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/</link>
		<comments>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/#comments</comments>
		<pubDate>Wed, 14 Mar 2012 17:00:14 +0000</pubDate>
		<dc:creator>Lars George</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13470</guid>
		<description><![CDATA[Introduction Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called &#8220;dfs.datanode.max.xcievers&#8221;, and belongs to the HDFS subproject. It defines the number of server side threads and &#8211; to some extent &#8211; sockets used for data connections. Setting this number too low can [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called &#8220;dfs.datanode.max.xcievers&#8221;, and belongs to the HDFS subproject. It defines the number of server side threads and &#8211; to some extent &#8211; sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.</p>
<h2>The Problem</h2>
<p>Since HBase stores everything it needs inside HDFS, the hard upper boundary imposed by the &#8220;dfs.datanode.max.xcievers&#8221; configuration property can result in too few resources being available to HBase, manifesting itself as IOExceptions on either side of the connection. Here is an example from the HBase mailing list [1], where the following messages were initially logged on the RegionServer side: </p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2008-11-11 19:55:52,451 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream<br />2008-11-11 19:55:52,451 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-5467014108758633036_595771<br />2008-11-11 19:55:58,455 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.<br />2008-11-11 19:55:58,455 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5467014108758633036_595771 bad datanode[0]<br />2008-11-11 19:55:58,482 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server shutdown</p>
<p style="padding-top:12px">Correlating this with the Hadoop DataNode logs revealed the following entry:</p>
<p style="font-family: 'Courier New', Courier, mono;font-size: small; background-color:#CEE9FF;">ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.10.10.53:50010,storageID=DS-1570581820-10.10.10.53-50010-1224117842339,infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256  </p>
<p style="padding-top:12px">In this example, the low value of &#8220;dfs.datanode.max.xcievers&#8221; for the DataNodes caused the entire RegionServer to shut down. This is a really bad situation. Unfortunately, there is no hard-and-fast rule that explains how to compute the required limit. It is commonly advised to raise the number from the default of 256 to something like 4096 (see [1], [2], [3], [4], and [5] for reference). This is done by adding this property to the hdfs-site.xml file of all DataNodes (note that the property name itself is misspelled: &#8220;xcievers&#8221; rather than &#8220;xceivers&#8221;): </p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&lt;property&gt;<br />    &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;<br />    &lt;value&gt;4096&lt;/value&gt;<br />  &lt;/property&gt;</p>
<p style="padding-top:12px">Note: You will need to restart your DataNodes after making this change to the configuration file.</p>
<p>This should help with the above problem, but you still might want to know more about how this all plays together, and what HBase is doing with these resources. We will discuss this in the remainder of this post. But before we do, we need to be clear about why you cannot simply set this number very high, say 64K and be done with it.</p>
<p>There is a reason for an upper boundary, and it is twofold: first, threads need their own stack, which means they occupy memory. For current servers this means 1MB per thread[6] by default. In other words, if you use up all the 4096 DataXceiver threads, you need around 4GB of memory to accommodate their stacks. Thread stacks are allocated outside the Java heap, so this comes on top of the heap you have assigned for memstores and block caches, as well as all the other moving parts of the JVM. In a worst case scenario, you might run into an OutOfMemoryError, and the affected process is toast. You want to set this property to a reasonably high number, but not too high either.</p>
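<p>The arithmetic behind that estimate can be sketched in a few lines of Java. This is only a back-of-the-envelope calculation, not code from Hadoop: the class name, the method name, and the list of candidate values are made up for illustration, and the 1MB stack size is the default assumed in the text. The actual stack size depends on the JVM, the platform, and the -Xss setting, and the operating system typically commits stack memory lazily, so treat the result as an upper bound.</p>

```java
public class XceiverStackEstimate {
    /** Per-thread stack size assumed in the text: 1MB (an assumption, tunable via -Xss). */
    static final long ASSUMED_STACK_BYTES = 1024L * 1024L;

    /** Upper bound on stack memory if every allowed xceiver thread is in use. */
    static long worstCaseStackBytes(int maxXceivers, long stackBytesPerThread) {
        return maxXceivers * stackBytesPerThread;
    }

    public static void main(String[] args) {
        // Hypothetical candidate values: the default, the commonly advised one, and "very high".
        int[] candidates = {256, 4096, 65536};
        for (int maxXceivers : candidates) {
            long bytes = worstCaseStackBytes(maxXceivers, ASSUMED_STACK_BYTES);
            long megabytes = bytes / (1024L * 1024L);
            System.out.println(maxXceivers + " threads: up to " + megabytes + " MB of stack space");
        }
    }
}
```

<p>With the assumed 1MB stacks, 4096 threads come out at 4096 MB, which matches the 4GB figure above, while a limit of 64K threads would allow for up to 64GB of stack space.</p>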
<p>Second, with this many threads active, you will also see your CPU becoming increasingly loaded. There will be many context switches happening to handle all the concurrent work, which takes away resources from the real work. As with the concerns about memory, you do not want the number of threads to grow boundlessly, but want to provide a reasonable upper boundary &#8211; and that is what &#8220;dfs.datanode.max.xcievers&#8221; is for.</p>
<h2>Hadoop File System Details</h2>
<p>From the client side, the HDFS library provides the abstraction called Path. This class represents a file in a file system supported by Hadoop, which in turn is represented by the FileSystem class. There are a few concrete implementations of the abstract FileSystem class, one of which is the DistributedFileSystem, representing HDFS. This class in turn wraps the actual DFSClient class that handles all interactions with the remote servers, i.e. the NameNode and the many DataNodes.</p>
<p>When a client, such as HBase, opens a file, it does so by, for example, calling the open() or create() methods of the FileSystem class. Here are the most simplistic incarnations:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  public DFSInputStream open(String src) throws IOException<br />  public FSDataOutputStream create(Path f) throws IOException</p>
<p style="padding-top:12px">The returned stream instance is what needs a server-side socket and thread, which are used to read and write blocks of data. They form part of the contract to exchange data between the client and server. Note that there are other, RPC-based protocols in use between the various machines, but for the purpose of this discussion they can be ignored.</p>
<p>The stream instance returned is a specialized DFSOutputStream or DFSInputStream class, which handles all of the interaction with the NameNode to figure out where the copies of the blocks reside, and the data communication per block per DataNode.</p>
<p>On the server side, the DataNode wraps an instance of DataXceiverServer, which is the actual class that reads the above configuration key and also throws the above exception when the limit is exceeded.</p>
<p>When the DataNode starts, it creates a thread group and starts the mentioned DataXceiverServer instance like so:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  this.threadGroup = new ThreadGroup(&quot;dataXceiverServer&quot;);<br />  this.dataXceiverServer = new Daemon(threadGroup,<br />      new DataXceiverServer(ss, conf, this));<br />  this.threadGroup.setDaemon(true); // auto destroy when empty </p>
<p style="padding-top:12px">Note that the DataXceiverServer thread is already taking up one spot of the thread group. The DataNode also has this internal method to retrieve the number of currently active threads in this group:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  /** Number of concurrent xceivers per node. */<br />  int getXceiverCount() {<br />    return threadGroup == null ? 0 : threadGroup.activeCount();<br />  }</p>
<p style="padding-top:12px">Reading and writing blocks, as initiated by the client, causes a connection to be made, which is wrapped by the DataXceiverServer thread into a DataXceiver instance. During this hand-off, a thread is created and registered in the above thread group. So for every active read and write operation a new thread is tracked on the server side. If the count of threads in the group exceeds the configured maximum, then the said exception is thrown and recorded in the DataNode&#8217;s logs:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  if (curXceiverCount &gt; dataXceiverServer.maxXceiverCount) {<br />    throw new IOException(&quot;xceiverCount &quot; + curXceiverCount<br />                          + &quot; exceeds the limit of concurrent xcievers &quot;<br />                          + dataXceiverServer.maxXceiverCount);<br />  }</p>
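<p>To see how this check behaves, here is a minimal, self-contained sketch. It is not the DataNode code: a plain counter stands in for the thread group&#8217;s active count, and the class and method names (XceiverLimitSketch, open(), close()) are hypothetical. The counter starts at one on the assumption, taken from the text above, that the DataXceiverServer thread itself occupies a slot.</p>

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class XceiverLimitSketch {
    private final int maxXceiverCount;
    // Starts at 1: the DataXceiverServer thread itself occupies one slot in the group.
    private final AtomicInteger activeThreads = new AtomicInteger(1);

    public XceiverLimitSketch(int maxXceiverCount) {
        this.maxXceiverCount = maxXceiverCount;
    }

    /** Models the hand-off of a client connection to a new worker thread. */
    public void open() throws IOException {
        int curXceiverCount = activeThreads.incrementAndGet();
        if (curXceiverCount > maxXceiverCount) {
            activeThreads.decrementAndGet(); // the rejected worker goes away again
            throw new IOException("xceiverCount " + curXceiverCount
                + " exceeds the limit of concurrent xcievers " + maxXceiverCount);
        }
    }

    /** Models a block read or write finishing and its thread ending. */
    public void close() {
        activeThreads.decrementAndGet();
    }

    public int getXceiverCount() {
        return activeThreads.get();
    }

    public static void main(String[] args) {
        XceiverLimitSketch node = new XceiverLimitSketch(3);
        try {
            node.open(); // count is now 2
            node.open(); // count is now 3
            node.open(); // count would be 4, which is over the limit of 3
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        System.out.println("active: " + node.getXceiverCount());
    }
}
```

<p>In this model a limit of three leaves room for only two concurrent block operations, because the server thread is counted as well.</p>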
<h2 style="padding-top:12px">Implications for Clients</h2>
<p>Now, the question is: how does the client&#8217;s reading and writing relate to the server-side threads? Before we go into the details though, let&#8217;s use the debug information that the DataXceiver class logs when it is created and closed</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  LOG.debug(&quot;Number of active connections is: &quot; + datanode.getXceiverCount());<br />  &#8230;<br />  LOG.debug(datanode.dnRegistration + &quot;:Number of active connections is: &quot; + datanode.getXceiverCount());</p>
<p style="padding-top:12px">and monitor during a start of HBase what is logged on the DataNode. For simplicity&#8217;s sake this is done on a pseudo distributed setup with a single DataNode and RegionServer instance. The following shows the top of the RegionServer&#8217;s status page.</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverScreen1.png"><img class="alignnone size-full wp-image-13480" src="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverScreen1.png" alt="" width="545" height="294" /></a> </p>
<p>The important part is in the &#8220;Metrics&#8221; section, where it says &#8220;storefiles=22&#8221;. So, assuming that HBase has at least that many files to handle, plus some extra files for the write-ahead log, we should see the above log messages state that we have at least 22 &#8220;active connections&#8221;. Let&#8217;s start HBase and check the DataNode and RegionServer log files:</p>
<p>Command Line:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ bin/start-hbase.sh<br />&#8230;</p>
<p style="padding-top:12px">DataNode Log:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 13:01:35,309 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 1<br />2012-03-05 13:01:35,315 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 2<br />12/03/05 13:01:35 INFO regionserver.MemStoreFlusher: globalMemStoreLimit=396.7m, globalMemStoreLimitLowMark=347.1m, maxHeap=991.7m<br />12/03/05 13:01:39 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 60030<br />2012-03-05 13:01:40,003 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 1<br />12/03/05 13:01:40 INFO regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052<br />2012-03-05 13:01:40,882 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:40,884 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />2012-03-05 13:01:40,888 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />&#8230;<br />12/03/05 13:01:40 INFO regionserver.HRegion: Onlined -ROOT-,,0.70236052; next sequenceid=63083<br />2012-03-05 13:01:40,982 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:40,983 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request 
to open region: .META.,,1.1028785192<br />2012-03-05 13:01:41,026 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:41,027 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined .META.,,1.1028785192; next sequenceid=63082<br />2012-03-05 13:01:41,109 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:41,114 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 4<br />2012-03-05 13:01:41,117 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 5<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open 16 region(s)<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open region: usertable,,1330944810191.62a312d67981c86c42b6bc02e6ec7e3f.<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open region: usertable,user1120311784,1330944810191.90d287473fe223f0ddc137020efda25d.<br />&#8230;</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 13:01:41,246 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,248 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />&#8230;<br />2012-03-05 13:01:41,257 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 10<br />2012-03-05 13:01:41,257 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 9<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user1120311784,1330944810191.90d287473fe223f0ddc137020efda25d.; next sequenceid=62917<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,,1330944810191.62a312d67981c86c42b6bc02e6ec7e3f.; next sequenceid=62916<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user1361265841,1330944811370.80663fcf291e3ce00080599964f406ba.; next sequenceid=62919<br />2012-03-05 13:01:41,474 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,491 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />2012-03-05 13:01:41,495 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 8<br />2012-03-05 13:01:41,508 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined 
usertable,user1964968041,1330944848231.dd89596e9129e1caa7e07f8a491c9734.; next sequenceid=62920<br />2012-03-05 13:01:41,618 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,621 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />&#8230;<br />2012-03-05 13:01:41,829 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user515290649,1330944849739.d23924dc9e9d5891f332c337977af83d.; next sequenceid=62926<br />2012-03-05 13:01:41,832 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,838 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user757669512,1330944850808.cd0d6f16d8ae9cf0c9277f5d6c6c6b9f.; next sequenceid=62929<br />&#8230;<br />2012-03-05 14:01:39,711 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 4<br />2012-03-05 22:48:41,945 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />12/03/05 22:48:41 INFO regionserver.HRegion: Onlined usertable,user757669512,1330944850808.cd0d6f16d8ae9cf0c9277f5d6c6c6b9f.; next sequenceid=62929<br />2012-03-05 22:48:41,963 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4</p>
<p style="padding-top:12px">You can see how the regions are opened one after the other, but what you also might notice is that the number of active connections never climbs to 22 &#8211; it barely even reaches 10. Why is that? To understand this better, we have to see how files in HDFS map to the server-side DataXceiver instances &#8211; and the actual threads they represent. </p>
<h2>Hadoop Deep Dive</h2>
<p>The aforementioned DFSInputStream and DFSOutputStream are really facades around the usual stream concepts. They wrap the client-server communication into these standard Java interfaces, while internally routing the traffic to a selected DataNode &#8211; which is the one that holds a copy of the current block. The client library has the liberty to open and close these connections as needed. As a client reads a file in HDFS, the library classes switch transparently from block to block, and therefore from DataNode to DataNode, opening and closing connections along the way. </p>
<p>The DFSInputStream has an instance of a DFSClient.BlockReader class that opens the connection to the DataNode. The stream instance calls blockSeekTo() for every call to read(), which takes care of opening the connection if there is none already. Once a block is completely read, the connection is closed. Closing the stream has the same effect, of course. </p>
<p>The DFSOutputStream has a similar helper class, the DataStreamer. It tracks the connection to the server, which is initiated by the nextBlockOutputStream() method. It has further internal classes that help with writing the block data out, which we omit here for the sake of brevity.</p>
<p>Both writing and reading blocks require a thread to hold the socket and intermediate data on the server side, wrapped in the DataXceiver instance. Depending on what your client is doing, you will see the number of connections fluctuate around the number of currently accessed files in HDFS.</p>
<p>Back to the HBase riddle above: the reason you do not see up to 22 (and more) connections during the start is that while the regions open, the only required data is the HFile&#8217;s info block. This block is read to gain vital details about each file, but then closed again. This means that the server-side resource is released in quick succession. The remaining four connections are harder to determine. You can use JStack to dump all threads on the DataNode, which in this example shows this entry:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&quot;DataXceiver for client /127.0.0.1:64281 [sending block blk_5532741233443227208_4201]&quot; daemon prio=5 tid=7fb96481d000 nid=0x1178b4000 runnable [1178b3000]<br />   java.lang.Thread.State: RUNNABLE<br />   &#8230;</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&quot;DataXceiver for client /127.0.0.1:64172 [receiving block blk_-2005512129579433420_4199 client=DFSClient_hb_rs_10.0.0.29,60020,1330984111693_1330984118810]&quot; daemon prio=5 tid=7fb966109000 nid=0x1169cb000 runnable [1169ca000]<br />   java.lang.Thread.State: RUNNABLE<br />   &#8230;</p>
<p style="padding-top:12px">These are the only DataXceiver entries (in this example), so the count in the thread group is a bit misleading. Recall that the DataXceiverServer daemon thread already accounts for one extra entry, which combined with the two above accounts for the three active connections &#8211; which in fact means three active threads. The reason the log states four instead is that it logs the count from an active thread that is about to finish. So, shortly after the count of four is logged, it is actually one less, i.e. three, and hence matches our head count of active threads.</p>
<p>Also note that the internal helper classes, such as the PacketResponder, occupy another thread in the group while being active. The JStack output does indicate that fact, listing the thread as such:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;"> &quot;PacketResponder 0 for Block blk_-2005512129579433420_4199&quot; daemon prio=5 tid=7fb96384d000 nid=0x116ace000 in Object.wait() [116acd000]<br />   java.lang.Thread.State: TIMED_WAITING (on object monitor)<br />     at java.lang.Object.wait(Native Method)<br />     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder \<br />       .lastDataNodeRun(BlockReceiver.java:779)<br />     - locked &lt;7bc79c030&gt; (a org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder)<br />     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)<br />     at java.lang.Thread.run(Thread.java:680)</p>
<p style="padding-top:12px">This thread is currently in TIMED_WAITING state and is not considered active. That is why the count emitted by the DataXceiver log statements does not include this kind of thread. If it becomes active due to the client sending data, the active thread count will go up again. Another thing to note is that this thread does not need a separate connection, or socket, between the client and the server. The PacketResponder is just a thread on the server side that relays acknowledgements from the downstream DataNodes in the write pipeline back towards the client.</p>
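<p>Counting these threads by hand gets tedious on a busy DataNode. As a rough sketch, assuming the thread dump has been saved as text and that thread header lines begin with the quoted thread name as in the JStack excerpts above, the entries can be tallied by name prefix. The class and method names are made up for this example.</p>

```java
public class XceiverThreadCount {
    /** Counts thread-dump header lines whose quoted thread name starts with the given prefix. */
    public static int countThreads(String dump, String namePrefix) {
        int count = 0;
        for (String line : dump.split("\n")) {
            // A thread header line looks like: "name" daemon prio=... tid=...
            if (line.startsWith("\"" + namePrefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened header lines modeled on the JStack output shown above.
        String dump =
              "\"DataXceiver for client /127.0.0.1:64281 [sending block blk_5532741233443227208_4201]\" daemon\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\"DataXceiver for client /127.0.0.1:64172 [receiving block blk_-2005512129579433420_4199]\" daemon\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\"PacketResponder 0 for Block blk_-2005512129579433420_4199\" daemon\n"
            + "   java.lang.Thread.State: TIMED_WAITING (on object monitor)\n";
        System.out.println("DataXceiver threads: " + countThreads(dump, "DataXceiver for client"));
        System.out.println("PacketResponder threads: " + countThreads(dump, "PacketResponder"));
    }
}
```

<p>For the shortened dump in main() this reports two DataXceiver threads and one PacketResponder, the same head count as in the discussion above.</p>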
<p>The Hadoop fsck command also has an option to report what files are currently open for writing:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ hadoop fsck /hbase -openforwrite<br />FSCK started by larsgeorge from /10.0.0.29 for path /hbase at Mon Mar 05 22:59:47 CET 2012<br />&#8230;&#8230;/hbase/.logs/10.0.0.29,60020,1330984111693/10.0.0.29%3A60020.1330984118842 0 bytes, 1 block(s), OPENFORWRITE: &#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;..Status: HEALTHY<br /> Total size:     2088783626 B<br /> Total dirs:     54<br /> Total files:    45<br /> &#8230;</p>
<p>This does not immediately relate to an occupied server-side thread, as these are allocated by block ID. But you can glean from it that there is one open block for writing. The Hadoop command has additional options to print out the actual files and the IDs of the blocks they consist of:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ hadoop fsck /hbase -files -blocks<br />FSCK started by larsgeorge from /10.0.0.29 for path /hbase at Tue Mar 06 10:39:50 CET 2012</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&#8230;<br />/hbase/.META./1028785192/.tmp &lt;dir&gt;<br />/hbase/.META./1028785192/info &lt;dir&gt;<br />/hbase/.META./1028785192/info/4027596949915293355 36517 bytes, 1 block(s):  OK<br />0. blk_5532741233443227208_4201 len=36517 repl=1</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&#8230;<br />Status: HEALTHY<br /> Total size:     2088788703 B<br /> Total dirs:     54<br /> Total files:     45 (Files currently being written: 1)<br /> Total blocks (validated):     64 (avg. block size 32637323 B) (Total open file blocks (not validated): 1)<br /> Minimally replicated blocks:     64 (100.0 %)<br /> &#8230;</p>
<p style="padding-top:12px">This gives you two things. First, the summary states that there is one open file block at the time the command ran &#8211; matching the count reported by the &#8220;-openforwrite&#8221; option above. Secondly, the list of blocks next to each file lets you match the thread name to the file that contains the block being accessed. In this example the block with the ID &#8220;blk_5532741233443227208_4201&#8221; is sent from the server to the client, here a RegionServer. This block belongs to the HBase .META. table, as shown by the output of the Hadoop fsck command. The combination of JStack and fsck can serve as a poor man&#8217;s replacement for lsof (a tool on the Linux command line to &#8220;list open files&#8221;).</p>
<p>The JStack also reports that there is a DataXceiver thread, with an accompanying PacketResponder, for block ID &#8220;blk_-2005512129579433420_4199&#8221;, but this ID is missing from the list of blocks reported by fsck. This is because the block is not yet finished and therefore not available to readers. In other words, Hadoop fsck only reports on complete (or synced[7][8], for Hadoop versions that support this feature) blocks. </p>
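<p>The matching of thread names to files can be automated along these lines. This is only a sketch under a few assumptions: the fsck output has been captured as text, file lines start with the path, each block line carries one block ID, and block IDs follow the &#8220;blk_&#8221; pattern seen above. The class and method names are hypothetical.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockToFile {
    // Block IDs as they appear in thread names and fsck output,
    // for example blk_-2005512129579433420_4199.
    private static final Pattern BLOCK_ID = Pattern.compile("blk_-?\\d+_\\d+");

    /** Extracts the first block ID from a thread name, or null if there is none. */
    public static String blockIdOf(String threadName) {
        Matcher m = BLOCK_ID.matcher(threadName);
        return m.find() ? m.group() : null;
    }

    /** Returns the path of the file whose block list contains blockId, or null if none does. */
    public static String findFile(String fsckOutput, String blockId) {
        String currentFile = null;
        for (String line : fsckOutput.split("\n")) {
            if (line.startsWith("/")) {
                // File lines start with the path, followed by a space and the size.
                currentFile = line.split(" ")[0];
            } else {
                Matcher m = BLOCK_ID.matcher(line);
                if (m.find() && m.group().equals(blockId)) {
                    return currentFile;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two lines taken from the fsck output shown above.
        String fsck =
              "/hbase/.META./1028785192/info/4027596949915293355 36517 bytes, 1 block(s):  OK\n"
            + "0. blk_5532741233443227208_4201 len=36517 repl=1\n";
        String sending = "DataXceiver for client /127.0.0.1:64281 [sending block blk_5532741233443227208_4201]";
        String receiving = "DataXceiver for client /127.0.0.1:64172 [receiving block blk_-2005512129579433420_4199]";
        System.out.println(findFile(fsck, blockIdOf(sending)));
        System.out.println(findFile(fsck, blockIdOf(receiving)));
    }
}
```

<p>The first lookup resolves to the .META. store file, while the second returns null, mirroring the observation that the block still being written does not appear in the fsck listing.</p>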
<h2>Back to HBase</h2>
<p>Opening all the regions does not need as many resources on the server as you would have expected. If you scan the entire HBase table though, you force HBase to read all of the blocks in all HFiles: </p>
<p>HBase Shell:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">hbase(main):003:0&gt; scan 'usertable'<br />&#8230;<br />1000000 row(s) in 1460.3120 seconds</p>
<p style="padding-top:12px">DataNode Log:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 14:42:20,580 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 14:43:23,293 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />2012-03-05 14:43:23,299 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 8<br />&#8230;<br />2012-03-05 14:49:24,332 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 11<br />2012-03-05 14:49:24,332 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 10<br />2012-03-05 14:49:59,987 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 11<br />2012-03-05 14:51:12,603 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 12<br />2012-03-05 14:51:12,605 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 11<br />2012-03-05 14:51:46,473 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 12<br />&#8230;<br />2012-03-05 14:56:59,420 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 15<br />2012-03-05 14:57:31,722 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 16<br />2012-03-05 14:58:24,909 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, 
ipcPort=50020):Number of active connections is: 17<br />2012-03-05 14:58:24,910 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 16<br />&#8230;<br />2012-03-05 15:04:17,688 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21<br />2012-03-05 15:04:17,689 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 22<br />2012-03-05 15:04:54,545 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21<br />2012-03-05 15:05:55,901 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 22<br />2012-03-05 15:05:55,901 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21</p>
<p style="padding-top:12px">The number of active connections reaches the elusive 22 now. Note that this count already includes the server thread, so we are still a little short of what we could consider the theoretical maximum &#8211; based on the number of files HBase has to handle.</p>
<h2>What does that all mean?</h2>
<p>So, how many &#8220;xcievers (sic)&#8221; do you need? Given you only use HBase, you could simply monitor the above &#8220;storefiles&#8221; metric (which you can also get through Ganglia or JMX) and add a few percent for intermediate and write-ahead log files. This should work for systems in motion. However, if you were to determine that number on an idle, fully compacted system and assume it is the maximum, you might find this number to be too low once you start adding more store files during regular memstore flushes, i.e. as soon as you start to add data to the HBase tables. The same applies if you also use MapReduce on that same cluster, Flume log aggregation, and so on. You will need to account for those extra files, and, more importantly, open blocks for reading and writing. </p>
<p>Note again that the examples in this post are using a single DataNode, something you will not have on a real cluster. To that end, you will have to divide the total number of store files (as per the HBase metric) by the number of DataNodes you have. If you have, for example, a store file count of 1000, and your cluster has 10 DataNodes, then you should be OK with the default of 256 xceiver threads per DataNode.</p>
<p>The worst case would be the number of all active readers and writers, i.e. those that are currently sending or receiving data. But since this is hard to determine ahead of time, you might want to consider building in a decent reserve. Also, since the writing process needs an extra &#8211; although shorter lived &#8211; thread (for the PacketResponder) you have to account for that as well. So a reasonable, but rather simplistic formula could be:</p>
<p> <a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula1.png"><img class="alignnone  wp-image-13479" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula1.png" alt="" width="433" height="47" /></a></p>
<p>This formula takes into account that you need about two threads for an active writer and another for an active reader. This is then summed up and divided by the number of DataNodes, since you have to specify the &#8220;dfs.datanode.max.xcievers&#8221; per DataNode.</p>
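<p>Expressed in code, the description above might look like the following sketch. It follows the prose (two threads per active writer, one per active reader, divided by the number of DataNodes) and rounds up; the class name and the sample numbers are hypothetical, and the formula image remains the reference.</p>

```java
public class ActiveXceiverEstimate {
    /** (activeWriters * 2 + activeReaders) / dataNodes, rounded up. */
    public static int estimate(int activeWriters, int activeReaders, int dataNodes) {
        int threads = activeWriters * 2 + activeReaders;
        return (threads + dataNodes - 1) / dataNodes;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 50 active writers, 400 active readers, 10 DataNodes.
        System.out.println(estimate(50, 400, 10));
    }
}
```

<p>The division assumes that readers and writers are spread evenly across the DataNodes, which is another reason to build in a reserve.</p>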
<p>If you loop back to the HBase RegionServer screenshot above, you saw that there were 22 store files. These are immutable and will only be read, or in other words occupy one thread only. For all memstores that are flushed to disk you need two threads &#8211; but only until they are fully written. The files are then finalized and closed for good, cleaning up any thread in the process. So these come and go based on your flush frequency. The same goes for compactions: they will read N files and write them into a single new one, then finalize the new file. As for the write-ahead logs, these will occupy a thread once you have started to add data to any table. There is a log file per server, meaning that you can only have twice as many active threads for these files as you have RegionServers.</p>
<p>For a pure HBase setup (HBase plus its own HDFS, with no other user), we can estimate the number of needed DataXceivers with the following formula:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula2.png"><img class="alignnone  wp-image-13478" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula2.png" alt="" width="782" height="47" /></a></p>
<p>Since you will be hard pressed to determine the <em>active</em> number of store files, flushes, and so on, it might be better to estimate the theoretical maximum instead. This maximum value takes into account that you can only have a single flush and compaction active per region at any time. The maximum number of logs you can have active matches the number of RegionServers, leading us to this formula:</p>
<p>  <a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula31.png"><img class="alignnone  wp-image-13572" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula31.png" alt="" width="581" height="49" /></a></p>
<p>Obviously, the number of store files will increase over time, and the number of regions typically as well. The same goes for the number of servers, so remember to adjust this value over time. In practice, you can add a buffer of, for example, 20%, as shown in the formula below &#8211; in an attempt to not force you to change the value too often. </p>
<p>On the other hand, if you keep the number of regions fixed per server[9], and rather split them manually, while adding new servers as you grow, you should be able to keep this configuration property stable for each server.</p>
<h2>Final Advice &amp; TL;DR</h2>
<p>Here is the final formula you want to use:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverFormula4.png"><img class="alignnone  wp-image-13570" src="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverFormula4.png" alt="" width="611" height="47" /></a></p>
<p>It computes the maximum number of threads needed, based on your current HBase vitals (no. of store files, regions, and region servers). It also adds a fudge factor of 20% to give you room for growth. Keep an eye on the numbers on a regular basis and adjust the value as needed. You might want to use Nagios with appropriate checks to warn you when any of the vitals goes over a certain percentage of change.</p>
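<p>For readers who prefer code to formula images, here is one way the calculation could be written down. Note that this is reconstructed from the prose of this post rather than copied from the image, so treat it as an assumption: one thread per store file, four per region (one flush and one compaction, two threads each), two per RegionServer for its log, divided by the number of DataNodes, plus the 20% buffer. The class name and the second set of sample numbers are hypothetical.</p>

```java
public class MaxXceiverEstimate {
    /**
     * Reconstructed from the prose, not copied from the formula image:
     * one thread per store file, four per region (one flush and one
     * compaction, two threads each), two per RegionServer for its log,
     * divided by the number of DataNodes, plus a 20% buffer, rounded up.
     */
    public static int estimate(int storeFiles, int regions, int regionServers, int dataNodes) {
        int threads = storeFiles + regions * 4 + regionServers * 2;
        // Integer form of ceil(threads / dataNodes * 1.2), avoiding floating-point rounding.
        int divisor = dataNodes * 10;
        return (threads * 12 + divisor - 1) / divisor;
    }

    public static void main(String[] args) {
        // 22 store files from the screenshot; 18 regions assumed from the startup
        // log (-ROOT-, .META., and 16 user regions); one RegionServer, one DataNode.
        System.out.println(estimate(22, 18, 1, 1));
        // A hypothetical larger cluster.
        System.out.println(estimate(1000, 200, 10, 10));
    }
}
```

<p>Rounding up errs on the side of a slightly higher limit, in keeping with the idea of leaving room for growth.</p>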
<p>Note: Please make sure you also adjust the number of file handles your process is allowed to use accordingly[10]. This affects the number of sockets you can use, and if that number is too low (default is often 1024), you will get connection issues first. </p>
<p>Finally, the engineering devil on one of your shoulders should already have started to snicker about how horribly non-Erlang-y this is, and how you should use an event driven approach, possibly using Akka with Scala[11] &#8211; if you want to stay within the JVM world. Bear in mind though that the clever developers in the community share the same thoughts and have already started to discuss various approaches[12][13]. </p>
<h2>Links:</h2>
<ul>
<li>[1] <a href="http://old.nabble.com/Re%3A-xceiverCount-257-exceeds-the-limit-of-concurrent-xcievers-256-p20469958.html">http://old.nabble.com/Re%3A-xceiverCount-257-exceeds-the-limit-of-concurrent-xcievers-256-p20469958.html</a></li>
<li>[2] <a href="http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html">http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html</a></li>
<li>[3] <a href="https://issues.apache.org/jira/browse/HDFS-1861">https://issues.apache.org/jira/browse/HDFS-1861</a> &#8220;Rename dfs.datanode.max.xcievers and bump its default value&#8221;</li>
<li>[4] <a href="https://issues.apache.org/jira/browse/HDFS-1866">https://issues.apache.org/jira/browse/HDFS-1866</a> &#8220;Document dfs.datanode.max.transfer.threads in hdfs-default.xml&#8221;</li>
<li>[5] <a href="http://hbase.apache.org/book.html#dfs.datanode.max.xcievers">http://hbase.apache.org/book.html#dfs.datanode.max.xcievers</a></li>
<li>[6] <a href="http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom">http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom</a></li>
<li>[7] <a href="https://issues.apache.org/jira/browse/HDFS-200">https://issues.apache.org/jira/browse/HDFS-200</a> &#8220;In HDFS, sync() not yet guarantees data available to the new readers&#8221;</li>
<li>[8] <a href="https://issues.apache.org/jira/browse/HDFS-265">https://issues.apache.org/jira/browse/HDFS-265</a> &#8220;Revisit append&#8221;</li>
<li>[9] <a href="http://search-hadoop.com/m/CBBoV3z24H1">http://search-hadoop.com/m/CBBoV3z24H1</a> &#8220;HBase, mail # user &#8211; region size/count per regionserver&#8221;</li>
<li>[10] <a href="http://hbase.apache.org/book.html#ulimit">http://hbase.apache.org/book.html#ulimit</a> &#8220;ulimit and nproc&#8221;</li>
<li>[11] <a href="http://akka.io/">http://akka.io/</a> &#8220;Akka&#8221;</li>
<li>[12] <a href="https://issues.apache.org/jira/browse/HDFS-223">https://issues.apache.org/jira/browse/HDFS-223</a> &#8220;Asynchronous IO Handling in Hadoop and HDFS&#8221;</li>
<li>[13] <a href="https://issues.apache.org/jira/browse/HDFS-918">https://issues.apache.org/jira/browse/HDFS-918</a> &#8220;Use single Selector and small thread pool to replace many instances of BlockSender for reads&#8221;</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

