<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; Blog</title>
	<atom:link href="http://www.cloudera.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Apache MRUnit Is Now A Top Level Project</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/#comments</comments>
		<pubDate>Thu, 24 May 2012 17:00:24 +0000</pubDate>
		<dc:creator>Brock Noland</dc:creator>
				<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Test MapReduce]]></category>
		<category><![CDATA[Unit Testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=15124</guid>
		<description><![CDATA[This post was originally posted to the Apache Software Foundation MRUnit blog. The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was originally posted to the <a href="https://blogs.apache.org/mrunit/entry/apache_mrunit_is_now_a" target="_blank">Apache Software Foundation MRUnit blog</a>.</em></p>
<p>The Apache MRUnit team has graduated from the Apache Incubator to an Apache TLP (Top Level Project)! MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they&#39;re deployed to a production system.</p>
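<p>As a rough illustration of the idea, here is a minimal sketch of unit testing the two halves of a word-count job. It is plain Python rather than MRUnit's Java API, and the names word_count_map and word_count_reduce are hypothetical, chosen only for this example. Each test feeds a small, known input to one function and asserts on exactly what it emits:</p>

```python
def word_count_map(offset, line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    """Sum all the counts emitted for a single word."""
    return (word, sum(counts))


def test_map_emits_one_pair_per_word():
    emitted = word_count_map(0, "Hadoop hadoop HBase")
    assert emitted == [("hadoop", 1), ("hadoop", 1), ("hbase", 1)]


def test_reduce_sums_counts():
    assert word_count_reduce("hadoop", [1, 1]) == ("hadoop", 2)


if __name__ == "__main__":
    # Run the tests directly; a test runner would discover them by name.
    test_map_emits_one_pair_per_word()
    test_reduce_sums_counts()
```

<p>Because the functions are exercised in isolation, a defect such as forgetting to normalize case shows up in seconds on a developer machine instead of after a job has run on a cluster.</p>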
<p>At its monthly meeting in May 2012, the board of the Apache Software Foundation (ASF) resolved to grant Top-Level Project status to Apache MRUnit, thus graduating it from the Incubator. This is a significant milestone in the life of MRUnit, which has come a long way since its inception as a Hadoop contrib project in <a href="https://issues.apache.org/jira/browse/HADOOP-5518" target="_blank">HADOOP-5518</a>, contributed by Aaron Kimball.</p>
<ul style="padding-left:12px">
<li>May 2012 MRUnit graduates from the Incubator to become a TLP</li>
<li>May 2012 Version 0.9.0-incubating released.</li>
<li>April 2012 Dave Beech added as a new committer.</li>
<li>April 2012 Jarek Jarcec Cecho added as a new committer.</li>
<li>April 2012 New website created using the CMS.</li>
<li>March 2012 Version 0.8.1-incubating released.</li>
<li>March 2012 Jim Donofrio added as a new committer.</li>
<li>February 2012 Version 0.8.0-incubating released.</li>
<li>November 2011 Version 0.5.0-incubating released.</li>
<li>October 2011 Brock Noland added as a new committer.</li>
<li>March 2011 Project enters incubation.</li>
<li>April 2009 Doug Cutting commits Aaron&#39;s patch to Hadoop</li>
<li>March 2009 Aaron Kimball contributes MRUnit to Hadoop as a contrib project</li>
</ul>
<p>Below is the graduation resolution:</p>
<pre class="code">X. Establish the Apache MRUnit Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation&#39;s purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to unit testing Apache Hadoop map
reduce jobs for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the &quot;Apache MRUnit Project&quot;,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache MRUnit Project be and hereby is
responsible for the creation and maintenance of software
related to unit testing Apache Hadoop map reduce jobs;
and be it further

RESOLVED, that the office of &quot;Vice President, Apache MRUnit&quot; be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache MRUnit Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache MRUnit Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache MRUnit Project:

* Brock Noland - brock@apache.org
* Patrick Hunt - phunt@apache.org
* Nigel Daley - nigel@apache.org
* Eric Sammer - esammer@apache.org
* Aaron Kimball - kimballa@apache.org
* Konstantin Boudnik - cos@apache.org
* Garrett Wu - gwu@apache.org
* Jim Donofrio - jdonofrio@apache.org
* Jarek Jarcec Cecho - jarcec@apache.org
* Dave Beech - dbeech@apache.org

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Brock Noland
be appointed to the office of Vice President, Apache MRUnit, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed; and be it further

RESOLVED, that the initial Apache MRUnit PMC be and hereby is
tasked with the creation of a set of bylaws intended to
encourage open development and increased participation in the
Apache MRUnit Project; and be it further

RESOLVED, that the Apache MRUnit Project be and hereby
is tasked with the migration and rationalization of the Apache
Incubator MRUnit podling; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Incubator MRUnit podling encumbered upon the Apache Incubator
Project are hereafter discharged.
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-mrunit-is-now-a-top-level-project/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache HBase 0.94 is now released</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/#comments</comments>
		<pubDate>Wed, 16 May 2012 16:58:52 +0000</pubDate>
		<dc:creator>Himanshu Vashishtha</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase features]]></category>
		<category><![CDATA[HBase release]]></category>
		<category><![CDATA[HBase Update]]></category>
		<category><![CDATA[Real-time Hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14484</guid>
		<description><![CDATA[Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focuses were on performance enhancements and the addition of new features (Also, several major bug fixes). Performance Related JIRAs Below are a few of the important performance related JIRAs: [...]]]></description>
			<content:encoded><![CDATA[<p>Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. The main focus of the HBase 0.94.0 release was on performance enhancements and new features, along with several major bug fixes.</p>
<h2>Performance Related JIRAs</h2>
<p>Below are a few of the important performance related JIRAs:</p>
<ul>
<li title="HBASE-5074"><strong>Read Caching improvements:</strong> HDFS stores data in one block file and its corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. <a title="HBASE-5074" href="https://issues.apache.org/jira/browse/HBASE-5074">HBASE-5074</a>: &#8220;Support checksums in HBase block cache&#8221; adds a block-level checksum to the HFile itself in order to avoid one disk op, boosting read performance. This feature is <em>enabled</em> by default.</li>
<li><strong>Seek optimizations:</strong> Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each such file and merge the results, even if the row/column being looked up was in the most recent file. <a title="HBase-4465" href="https://issues.apache.org/jira/browse/HBASE-4465" target="_blank">HBASE-4465</a>: &#8220;Lazy Seek optimization of StoreFile Scanners&#8221; optimizes scanner reads to read the <em>most recent</em> StoreFile first by <em>lazily seeking</em> the StoreFiles. This is achieved by introducing a fake KeyValue with its timestamp equal to the maximum timestamp present in the particular StoreFile. Thus, a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up the heap, implying a need to do a real read operation. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is <em>enabled</em> by default.</li>
<li><strong>Write to WAL optimizations: </strong>HBase write throughput is bounded above by the write rate of the WAL (write-ahead log), which is replicated to a number of datanodes depending on the replication factor. <a title="HBase-4608" href="https://issues.apache.org/jira/browse/HBASE-4608" target="_blank">HBASE-4608</a>: &#8220;HLog Compression&#8221; adds custom dictionary-based compression of HLogs for faster replication across HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is <em>off</em> by default.</li>
</ul>
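<p>The lazy seek idea described above can be sketched in a few lines. This is a simplified, in-memory illustration in Python, not HBase's actual implementation: the StoreFile class, the lazy_get_latest function, and the dictionary-backed files are hypothetical stand-ins, and a call to seek() merely counts what would be a disk seek. Each file enters the heap as a fake entry carrying its maximum timestamp, which is an upper bound on anything the file can contain, and a real seek happens only when that entry reaches the top of the heap:</p>

```python
import heapq

REAL, FAKE = 0, 1  # at equal timestamps, real entries sort ahead of fake ones


class StoreFile:
    """In-memory stand-in for a StoreFile that counts real seeks."""

    def __init__(self, name, cells):
        self.name = name
        self.cells = cells  # {(row, column): (timestamp, value)}
        self.max_timestamp = max(ts for ts, _ in cells.values())
        self.seeks = 0

    def seek(self, key):
        self.seeks += 1  # stands in for an expensive disk seek
        return self.cells.get(key)


def lazy_get_latest(store_files, key):
    """Return (timestamp, value) of the newest cell for key, or None."""
    # Seed the heap with fake entries; negated timestamps give newest-first order.
    heap = [(-sf.max_timestamp, FAKE, i, None) for i, sf in enumerate(store_files)]
    heapq.heapify(heap)
    while heap:
        neg_ts, kind, i, value = heapq.heappop(heap)
        if kind == REAL:
            # No remaining file can hold a newer cell, so this is the answer.
            return (-neg_ts, value)
        cell = store_files[i].seek(key)  # the fake entry bubbled up: seek for real
        if cell is not None:
            ts, value = cell
            heapq.heappush(heap, (-ts, REAL, i, value))
    return None


files = [
    StoreFile("old", {("r1", "c"): (100, "v1")}),
    StoreFile("mid", {("r1", "c"): (150, "v2"), ("r2", "c"): (200, "x")}),
    StoreFile("new", {("r1", "c"): (300, "v3")}),
]
latest = lazy_get_latest(files, ("r1", "c"))
seeks = [sf.seeks for sf in files]
```

<p>In this example only the newest file is actually read: once its real cell is on top of the heap, the fake entries of the two older files prove that they cannot contain anything more recent, so their seeks are skipped.</p>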
<h2>New Feature Related JIRAs</h2>
<p>Here is a list of some of the important JIRAs related to adding new features:</p>
<ul>
<li><strong>More powerful first aid box:</strong> The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. <a href="https://issues.apache.org/jira/browse/HBASE-5128" target="_blank">HBASE-5128: &#8220;Uber hbck&#8221;</a> adds these missing features to the first aid box.</li>
<li><strong>Simplified Region Sizing:</strong> Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, workload, etc. <a title="HBase-4365" href="https://issues.apache.org/jira/browse/HBASE-4365" target="_blank">HBASE-4365</a>: &#8220;Heuristic for Region size&#8221; adds a heuristic that increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.</li>
<li><strong>Smarter transaction semantics: </strong>Though HBase supports single-row transactions, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. <a title="HBase-3584" href="https://issues.apache.org/jira/browse/HBASE-3584" target="_blank">HBASE-3584</a>: &#8220;Atomic Put &amp; Delete in a single transaction&#8221; enhances the HBase single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is <em>on</em> by default.</li>
</ul>
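<p>To make the region sizing heuristic more concrete, here is a small sketch of a split threshold that grows with the table. The post does not give the formula, so everything specific here is an assumption for illustration: the threshold is taken to be the memstore flush size multiplied by the square of the number of regions the table has on a region server, capped at a configured maximum region size. The function name, parameter names, and the default values (128 MB and 10 GB) are hypothetical.</p>

```python
MB = 1024 * 1024


def split_size_threshold(region_count, flush_size=128 * MB, max_file_size=10 * 1024 * MB):
    """Return the region size (in bytes) at which a split would be triggered.

    Assumed heuristic: flush_size * region_count squared, capped at max_file_size.
    """
    if region_count == 0:
        # Nothing is known about the table yet, so fall back to the cap.
        return max_file_size
    return min(max_file_size, flush_size * region_count ** 2)


# Thresholds in MB for a table that grows from 1 to 9 regions on a server.
thresholds_mb = [split_size_threshold(n) // MB for n in (1, 2, 3, 9)]
```

<p>Under these assumptions a new table splits early, at 128 MB, so that load spreads across the cluster quickly, while a table that already has many regions splits only at the configured maximum, which keeps the total number of splits down.</p>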
<p>This major release has a number of new features and bug fixes: a total of <a title="397 resolved jiras" href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;jqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.94.0%22+AND+resolution+%3D+Fixed+ORDER+BY+priority+DESC&amp;mode=hide" target="_blank">397 resolved JIRAs</a>, including 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up an opportunity to backport some of the new features to CDH4, which is based on the 0.92 branch.</p>
<h2>Acknowledgements</h2>
<p>Thanks to everyone who contributed to this release and a hat tip to Lars Hofhansl of Salesforce for being the release manager.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Meet the Presenter: Todd Lipcon</title>
		<link>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/</link>
		<comments>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/#comments</comments>
		<pubDate>Mon, 14 May 2012 17:44:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[Hadoop Summit]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14877</guid>
		<description><![CDATA[Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit. Question: Tell us about your current role and how you interact with Apache Hadoop? Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open [...]]]></description>
			<content:encoded><![CDATA[<p>Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting <a href="http://hadoopsummit.org/program/#session32" target="_blank"><em>Optimizing MapReduce Job Performance</em></a> at Hadoop Summit.</p>
<h2>Question: Tell us about your current role and how you interact with Apache Hadoop?</h2>
<p><strong>Todd:</strong> I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.</p>
<h2>Question: Tell us about your Hadoop Summit presentation?</h2>
<p><strong>Todd:</strong> At this year’s summit, I will be presenting about the internals of MapReduce and how you can tune your MapReduce jobs for optimal performance. A lot of developers see MapReduce as a black box, but looking inside that box can help you understand where you might have bottlenecks or easy opportunities to improve performance by changing a few configuration parameters.</p>
<h2>Question: What do you expect will be the key takeaway for folks attending your session?</h2>
<p><strong>Todd:</strong> I hope attendees will walk away with a better understanding of each of the phases of MapReduce task execution, and a few key configuration parameters they can play with to get better performance without changing their code.</p>
<h2>Question: What other presentations are you most looking forward to attending?</h2>
<p><strong>Todd:</strong> I’m really looking forward to Josh Wills’ talk on BranchReduce: Distributed Branch-and-Bound on YARN. There are a lot of optimization problems which can be solved by branch-and-bound approaches, and it’s only recently with the introduction of YARN that these types of algorithms can be efficiently built on Hadoop. Not only is this a fresh topic, but Josh is also an entertaining speaker!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Manager 4.0 Beta released</title>
		<link>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/#comments</comments>
		<pubDate>Mon, 14 May 2012 13:00:12 +0000</pubDate>
		<dc:creator>Aparna Ramani</dc:creator>
				<category><![CDATA[cloudera manager]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14875</guid>
		<description><![CDATA[We&#8217;re happy to announce the Beta release of Cloudera Manager 4.0.  This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition. Please try it out and send your comments to beta@cloudera.com. As always, we look forward to your feedback. ]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re happy to announce the Beta release of Cloudera Manager 4.0. </p>
<p>This version of Cloudera Manager includes support for <a title="Introducing CDH4 Beta 2" href="http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/" target="_blank">CDH4 Beta2</a> and several new features for both the <a title="Free edition" href="https://ccp.cloudera.com/display/FREE400BETA/New+Features+in+Cloudera+Manager+Free+Edition+4.0" target="_blank">Free edition</a> and the <a title="Enterprise edition" href="https://ccp.cloudera.com/display/ENT400BETA/New+Features+in+Cloudera+Manager+4.0" target="_blank">Enterprise edition</a>.</p>
<p>Please <a title="try it out" href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank">try it out</a> and send your comments to beta@cloudera.com. As always, we look forward to your feedback. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CDH3 update 4 is now available</title>
		<link>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/</link>
		<comments>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/#comments</comments>
		<pubDate>Wed, 09 May 2012 22:13:01 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[cloudera hadoop distribution]]></category>
		<category><![CDATA[hadoop distribuition]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14723</guid>
		<description><![CDATA[We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements. First, there have been a few notable HBase updates. In this release, we&#8217;ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some [...]]]></description>
			<content:encoded><![CDATA[<p>We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.</p>
<p>First, there have been a few notable HBase updates. In this release, we&#8217;ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see <a title="CDH3 update 4 Known Issues and Workarounds" href="https://ccp.cloudera.com/display/CDHDOC/Known+Issues+and+Work+Arounds+in+CDH3" target="_blank">the CDH3 Known Issues and Workarounds page</a> for details.</p>
<p>In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) &#8211; version 1.1.0. A detailed description of what it brings to the table is found <a title="Flume NG blog post" href="http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/" target="_blank">in a previous Cloudera blog post describing its architecture</a>. Please note that we will continue to ship Flume 0.9.4 as well.</p>
<p>More information about how to download or upgrade to CDH3 update 4 can be found <a title="CDH packaging information" href="https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information" target="_blank">in the CDH packaging information webpage</a>.  The patches and JIRAs for update 4 are described in the changes files that can be found <a title=" CDH3 archives" href="http://archive.cloudera.com/cdh/3/" target="_blank">in the CDH3 downloads area</a>.  Additional details are available in the <a title="CDH release notes" href="https://ccp.cloudera.com/display/CDHDOC/New+Features+in+CDH3" target="_blank">CDH3 release notes</a>.</p>
<p>Feedback is always welcome, so please email your thoughts and suggestions to cdh-user@cloudera.com.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Meet the Presenters: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks</title>
		<link>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/</link>
		<comments>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/#comments</comments>
		<pubDate>Tue, 08 May 2012 01:05:24 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[HDFS]]></category>
		<category><![CDATA[HA HDFS]]></category>
		<category><![CDATA[Hadoop Distributed File System]]></category>
		<category><![CDATA[HDFS NameNode]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14766</guid>
		<description><![CDATA[This was originally posted on the Hadoop Summit 2012 blog. Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on HDFS NameNode High Availability, one of the hottest topics in the Apache Hadoop space today. Question: Tell us about your current role and [...]]]></description>
			<content:encoded><![CDATA[<p><em>This was originally posted on the Hadoop Summit 2012 <a href="http://hadoopsummit.org/blog/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/" target="_blank">blog</a></em>.</p>
<p>Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session53" target="_blank">HDFS NameNode High Availability</a>, one of the hottest topics in the Apache Hadoop space today.</p>
<h2>Question: Tell us about your current role and how you interact with Apache Hadoop?</h2>
<p><strong>Aaron: </strong>I work full-time developing Hadoop and supporting Hadoop’s many users. My efforts are primarily focused on HDFS and Hadoop’s security infrastructure.</p>
<p><strong>Suresh: </strong>I have been working on Hadoop for about 4 years. Currently I am on HDFS full-time, with focus on improving reliability, scalability and developing enterprise features. I also work on expanding Apache Hadoop APIs and interfaces to enable new use cases and simplify integration of other solutions with HDFS.</p>
<h2>Question: Tell us about your Hadoop Summit presentation?</h2>
<p><strong>Suresh:</strong> The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo! and other organizations. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (<a title="Apache Hadoop HDFS" href="https://issues.apache.org/jira/browse/HDFS-1623" target="_blank">HDFS-1623</a>). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.</p>
<h2>Question: What do you expect will be the key takeaway for folks attending your session? </h2>
<p><strong>Aaron:</strong> Because we will be sharing best practices and architectural details, we expect attendees to walk away with a good understanding of what’s required to deploy and operate a highly available HDFS NameNode.</p>
<h2>Question: What are you most looking forward to at Hadoop Summit?</h2>
<p><strong>Aaron: </strong>Chatting in-person with the Hadoop developers and other community members who I interact with frequently, but don’t get to see often.</p>
<p><strong>Suresh:</strong> Interacting with the community and learning from their Hadoop experiences. I’m also interested in getting feedback on things we can improve and important new features desired by the community.</p>
<h2>Question: What other presentations are you most looking forward to attending?</h2>
<p><strong>Aaron:</strong></p>
<ul>
<li> <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session29" target="_blank">I accidentally the Namenode: Hadoop Distributed Filesystem Reliability and Durability at Facebook</a> by Andrew Ryan</li>
<li><a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session32" target="_blank">Optimizing MapReduce Job Performance</a> by Todd Lipcon</li>
</ul>
<p><strong> Suresh:</strong></p>
<ul>
<li>Like Aaron,  <a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session29" target="_blank">I accidentally the Namenode: Hadoop Distributed Filesystem Reliability and Durability at Facebook</a> by Andrew Ryan</li>
<li><a title="Apache Hadoop Summit" href="http://hadoopsummit.org/program/#session36" target="_blank">Apache Hadoop and Virtual Machines</a> by Richard McDougall and Sanjay Radia</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/meet-the-presenters-aaron-myers-from-cloudera-and-suresh-srinivas-from-hortonworks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Announcing Apache Hive 0.9.0</title>
		<link>http://www.cloudera.com/blog/2012/05/announcing-apache-hive-0-9-0/</link>
		<comments>http://www.cloudera.com/blog/2012/05/announcing-apache-hive-0-9-0/#comments</comments>
		<pubDate>Fri, 04 May 2012 17:34:39 +0000</pubDate>
		<dc:creator>Carl Steinbach</dc:creator>
				<category><![CDATA[hive]]></category>
		<category><![CDATA[apache hive]]></category>
		<category><![CDATA[hadoop hive]]></category>
		<category><![CDATA[Hadoop meta data store]]></category>
		<category><![CDATA[Hadoop SQL]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14683</guid>
		<description><![CDATA[This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the Apache archive site. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop. The 0.9.0 release continues the trend of [...]]]></description>
			<content:encoded><![CDATA[<p>This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the <a href="http://archive.apache.org/dist/hive/hive-0.9.0/" target="_blank">Apache archive site</a>. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop.</p>
<p>The 0.9.0 release continues the trend of extending Hive&#8217;s SQL support. Hive now understands the <a href="https://issues.apache.org/jira/browse/HIVE-2005" target="_blank">BETWEEN</a> operator and the <a href="https://issues.apache.org/jira/browse/HIVE-2810" target="_blank">NULL-safe equality operator</a>, plus several new user defined functions (UDF) have now been added. New UDFs include <a href="https://issues.apache.org/jira/browse/HIVE-2695" target="_blank">printf()</a>, <a href="https://issues.apache.org/jira/browse/HIVE-2279" target="_blank">sort_array()</a>, and <a href="https://issues.apache.org/jira/browse/HIVE-1877" target="_blank">java_method()</a>. Also, the <a href="https://issues.apache.org/jira/browse/HIVE-2203" target="_blank">concat_ws()</a> function has been modified to support input parameters consisting of arrays of strings.</p>
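<p>For readers unfamiliar with the two new operators, the following sketch models their standard SQL semantics in Python, with None standing in for NULL. It is an illustration only, not Hive code: the function names are hypothetical, and the BETWEEN helper assumes non-NULL operands. The NULL-safe equality operator (written &lt;=&gt;) differs from ordinary equality in that it never yields NULL and treats two NULLs as equal.</p>

```python
def sql_equals(a, b):
    """Ordinary SQL equality: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """NULL-safe equality: always True or False, and NULL equals NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def between(value, low, high):
    """value BETWEEN low AND high: inclusive on both ends (non-NULL operands assumed)."""
    return low <= value <= high


examples = {
    "1 = 1": sql_equals(1, 1),
    "NULL = NULL": sql_equals(None, None),
    "NULL nullsafe NULL": null_safe_equals(None, None),
    "1 nullsafe NULL": null_safe_equals(1, None),
    "5 BETWEEN 1 AND 5": between(5, 1, 5),
}
```

<p>The practical consequence is that a join or filter on nullable columns can match NULL against NULL with the NULL-safe operator, which ordinary equality would silently drop.</p>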
<p>This Hive release also includes several significant improvements to the query compiler and execution engine. <a href="https://issues.apache.org/jira/browse/HIVE-2642" target="_blank">HIVE-2642</a> improved Hive&#8217;s ability to optimize UNION queries, <a href="https://issues.apache.org/jira/browse/HIVE-2881" target="_blank">HIVE-2881</a> made the map-side JOIN algorithm more efficient, and Hive&#8217;s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in <a href="https://issues.apache.org/jira/browse/HIVE-2621" target="_blank">HIVE-2621</a>.</p>
<p>HBase users will also be interested in several improvements to Hive&#8217;s HBase StorageHandler, mainly:</p>
<ul>
<li>The ability to access primitive types stored in binary format within HBase (<a href="https://issues.apache.org/jira/browse/HIVE-1634" target="_blank">HIVE-1634</a>),</li>
<li>And support for filter-pushdown for keys (<a href="https://issues.apache.org/jira/browse/HIVE-2861" target="_blank">HIVE-2861</a>, <a href="https://issues.apache.org/jira/browse/HIVE-2815" target="_blank">HIVE-2815</a>, <a href="https://issues.apache.org/jira/browse/HIVE-2771" target="_blank">HIVE-2771</a>).</li>
</ul>
<p>Finally, I&#8217;d like to commend Ashutosh Chauhan on a job well done as the release manager for Hive 0.9.0. Ashutosh became a Hive committer six months ago and since then has had a significant impact on the project by doing lots of code reviews, helping answer questions on the mailing list, and through continued patch submissions. He did a great job as a first-time release manager, and I hope that he will reprise this role in the future!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/announcing-apache-hive-0-9-0/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How Treato Analyzes Health-related Social Media Big Data with Hadoop and HBase</title>
		<link>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/#comments</comments>
		<pubDate>Thu, 03 May 2012 13:00:51 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[Cloudera Case Study]]></category>
		<category><![CDATA[Hadoop Case Study]]></category>
		<category><![CDATA[Hadoop in Healthcare]]></category>
		<category><![CDATA[hadoop use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14627</guid>
		<description><![CDATA[This is a guest post by Assaf Yardeni, Head of R&#38;D for Treato, an online social healthcare solution, headquartered in Israel. Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post by Assaf Yardeni, Head of R&amp;D for Treato, an online social healthcare solution, headquartered in Israel. </em></p>
<p>Three years ago I joined <a href="http://treato.com/" target="_blank">Treato</a>, a social healthcare analysis firm, to help <a href="http://www.treato.com/" target="_blank">treato.com</a> scale up to its present capability. Treato is a new source for healthcare information where health-related user-generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.</p>
<h2 style="font-size: 14pt; color: #243543;">Before the Hadoop era</h2>
<p>When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:</p>
<ul>
<li>Crawl the Web and fetch raw HTML sources,</li>
<li>Extract the user-generated content (i.e., users’ posts) out of the raw sources,</li>
<li>Extract concepts from the posts and index them,</li>
<li>Execute semantic analysis on the posts using natural language processing (NLP) algorithms,</li>
<li>And calculate statistics.</li>
</ul>
<p>The prototype proved the initial hypothesis: relevant medical insights can be found in social media if you know how to analyze it. We collected data from dozens of websites, with individual social media posts numbering in the tens of millions. We had a handful of text analysis algorithms and could only process a couple million posts per day, but the results were impressive. We found that we were able to identify side effects through social media long before the FDA or pharmaceutical companies issued initial warnings about them. For example, when we looked at the discussions about Singulair – an asthma medication – we found that almost half of the user-generated content discussed mental disorders. When we looked back through the historical data, we learned that this would have been identifiable in our data four years before the official warning.</p>
<p>In order to gain even more health-related insights, we knew we needed a solution that could crawl and process a larger quantity of data – larger by an order of magnitude. That was the point at which Web scale joined the game. In order to collect massive amounts of posts, we needed to add thousands of data sources. And, of course, all the data we collected would need to be analyzed.</p>
<p>Dealing with a few dozen websites was difficult and costly. But we were able to scale up our Microsoft code to handle collection from several hundred sites, and could process around 250 million posts. We were running a few old IBM boxes that did the collection work and had developed a job manager that administered crawling and fetching tasks. Different servers ran the indexing and the stats calculations, and we had developed a distributed job manager to direct task executions. Different servers were used for serving the data. We didn&#8217;t have any storage solution, and all of the boxes worked with local drives.</p>
<p>Besides the fact that administering the process was hell, it was expensive in terms of CPU, network and input/output (I/O); e.g., after each stage, the data needed to be moved to a different server for the next stage. In addition, our job manager didn’t deal with failures; every time a task failed we needed to handle it manually. Needless to say, supporting collection and analysis of thousands of websites would have been impossible using this approach.</p>
<h2 style="font-size: 14pt; color: #243543;">Looking at scale</h2>
<p>In the beginning of 2010, we started searching for solutions that could support the capabilities we wanted. The requirements included:</p>
<ol>
<li>Reliable and scalable storage.</li>
<li>Reliable and scalable processing infrastructure.</li>
<li>Search engine capabilities (for retrieving posts) with high availability (HA).</li>
<li>Scalable real-time store for retrieving stats, with HA.</li>
</ol>
<p>We wanted the ability to periodically reprocess the data in a timely manner, so new algorithms or other analysis improvements would take effect on all historical data.</p>
<p>We wanted to know how much it costs to deal with X number of posts, and to be able to scale according to this formula.</p>
<p>We wanted a technology and architecture that would scale with the business.</p>
<p>We searched for answers to questions such as &#8220;How does Google do it?&#8221;, and it didn&#8217;t take too long to find Google&#8217;s papers, documentation on Hadoop and MapReduce, and so on.</p>
<p>We started digging deeper in these areas. After a short investigation, it was clear that the Hadoop Distributed File System (HDFS) would support our storage demands, and MapReduce would be a good fit for the processing infrastructure.</p>
<h2 style="font-size: 14pt; color: #243543;">First Hadoop cluster in the lab</h2>
<p>While looking for Hadoop distributions, I encountered <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a> (CDH). However, I decided to start with a manual installation, since this usually helps me better understand how things work. We started a pilot, setting up a two-node cluster on Linux boxes. As mentioned, the first installation was done entirely by hand, using the binaries downloaded from Apache and carefully configuring the system. This process was ugly: I needed to download all sorts of binaries from different sources, deal with networking issues, exchange SSH keys between the nodes, format the file system, and apply all sorts of OS tweaks.</p>
<p>We started testing the behavior of the new technology, first with some simple WordCount and pi calculations, and then we quickly wrote MapReduce (Java) code that did parts of our processing and tested it on real HTML sources. The little cluster just worked: I was able to submit jobs and monitor them; I tested recovery from task failures, node crashes, etc.</p>
<p>Next, I wanted to see how this Hadoop solution scaled. To do this, I installed an additional box and added it to our little Hadoop cluster. It was awesome: after adding the new slave to the cluster, everything was transparent. Suddenly we had more capacity on the file system and more horsepower for processing. The job submission was the same as before: the job submitter (Hadoop client) didn&#8217;t even know that the cluster had changed; it simply got the results more quickly. We were able to crunch some numbers and got a dollar-per-post cost.</p>
<p>So, the evaluation was great, but still there was the awful installation and maintenance process. That’s when we started to test <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a>; I think it was version 2 of CDH back then. We re-installed our little cluster from scratch using this Hadoop distribution. The installation process was much easier, and the documentation helped. The setup took only a couple of hours. (CDH3 takes less than an hour.)</p>
<p>After we found a good package, we wanted to set up a bigger cluster for prototyping, and deeper tests and evaluations. Amazon seemed to be the perfect place for that. Using CDH we set up a 10-node cluster (small instances) on EC2. This was used for performance evaluation, and the processing rate was about 10M-20M posts per day, approximately 6 times higher than the performance from our pre-Hadoop solution.</p>
<p>We decided to go with Hadoop. This was a dramatic decision: we took a company with a Microsoft-oriented development team and ported all the code to Java, all the while adopting a new and very complicated technology stack. This actually meant starting implementation from the beginning, opening a new integrated development environment (IDE) and starting to code from scratch.</p>
<p>In order to reduce risks and avoid critical mistakes, we searched for someone who had &#8220;been there, done that&#8221; so we could learn from them and validate our overall planned new architecture. Cloudera was our first choice; it made sense to go with a company that has multiple setups behind them, some of which are very large clusters. Cloudera sent Solutions Architect Lars George to our offices for two days, and we gave him our suggested design in advance. We felt lucky to have Lars, an HBase committer and author of <a href="http://shop.oreilly.com/product/0636920014348.do" target="_blank"><em>HBase: The Definitive Guide</em></a>, as our consultant since HBase was one of the core technologies we were using.</p>
<p>For the first implementation phase, we decided to go with HDFS, MapReduce and HBase. Our in-house-developed crawlers were using HBase as the store for the list of URLs to be fetched. This table should be able to scale to billions of rows. The fetcher (the component in charge of fetching the raw HTML sources) gets the URL queues out of HBase, runs HTTP requests, and stores the raw HTML sources in large files on top of HDFS (a few gigabytes per file). Neither the crawler nor the fetcher uses any relational database or any other kind of store except HDFS and HBase. These two components are network and I/O intensive, but CPU is not much of an issue.</p>
<p>Next comes the processing. Each line in the HDFS files contains an HTML source and metadata related to this source. For each directory of files in HDFS, the following processing jobs need to be executed:</p>
<ol>
<li>Turn the unstructured HTML into a list of post entities (content, timestamp, etc.)</li>
<li>Each post needs to be processed as follows:
<ul>
<li>Index key terms – extract medical concepts out of the post content, using Treato&#8217;s extensive knowledge base</li>
<li>Execute text analysis algorithms</li>
</ul>
</li>
<li>Calculate all statistics and update the HBase stats tables.</li>
<li>Post all documents (users’ posts) into our search engine (Solr).</li>
</ol>
<p>During this process, many database queries and updates are needed. For example, each post retrieved may potentially already exist in our system, and of course we don&#8217;t want to add a duplicate post to our system, nor invest processing power on documents we already have. In order to accomplish this, we need to calculate a hash for each post, and then check it against a database containing all of the existing hashes. For this purpose HBase works perfectly in terms of both latency and load.</p>
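<p>As a rough illustration of the duplicate check described above, the following Java sketch hashes each post and consults a store of known hashes. The in-memory set stands in for the HBase table, and the class name, the choice of SHA-256, and the whitespace normalization are all assumptions made for this example rather than details of Treato&#8217;s implementation.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: skip posts whose content hash has been seen before. */
public class PostDeduplicator {
    // Stand-in for the HBase table of existing hashes (assumption for this example).
    private final Set<String> knownHashes = new HashSet<>();

    /** Returns a hex-encoded SHA-256 digest of the normalized post content. */
    public static String hashOf(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Collapse whitespace so trivially reformatted copies hash the same.
            String normalized = content.trim().replaceAll("\\s+", " ");
            byte[] bytes = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the post is new and was recorded, false if it is a duplicate. */
    public boolean markIfNew(String content) {
        return knownHashes.add(hashOf(content));
    }

    public static void main(String[] args) {
        PostDeduplicator dedup = new PostDeduplicator();
        System.out.println(dedup.markIfNew("Singulair gave me headaches"));   // first sighting
        System.out.println(dedup.markIfNew("Singulair  gave me headaches ")); // same text, extra spaces
    }
}
```

<p>In a real cluster the set lookup would be a read against the hash table followed by a write, so that only new posts continue to the more expensive analysis stages.</p>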
<p>After the design phase, we started implementation. All R&amp;D teams worked on porting their code into Java, and our Ops team worked on planning the data center (we decided on co-location data center setup).</p>
<p>For the initial setup, we had 11 boxes that comprised our Hadoop cluster, two of which were NameNodes in an active/passive configuration (one was on standby for manual failover in case the active NameNode failed). Nine nodes were slaves hosting the DataNode, TaskTracker and RegionServer daemons. In addition, we had three boxes running ZooKeeper services.</p>
<p>The new system was capable of analyzing 50M posts per day. This was a significant performance improvement. In addition, it was reasonably easy to maintain, reliable, and worked quite smoothly. Of course, there were bumps in the road, but in the end we managed to overcome them all.</p>
<p>We have continued to improve and expand the solution, and today we can process 150 – 200 million user posts per day. In subsequent blog posts, I will share in greater detail our system design, use of HBase, and cluster architecture.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache MRUnit 0.9.0-incubating has been released!</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/#comments</comments>
		<pubDate>Wed, 02 May 2012 04:38:51 +0000</pubDate>
		<dc:creator>Brock Noland</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[Hadoop Testing]]></category>
		<category><![CDATA[Map Reduce Testing]]></category>
		<category><![CDATA[MapReduce Testing]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14603</guid>
		<description><![CDATA[This post was originally posted on the Apache Software Foundation&#8217;s blog. We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was originally published on the <a href="https://blogs.apache.org/mrunit/entry/apache_mrunit_0_9_0" target="_blank">Apache Software Foundation&#8217;s blog</a>.</em></p>
<p>We (the Apache <abbr title="MapReduce Unit">MRUnit</abbr> team) have just released Apache MRUnit 0.9.0-incubating (<a href="http://www.apache.org/dyn/closer.cgi/incubator/mrunit/" target="_blank">tarball</a>, <a href="https://repository.apache.org/index.html#nexus-search;gav~org.apache.mrunit~~~~" target="_blank">nexus</a>, <a href="http://incubator.apache.org/mrunit/documentation/javadocs/0.9.0-incubating/index.html" target="_blank">javadoc</a>). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they&#8217;re deployed to a production system.</p>
<p>The MRUnit project is quite active: 0.9.0 is our fourth release since entering the incubator, and we have added 4 new committers beyond the project&#8217;s initial charter! We are very interested in having new contributors and committers join the project! Please join our <a href="http://incubator.apache.org/mrunit/community/mailing_lists.html" target="_blank">mailing list</a> to find out how you can help!</p>
<p>The MRUnit build process has changed to produce mrunit-0.9.0-hadoop1.jar and mrunit-0.9.0-hadoop2.jar instead of mrunit-0.9.0-hadoop020.jar, mrunit-0.9.0-hadoop100.jar and mrunit-0.9.0-hadoop023.jar. The hadoop1 classifier is for all Apache Hadoop versions based off the 0.20.X line including 1.0.X. The hadoop2 classifier is for all Apache Hadoop versions based off the 0.23.X line including the unreleased 2.0.X.</p>
<p>This <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12311292&#038;version=12316360" target="_blank">release</a> contains 2 new features, 15 improvements and 6 bug fixes. I will highlight a few below:</p>
<ul>
<li>Support custom counter checking in <a href="https://issues.apache.org/jira/browse/MRUNIT-68" target="_blank">MRUNIT-68</a></li>
<li>runTest() should optionally ignore output order in <a href="https://issues.apache.org/jira/browse/MRUNIT-91" target="_blank">MRUNIT-91</a></li>
<li>Driver.runTest throws RuntimeException should it throw AssertionError in <a href="https://issues.apache.org/jira/browse/MRUNIT-54" target="_blank">MRUNIT-54</a></li>
<li>o.a.h.mrunit.mapreduce.MapReduceDriver should support a combiner in <a href="https://issues.apache.org/jira/browse/MRUNIT-67" target="_blank">MRUNIT-67</a></li>
<li>Better support for other serializations besides Writable:  <a href="https://issues.apache.org/jira/browse/MRUNIT-70" target="_blank">MRUNIT-70</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-86">MRUNIT-86</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-99" target="_blank">MRUNIT-99</a>,  <a href="https://issues.apache.org/jira/browse/MRUNIT-77" target="_blank">MRUNIT-77</a></li>
<li>Better error messages from validate, null checking and forgetting to set mappers and reducers: <a href="https://issues.apache.org/jira/browse/MRUNIT-74" target="_blank">MRUNIT-74</a>, <a href="https://issues.apache.org/jira/browse/MRUNIT-66" target="_blank">MRUNIT-66</a>, <a href="https://issues.apache.org/jira/browse/MRUNIT-65" target="_blank">MRUNIT-65</a></li>
<li>Add static convenience methods to PipelineMapReduceDriver class in <a href="https://issues.apache.org/jira/browse/MRUNIT-89" target="_blank">MRUNIT-89</a></li>
<li>Test and Deprecate Driver.{*OutputFromString,*InputFromString} Methods in <a href="https://issues.apache.org/jira/browse/MRUNIT-48" target="_blank">MRUNIT-48</a></li>
</ul>
<h2 style="font-size:14pt;color:#243543;">Support custom counter checking</h2>
<p>It has always been possible to check the counter values like so:</p>
<pre class="code">assertEquals(2, mapDriver.getCounters().findCounter(CustomMapper.CustomCounter.NAME).getValue());
</pre>
<p>but this is quite tedious. As such Jarek Jarcec Cecho (our second newest committer) added this feature directly to the drivers:</p>
<pre class="code">.withCounter(CustomMapper.CustomCounter.NAME, 2);
</pre>
<h2 style="font-size:14pt;padding-top:16px;color:#243543;">runTest() should optionally ignore output order</h2>
<p>Prior to this change, MRUnit required Mapper/Reducer classes to output key-value pairs in the order specified in the test. A well-defined output order is common but not strictly universal. Dave Beech (our newest committer) contributed a patch so you can optionally turn this ordering requirement off by using:</p>
<pre class="code">.runTest(false)
</pre>
<p style="padding-top:12px">instead of</p>
<pre class="code">.runTest()
</pre>
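<p>To illustrate what order-insensitive matching means, here is a small stand-alone Java sketch. It is not MRUnit&#8217;s implementation; the class and method names are hypothetical, and it only shows the idea of comparing expected and actual outputs as multisets rather than as sequences.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of ordered vs. unordered output comparison. */
public class OutputMatcher {
    /** Ordered comparison: same records in the same positions. */
    public static <T> boolean matchesInOrder(List<T> expected, List<T> actual) {
        return expected.equals(actual);
    }

    /** Unordered comparison: same records with the same multiplicity, in any order. */
    public static <T> boolean matchesAnyOrder(List<T> expected, List<T> actual) {
        if (expected.size() != actual.size()) {
            return false;
        }
        List<T> remaining = new ArrayList<>(actual);
        for (T record : expected) {
            // remove(Object) deletes one matching element, so duplicates are counted.
            if (!remaining.remove(record)) {
                return false;
            }
        }
        return remaining.isEmpty();
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("cat\t1", "dog\t2");
        List<String> actual = Arrays.asList("dog\t2", "cat\t1");
        System.out.println(matchesInOrder(expected, actual));  // false
        System.out.println(matchesAnyOrder(expected, actual)); // true
    }
}
```

<p>Counting duplicates matters here: a test that expects one record twice should not pass when the job emits it only once alongside some other record.</p>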
<h2 style="font-size:14pt;line-height:1.3em;padding-top:16px;color:#243543;">Driver.runTest throws RuntimeException should it throw AssertionError</h2>
<p>Previous versions of MRUnit threw a RuntimeException when a test failed. This worked well, but it meant that testing frameworks saw the test as having erred, not failed. We have changed this to AssertionError so that testing frameworks see the tests as failed. The distinction is small but important.</p>
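<p>The difference can be sketched with a toy runner in plain Java. The runner below is hypothetical and far simpler than a real framework such as JUnit; it only shows how a harness typically reports an AssertionError as a failure and any other exception as an error.</p>

```java
/** Hypothetical mini test runner that distinguishes failures from errors. */
public class MiniRunner {
    public enum Outcome { PASSED, FAILED, ERRORED }

    /** Runs one test body and classifies the result by the type of throwable. */
    public static Outcome run(Runnable test) {
        try {
            test.run();
            return Outcome.PASSED;
        } catch (AssertionError e) {
            // An expectation was not met: the code under test produced the wrong result.
            return Outcome.FAILED;
        } catch (RuntimeException e) {
            // Something unexpected broke, possibly in the test setup itself.
            return Outcome.ERRORED;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> { }));                                               // PASSED
        System.out.println(run(() -> { throw new AssertionError("missing output"); }));   // FAILED
        System.out.println(run(() -> { throw new RuntimeException("missing output"); })); // ERRORED
    }
}
```

<p>With the old behavior, a mapper that emitted the wrong record would have landed in the ERRORED bucket of such a runner, which makes a genuine test failure look like a broken test.</p>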
<h2 style="font-size:14pt;color:#243543;">o.a.h.mrunit.mapreduce.MapReduceDriver should support a combiner</h2>
<p>Previously, MRUnit only supported a combiner in the mapred MapReduceDriver class, but now the mapreduce MapReduceDriver also supports a combiner via:</p>
<pre class="code">MapReduceDriver.newMapReduceDriver(mapper, reducer, combiner)</pre>
<p style="padding-top:12px">or</p>
<pre class="code">.withCombiner(combiner) or .setCombiner(combiner)</pre>
<h2 style="font-size:14pt;padding-top:16px;color:#243543;">Better support for other serializations besides Writable</h2>
<p>Previous versions of MRUnit did not support JavaSerialization, Avro or other serialization frameworks well. We improved alternative serialization support by not forcing K2 in MapReduceDriver to be Comparable and by supporting serializations that cannot clone into an object or that do not have default constructors.</p>
<h2 style="font-size:14pt;line-height:1.3em;color:#243543;">Better error messages from validate, null checking and forgetting to set mappers and reducers</h2>
<p>We have improved the checking of parameters passed to MRUnit and the error messages shown when the parameters are invalid. This includes throwing a NullPointerException immediately when receiving a null value, and throwing an IllegalStateException instead of a NullPointerException when no mapper or reducer class is provided.</p>
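<p>The pattern described here can be sketched in plain Java. The class below is a hypothetical stand-in rather than MRUnit code: it rejects null arguments as soon as they are supplied and reports a missing mapper with an IllegalStateException that names the problem.</p>

```java
import java.util.Objects;

/** Hypothetical driver illustrating eager null checks and state validation. */
public class ValidatingDriver {
    private Object mapper; // placeholder type; a real driver would hold a Mapper
    private String input;

    public ValidatingDriver withMapper(Object mapper) {
        // Fail at the call site instead of later, deep inside run().
        this.mapper = Objects.requireNonNull(mapper, "mapper must not be null");
        return this;
    }

    public ValidatingDriver withInput(String input) {
        this.input = Objects.requireNonNull(input, "input must not be null");
        return this;
    }

    public String run() {
        if (mapper == null) {
            throw new IllegalStateException("No mapper was provided; call withMapper() first");
        }
        if (input == null) {
            throw new IllegalStateException("No input was provided; call withInput() first");
        }
        return "ran " + mapper.getClass().getSimpleName() + " on " + input;
    }

    public static void main(String[] args) {
        try {
            new ValidatingDriver().withInput("some text").run();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // explains that the mapper is missing
        }
    }
}
```

<p>The benefit is mostly in the stack trace: the exception points at the line where the bad value was passed or the setup step was skipped, not at an unrelated dereference later on.</p>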
<h2 style="font-size:14pt;color:#243543;">Add static convenience methods to PipelineMapReduceDriver class</h2>
<p>We added static convenience constructors similar to those in the other driver classes:</p>
<pre class="code">PipelineMapReduceDriver.newPipelineMapReduceDriver()</pre>
<p style="padding-top:12px">or</p>
<pre class="code">PipelineMapReduceDriver.newPipelineMapReduceDriver(list of Pair&lt;Mapper, Reducer&gt;)</pre>
<h2 style="font-size:14pt;line-height:1.3em;padding-top:16px;color:#243543;">Test and Deprecate Driver.{*OutputFromString,*InputFromString} Methods</h2>
<p>The OutputFromString and InputFromString methods are now deprecated because they required Text inputs or outputs, with no way to enforce that the inputs or outputs from a mapper or reducer were actually Text. These methods also provided little convenience, as a user can simply wrap the intended string in new Text(string).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-mrunit-0-9-0-incubating-has-been-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Operations Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 13:00:03 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14471</guid>
		<description><![CDATA[HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out. For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/">HBaseCon 2012</a> is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.</p>
<div style="float: right; padding-left: 12px; padding-top: 16px;"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>For those unfamiliar with the Apache HBase project, HBase is open source software that allows for real-time random read/write access to your Big Data in Apache Hadoop with very low latency and high scalability. Presentations in the HBaseCon 2012 Operations track will explain the state of HBase today, how to mitigate HBase failures, and best practices in cluster deployment and cluster monitoring.</p>
<h2 style="font-size: 18pt;">Operations Track Presentations</h2>
<p style="padding-top: 8px;"><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Case Study of HBase Operations at Facebook</span></a><br /> <a href="http://www.hbasecon.com/speakers/ryan-thiessen/">Ryan Thiessen</a>, Facebook</p>
<p>At Facebook we have demanding HBase installations which are used for important and real-time user activity, so failure in an HBase cluster can be a serious issue requiring immediate attention. This session will discuss a variety of real-world scenarios where we have had failures in our HBase systems, how our Operations and Engineering teams have worked to mitigate many of these issues, and where HBase still needs to improve instead of relying on workarounds. The database should never go down. This talk is aimed at developers and other users of HBase (both current and potential) who are interested in an operational perspective on the state of HBase today.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Backup</span></a><br /> <a href="http://www.hbasecon.com/speakers/sunil-sitaula/">Sunil Sitaula</a>, Cloudera<br /><a href="http://www.hbasecon.com/speakers/madhuwanti-vaidya/">Madhuwanti Vaidya</a>, Facebook</p>
<p>Reliable backup and recovery is one of the main requirements for any enterprise-grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such they are looking for backup solutions that are reliable, easy to use, and can work with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve the performance of the backup and recovery process with minimal impact on the production cluster.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">HBase Security for the Enterprise</span></a><br /> <a href="http://www.hbasecon.com/speakers/andrew-purtell/">Andrew Purtell</a>, Trend Micro</p>
<p>Trend Micro developed the new security features in HBase 0.92 and has the first known deployment of secure HBase in production. We will share our motivations, use cases, and experiences, and provide a 10-minute tutorial on how to set up a test secure HBase cluster and a walkthrough of a simple usage example. The tutorial will be carried out live on an on-demand EC2 cluster, with a video backup in case of network or EC2 unavailability.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Developing Real Time Analytics Applications Using HBase in the Cloud</span></a><br /> <a href="http://www.hbasecon.com/speakers/rick-tucker/">Rick Tucker</a>, Sproxil</p>
<p>As small companies are adapting to handle Big Data, the cloud and HBase enable developers to leverage that data to provide revenue-generating real-time applications. When developing a real-time application for an existing system, one must balance incrementing counters in real time with MapReduce jobs over the same data set. When maintaining an analytics platform, ensuring data accuracy is essential. At Sproxil, SMS logs are ingested into HBase at a growing rate and we report metrics such as SMS throughput, unique user growth over time, and return SMS user activity in real time. Sproxil provides a versatile analytics application enabling customers to handpick statistics on demand to gain market insights enabling them to react quickly to trends. This talk will identify the most profitable metrics and demonstrate how to calculate them using MapReduce while continually updating data as it arrives.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Unique Sets on HBase and Hadoop</span></a><br /> <a href="http://www.hbasecon.com/speakers/elliott-clark/">Elliott Clark</a>, StumbleUpon</p>
<p>Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size: 13pt; color: #ba160c; font-weight: bold;">Lightning Talk | Orchestrating Clusters with Ironfan and Chef</span></a><br /> <a href="http://www.hbasecon.com/speakers/robert-berger/">Robert Berger</a>, Runa</p>
<p>This session will discuss how you can represent your complete cluster with one config file and have it deployed to Cloud or Bare Metal. Infochimps’ Ironfan builds on Opscode Chef to allow you to specify and orchestrate all flavors of your cluster’s deployment, monitoring and growth. Not just the core HBase/HDFS/MapReduce/Hive/Flume, etc. but all the elements including web / app servers, MySQL, Redis, RabbitMQ and whatever other servers are needed to implement your service. These same tools can manage variations for development, staging, R&amp;D as well as the target “rendering” to various Clouds, Bare Metal or even Vagrant VMs.</p>
<p><a href="http://hbaseconsf.eventbrite.com/" target="_blank"><img src="http://www.hbasecon.com/wp-content/uploads/2012/02/btn-register-small.png" alt="Register for HBaseCon 2012" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-operations-track/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

