<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; general</title>
	<atom:link href="http://www.cloudera.com/blog/category/general/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Apache HBase 0.94 is now released</title>
		<link>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/#comments</comments>
		<pubDate>Wed, 16 May 2012 16:58:52 +0000</pubDate>
		<dc:creator>Himanshu Vashishtha</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HBase features]]></category>
		<category><![CDATA[HBase release]]></category>
		<category><![CDATA[HBase Update]]></category>
		<category><![CDATA[Real-time Hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14484</guid>
		<description><![CDATA[Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. The main focuses of the 0.94.0 release were performance enhancements and new features, along with several major bug fixes. Performance Related JIRAs Below are a few of the important performance related JIRAs: [...]]]></description>
			<content:encoded><![CDATA[<p>Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. The main focuses of the 0.94.0 release were performance enhancements and new features, along with several major bug fixes.</p>
<h2>Performance Related JIRAs</h2>
<p>Below are a few of the important performance related JIRAs:</p>
<ul>
<li title="HBASE-5074"><strong>Read Caching improvements:</strong> HDFS stores data in one block file and its corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. <a title="HBASE-5074" href="https://issues.apache.org/jira/browse/HBASE-5074">HBASE-5074</a>: &#8220;Support checksums in HBase block cache&#8221; adds a block-level checksum to the HFile itself in order to avoid one disk op, boosting read performance. This feature is <em>enabled</em> by default.</li>
<li><strong>Seek optimizations:</strong> Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each of those files and merge the results, even if the row/column being looked up was in the most recent file. <a title="HBase-4465" href="https://issues.apache.org/jira/browse/HBASE-4465" target="_blank">HBASE-4465</a>: &#8220;Lazy Seek optimization of StoreFile Scanners&#8221; optimizes scanner reads to read the <em>most recent</em> StoreFile first by <em>lazily seeking</em> the StoreFiles. This is achieved by introducing a fake keyvalue with its timestamp equal to the maximum timestamp present in the particular StoreFile. Thus, a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up to the top of the heap, implying a need to do a real read operation. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is <em>enabled</em> by default.</li>
<li><strong>Write to WAL optimizations: </strong>HBase write throughput is bounded above by the write rate of the WAL, because the log is replicated to a number of datanodes, depending on the replication factor. <a title="HBase-4608" href="https://issues.apache.org/jira/browse/HBASE-4608" target="_blank">HBASE-4608</a>: &#8220;HLog Compression&#8221; adds a custom dictionary-based compression of HLogs for faster replication on HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is <em>off</em> by default.</li>
</ul>
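<p>The lazy seek idea can be illustrated with a small, self-contained Python sketch. It is not HBase code: the <code>LazyScanner</code> class and <code>get_latest</code> function are hypothetical stand-ins, and the in-memory dictionaries play the role of StoreFiles. Each scanner enters the heap with a fake entry keyed by its file's maximum timestamp, and a real (disk) seek happens only when that entry reaches the top of the heap.</p>

```python
import heapq


class LazyScanner:
    """Hypothetical stand-in for a StoreFile scanner (not the HBase API)."""

    def __init__(self, name, max_ts, cells):
        self.name = name
        self.max_ts = max_ts      # newest timestamp stored in this file
        self.cells = cells        # {row: (timestamp, value)}
        self.real_seeks = 0       # counts the seeks that would hit disk

    def real_seek(self, row):
        # In a real store this is the step that costs a disk seek.
        self.real_seeks += 1
        return self.cells.get(row)


def get_latest(scanners, row):
    """Return the newest value for row, seeking files only when needed."""
    # Heap entries are (-timestamp, is_real, scanner_index, value).
    # Fake entries (is_real == 0) use the file's max timestamp as an
    # optimistic upper bound on what the file could contain.
    heap = [(-s.max_ts, 0, i, None) for i, s in enumerate(scanners)]
    heapq.heapify(heap)
    while heap:
        neg_ts, is_real, i, value = heapq.heappop(heap)
        if is_real:
            # No remaining file can hold anything newer than this cell.
            return value
        cell = scanners[i].real_seek(row)
        if cell is not None:
            ts, val = cell
            heapq.heappush(heap, (-ts, 1, i, val))
    return None


files = [
    LazyScanner("old", 10, {"r1": (10, "v1")}),
    LazyScanner("mid", 20, {"r1": (20, "v2")}),
    LazyScanner("new", 30, {"r1": (30, "v3")}),
]
print(get_latest(files, "r1"), [s.real_seeks for s in files])
# prints: v3 [0, 0, 1]
```

<p>In this toy run only the newest file is actually read; the two older files are never touched because their upper-bound timestamps cannot beat the value already found.</p>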
<h2>New Feature Related JIRAs</h2>
<p>Here is a list of some of the important JIRAs related to adding new features:</p>
<ul>
<li><strong>More powerful first aid box:</strong> The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. <a href="https://issues.apache.org/jira/browse/HBASE-5128" target="_blank">HBASE-5128: &#8220;Uber hbck&#8221;</a> adds these missing features to the first aid box.</li>
<li><strong>Simplified Region Sizing:</strong> Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, workload, etc. <a title="HBase-4365" href="https://issues.apache.org/jira/browse/HBASE-4365" target="_blank">HBASE-4365</a>: &#8220;Heuristic for Region size&#8221; adds a heuristic that increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.</li>
<li><strong>Smarter transaction semantics: </strong>Though HBase supports single-row transactions, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. <a title="HBase-3584" href="https://issues.apache.org/jira/browse/HBASE-3584" target="_blank">HBASE-3584</a>: &#8220;Atomic Put &amp; Delete in a single transaction&#8221; enhances the HBase single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is <em>on</em> by default.</li>
</ul>
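<p>To make the region sizing heuristic more concrete, here is a minimal Python sketch of one plausible shape such a rule could take. It assumes the split threshold grows with the square of the number of regions a table already has and is capped at a configured maximum. The function name, parameter names, and the default sizes are illustrative assumptions, not values taken from this post or from the HBase configuration.</p>

```python
def split_threshold_mb(region_count, flush_size_mb=128, max_file_size_mb=10240):
    """Illustrative heuristic only: a small table splits early, a large one rarely.

    region_count      -- regions the table already has (assumed input)
    flush_size_mb     -- assumed memstore flush size, in MB
    max_file_size_mb  -- assumed upper bound on the split threshold, in MB
    """
    return min(max_file_size_mb, flush_size_mb * region_count ** 2)


for n in (1, 2, 3, 10):
    print(n, split_threshold_mb(n))
```

<p>With these assumed defaults the threshold climbs from 128 MB for a single region to the 10240 MB cap by the tenth region, so new tables spread across the cluster quickly while established tables stop splitting as often.</p>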
<p>This major release has a number of new features and bug fixes; a total of <a title="397 resolved jiras" href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;jqlQuery=project+%3D+HBASE+AND+fixVersion+%3D+%220.94.0%22+AND+resolution+%3D+Fixed+ORDER+BY+priority+DESC&amp;mode=hide" target="_blank">397 resolved JIRAs</a> with 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up a window of opportunity to backport some of the cool features to CDH4, which is based on the 0.92 branch.</p>
<h2>Acknowledgements</h2>
<p>Thanks to everyone who contributed to this release and a hat tip to Lars Hofhansl of Salesforce for being the release manager.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Meet the Presenter: Todd Lipcon</title>
		<link>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/</link>
		<comments>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/#comments</comments>
		<pubDate>Mon, 14 May 2012 17:44:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[Hadoop Summit]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14877</guid>
		<description><![CDATA[Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit. Question: Tell us about your current role and how you interact with Apache Hadoop? Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open [...]]]></description>
			<content:encoded><![CDATA[<p>Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting <a href="http://hadoopsummit.org/program/#session32" target="_blank"><em>Optimizing MapReduce Job Performance</em></a> at Hadoop Summit.</p>
<h2>Question: Tell us about your current role and how you interact with Apache Hadoop?</h2>
<p><strong>Todd:</strong> I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.</p>
<h2>Question: Tell us about your Hadoop Summit presentation?</h2>
<p><strong>Todd:</strong> At this year’s summit, I will be presenting about the internals of MapReduce and how you can tune your MapReduce jobs for optimal performance. A lot of developers see MapReduce as a black box, but looking inside that box can help you understand where you might have bottlenecks or easy opportunities to improve performance by changing a few configuration parameters.</p>
<h2>Question: What do you expect will be the key takeaway for folks attending your session?</h2>
<p><strong>Todd:</strong> I hope attendees will walk away with a better understanding of each of the phases of MapReduce task execution, and a few key configuration parameters they can play with to get better performance without changing their code.</p>
<h2>Question: What other presentations are you most looking forward to attending?</h2>
<p><strong>Todd:</strong> I’m really looking forward to Josh Wills’ talk on BranchReduce: Distributed Branch-and-Bound on YARN. There are a lot of optimization problems which can be solved by branch-and-bound approaches, and it’s only recently with the introduction of YARN that these types of algorithms can be efficiently built on Hadoop. Not only is this a fresh topic, Josh is also an entertaining speaker!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/meet-the-presenter-todd-lipcon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Manager 4.0 Beta released</title>
		<link>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/</link>
		<comments>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/#comments</comments>
		<pubDate>Mon, 14 May 2012 13:00:12 +0000</pubDate>
		<dc:creator>Aparna Ramani</dc:creator>
				<category><![CDATA[cloudera manager]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14875</guid>
		<description><![CDATA[We&#8217;re happy to announce the Beta release of Cloudera Manager 4.0.  This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition. Please try it out and send your comments to beta@cloudera.com. As always, we look forward to your feedback. ]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re happy to announce the Beta release of Cloudera Manager 4.0. </p>
<p>This version of Cloudera Manager includes support for <a title="Introducing CDH4 Beta 2" href="http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/" target="_blank">CDH4 Beta2</a> and several new features for both the <a title="Free edition" href="https://ccp.cloudera.com/display/FREE400BETA/New+Features+in+Cloudera+Manager+Free+Edition+4.0" target="_blank">Free edition</a> and the <a title="Enterprise edition" href="https://ccp.cloudera.com/display/ENT400BETA/New+Features+in+Cloudera+Manager+4.0" target="_blank">Enterprise edition</a>.</p>
<p>Please <a title="try it out" href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank">try it out</a> and send your comments to beta@cloudera.com. As always, we look forward to your feedback. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/cloudera-manager-4-0-beta-released/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CDH3 update 4 is now available</title>
		<link>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/</link>
		<comments>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/#comments</comments>
		<pubDate>Wed, 09 May 2012 22:13:01 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[cloudera hadoop distribution]]></category>
		<category><![CDATA[hadoop distribuition]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14723</guid>
		<description><![CDATA[We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements. First, there have been a few notable HBase updates. In this release, we&#8217;ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some [...]]]></description>
			<content:encoded><![CDATA[<p>We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.</p>
<p>First, there have been a few notable HBase updates. In this release, we&#8217;ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see <a title="CDH3 update 4 Known Issues and Workarounds" href="https://ccp.cloudera.com/display/CDHDOC/Known+Issues+and+Work+Arounds+in+CDH3" target="_blank">the CDH3 Known Issues and Workarounds page</a> for details.</p>
<p>In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) &#8211; version 1.1.0. A detailed description of what it brings to the table is found <a title="Flume NG blog post" href="http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/" target="_blank">in a previous Cloudera blog post describing its architecture</a>. Please note that we will continue to ship Flume 0.9.4 as well.</p>
<p>More information about how to download or upgrade to CDH3 update 4 can be found <a title="CDH packaging information" href="https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information" target="_blank">in the CDH packaging information webpage</a>.  The patches and JIRAs for update 4 are described in the changes files that can be found <a title=" CDH3 archives" href="http://archive.cloudera.com/cdh/3/" target="_blank">in the CDH3 downloads area</a>.  Additional details are available in the <a title="CDH release notes" href="https://ccp.cloudera.com/display/CDHDOC/New+Features+in+CDH3" target="_blank">CDH3 release notes</a>.</p>
<p>Feedback is always welcome, so please email your thoughts and suggestions to cdh-user@cloudera.com.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/cdh3-update-4-is-now-available/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Introducing CDH4 Beta 2</title>
		<link>http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/</link>
		<comments>http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/#comments</comments>
		<pubDate>Tue, 24 Apr 2012 12:00:43 +0000</pubDate>
		<dc:creator>Charles Zedlewski</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[CDH Release]]></category>
		<category><![CDATA[CDH4]]></category>
		<category><![CDATA[hadoop distribuition]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14350</guid>
		<description><![CDATA[I&#8217;m pleased to inform our users and customers that today we released the second and final beta of Cloudera&#8217;s Distribution Including Apache Hadoop version 4 (CDH4). We received great feedback from the community on the first beta and this release incorporates that feedback as well as a number of new enhancements. CDH4 has a great [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m pleased to inform our users and customers that today we released the second and final beta of Cloudera&#8217;s Distribution Including Apache Hadoop version 4 (CDH4). We received great feedback from the community on the first beta and this release incorporates that feedback as well as a number of new enhancements.</p>
<p>CDH4 has a great many enhancements compared to CDH3.</p>
<ul>
<li>Availability &#8211; a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support</li>
<li>Utilization &#8211; multiple namespaces and a slot-less resource management model</li>
<li>Performance &#8211; improvements in HBase, HDFS, MapReduce, Flume and compression performance</li>
<li>Usability &#8211; broader BI support, expanded API options, a more responsive Hue with broader browser support</li>
<li>Extensibility &#8211; HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage</li>
<li>Security &#8211; HBase table &amp; column level security and Zookeeper authentication support</li>
</ul>
<h2>Some items of note about this beta:</h2>
<p>This is the second (and final) beta for CDH4, and this version has all of the major component changes that we&#8217;ve planned to incorporate before the platform goes GA.  The second beta:</p>
<ul>
<li>Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta</li>
<li>Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian</li>
<li>Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle</li>
<li>Includes a number of improvements to existing components like adding auto-failover support to HDFS&#8217;s high availability feature and adding multi-homing support to HDFS and MapReduce</li>
<li>Incorporates a number of fixes that were identified during the first beta period like removing an HBase performance regression</li>
</ul>
<p>To recap, some CDH components have undergone substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 (aka MR1), which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them. Similarly, Apache Flume in CDH4 is substantially revamped (aka Flume NG). The new design is simpler, more scalable, more manageable and more reliable.</p>
<p>Because of the popularity of the high availability features, <a href="https://ccp.cloudera.com/display/CDH4B2/CDH4+Beta+2+High+Availability+Guide">we&#8217;ve created a high availability guide</a>.  <a href="https://ccp.cloudera.com/display/CDH4B2/CDH4+Beta+2+Documentation">All of the other documentation artifacts</a> have been updated. As always, we maintain complete transparency as to the Apache project releases and patches that make up CDH4. You can find <a href="https://ccp.cloudera.com/display/DOC/CDH+Version+and+Packaging+Information">the documentation for the Apache contents of CDH4 here</a>.</p>
<p>We value your feedback! Please help make this beta a success by trying out CDH4 b2 and letting us know what you think.  If you are a customer, you should give us your feedback via Zendesk. If you are a user but not a customer, please give us your feedback on <a href="https://groups.google.com/a/cloudera.org/group/cdh-user/topics">CDH Users</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>HBaseCon 2012: A Glimpse into the Development Track</title>
		<link>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 22:46:41 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hbase community]]></category>
		<category><![CDATA[HBase Conference]]></category>
		<category><![CDATA[HBase Event]]></category>
		<category><![CDATA[HBaseCon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14331</guid>
		<description><![CDATA[HBaseCon 2012 is nearly a month away, and if the conference agenda and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss. Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hbasecon.com/" target="_blank" title="HBaseCon 2012">HBaseCon 2012</a> is nearly a month away, and if the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">conference agenda</a> and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss.</p>
<p>Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the <a href="http://www.hbasecon.com/program-committee" target="_blank" title="HBaseCon 2012 Program Committee">HBaseCon Program Committee</a>, constructors of the <a href="http://www.hbasecon.com/agenda" title="HBaseCon 2012 Agenda" target="_blank">HBaseCon 2012 agenda</a>, are all committers and PMC members of the Apache HBase project.</p>
<div style="float:right;padding-left:12px"><a href="http://www.hbasecon.com" target="_blank"><img src="http://www.cloudera.com/wp-content/uploads/2012/03/HBaseCon2012-logo-300px.jpg" alt="HBaseCon 2012" width="200px" /></a></div>
<p>Presentations in the HBaseCon 2012 Development track will explain how and why HBase is built the way it is and will also cover HBase schema design and HDFS, the file system on which HBase is most commonly deployed. Presentations in this track include the following.</p>
<h2 style="font-size:16pt">Development Track Presentations</h2>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Learning HBase Internals</span></a><br />
<a href="http://www.hbasecon.com/speakers/lars-hofhansl/">Lars Hofhansl</a>, Salesforce.com</p>
<p>The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users,” and give voice to some of the deep knowledge locked in the committers’ heads.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lessons learned from OpenTSDB</span></a><br />
<a href="http://www.hbasecon.com/speakers/benoit-sigoure/">Benoit Sigoure</a>, StumbleUpon</p>
<p>OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created, one that can store and serve billions of data points forever without the need for destructive downsampling, one that could scale to millions of metrics, and where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and what were some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics discussed will be around the schema, how it impacts performance and allows concurrent writes without need for coordination in a distributed cluster of OpenTSDB instances.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase Schema Design</span></a><br />
<a href="http://www.hbasecon.com/speakers/ian-varley/">Ian Varley</a>, Salesforce.com</p>
<p>Most developers are familiar with the topic of “database design.” In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">HBase and HDFS: Past, Present, and Future</span></a><br />
<a href="http://www.hbasecon.com/speakers/todd-lipcon/">Todd Lipcon</a>, Cloudera</p>
<p>Apache HDFS, the file system on which HBase is most commonly deployed, was originally designed for high-latency high-throughput batch analytic systems like MapReduce. Over the past two to three years, the rising popularity of HBase has driven many enhancements in HDFS to improve its suitability for real-time systems, including durability support for write-ahead logs, high availability, and improved low-latency performance. This talk will give a brief history of some of the enhancements from Hadoop 0.20.2 through 0.23.0, discuss some of the most exciting work currently under way, and explore some of the future enhancements we expect to develop in the coming years. We will include both high-level overviews of the new features as well as practical tips and benchmark results from real deployments.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Relaxed Transactions for HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/francis-liu/">Francis Liu</a>, Yahoo!</p>
<p>For Map/Reduce programmers used to HDFS, the mutability of HBase tables poses new challenges: Data can change over the duration of a job, multiple jobs can write concurrently, writes are effective immediately, and it is not trivial to clean up partial writes. Revision Manager introduces atomic commits and point-in-time consistent snapshots over a table, guaranteeing repeatable reads and protection from partial writes. Revision Manager is optimized for a relatively small number of concurrent write jobs, which is typical within Hadoop clusters. This session will discuss the implementation of Revision Manager using ZooKeeper and coprocessors, and the extra care taken to ensure security in multi-tenant clusters. Revision Manager is available as part of the HBase storage handler in HCatalog, but can easily be used stand-alone with little coding effort.</p>
<p><a href="http://www.hbasecon.com/agenda/"><span style="font-size:13pt;color:#BA160C;font-weight:bold">Lightning Talk | Living Data: Applying Adaptable Schemas to HBase</span></a><br />
<a href="http://www.hbasecon.com/speakers/aaron-kimball/">Aaron Kimball</a>, WibiData</p>
<p>HBase application developers face a number of challenges: schema management is performed at the application level, decoupled components of a system can break one another in unexpected ways, less-technical users cannot easily access data, and evolving data collection and analysis needs are difficult to plan for. In this talk, we describe a schema management methodology based on Apache Avro that enables users and applications to share data in HBase in a scalable, evolvable fashion. By adopting these practices, engineers independently using the same data have guarantees on how their applications interact. As data collection needs change, applications are resilient to drift in the underlying data representation. This methodology results in a data dictionary that allows less-technical users to understand what data is available to them for analysis and inspect data using general-purpose tools (for example, export it via Sqoop to an RDBMS). And because of Avro’s cross-language capabilities, HBase’s power can reach new domains, like web apps built in Ruby.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbasecon-2012-a-glimpse-into-the-development-track/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constructing Case-Control Studies With Hadoop</title>
		<link>http://www.cloudera.com/blog/2012/04/constructing-case-control-studies-with-hadoop-healthcare/</link>
		<comments>http://www.cloudera.com/blog/2012/04/constructing-case-control-studies-with-hadoop-healthcare/#comments</comments>
		<pubDate>Wed, 11 Apr 2012 20:32:36 +0000</pubDate>
		<dc:creator>Josh Wills</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Hadoop and Healthcare]]></category>
		<category><![CDATA[Hadoop Case Study]]></category>
		<category><![CDATA[hadoop use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14197</guid>
		<description><![CDATA[San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming [...]]]></description>
			<content:encoded><![CDATA[<p>San Francisco seems to be having an unusually high number of <a href="http://www.google.org/flutrends/us/#1014221" target="_blank">flu cases/searches this April</a>, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on <a href="http://github.com/cloudera/crunch" target="_blank">Crunch</a>, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: <a href="http://www.bls.gov/ooh/Life-Physical-and-Social-Science/Epidemiologists.htm" target="_blank">epidemiologists</a>.</p>
<h2>Case-Control Studies</h2>
<p>A <a href="http://en.wikipedia.org/wiki/Case-control_study" target="_blank">case-control study</a> is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the &#8216;cases&#8217;) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the &#8216;controls&#8217;). Case-control studies are useful for exploratory analysis because they are relatively cheap to perform, and they have led to many important discoveries, most famously the link between <a href="http://news.bbc.co.uk/2/hi/health/3826939.stm" target="_blank">smoking and lung cancer</a>.</p>
<p>Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend <em>days</em> performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check that we don&#8217;t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. Either way, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.</p>
<p>Designing and analyzing a case-control study is a problem for a statistician. <em>Constructing</em> a case-control study is a problem for a data scientist.</p>
<h2>Applied Auction Theory</h2>
<p>We can think of constructing a case-control study as an <a href="http://amstat.tandfonline.com/doi/abs/10.1198/106186001317114938" target="_blank">assignment problem</a>: we have a bipartite graph, where one set of nodes represents the cases, one set of nodes represents the controls, and the edges between the cases and controls are weighted by the quality of the match between the subjects as determined by the researcher. If a particular case-control pair would not be a suitable match under any circumstances because the patients are not similar enough, there is no edge between them.</p>
<p style="text-align: center;font-weight: bold"><a href="http://www.cloudera.com/wp-content/uploads/2012/04/bipartite1.png"><img class="size-full wp-image-14219" src="http://www.cloudera.com/wp-content/uploads/2012/04/bipartite1.png" alt="" width="1662" height="938" /></a>A small assignment problem</p>
<p>Although MapReduce is great for finding compatible case-control pairs and computing the weights we want to assign to those matches, it&#8217;s not ideal for the kinds of iterative, graph-based computations that we need to do in order to solve the assignment problem. After we use MapReduce to prepare the input, we turn to <a href="http://incubator.apache.org/giraph/" target="_blank">Apache Giraph</a>, a Java library that makes it easy to perform fast, distributed graph processing on Hadoop clusters, to assign cases to controls.</p>
<p>There are many algorithms for solving the assignment problem; our implementation is based on <a href="http://web.mit.edu/dimitrib/www/home.html" target="_blank">Bertsekas</a>&#8217; <a href="http://18.7.29.232/bitstream/handle/1721.1/3154/P-1908-20783037.pdf?sequence=1" target="_blank">auction algorithm</a>. The core idea of the algorithm is that the case subjects bid for the right to be matched with control subjects over a series of rounds, with the bids computed from the edge weights. Assuming that all of the weights are integers, the auction algorithm is guaranteed to converge to an assignment of cases to controls that maximizes the sum of the weights of the matched pairs. Bertsekas&#8217; algorithm is also very easy to parallelize, and has excellent performance on assignment problems that are relatively sparse (i.e., each node is connected to only a small fraction of the total nodes).</p>
<p style="text-align: center;font-weight: bold"><a href="http://www.cloudera.com/wp-content/uploads/2012/04/matched.png"><img class="size-full wp-image-14222" src="http://www.cloudera.com/wp-content/uploads/2012/04/matched.png" alt="" width="1601" height="901" /></a>An optimal matching</p>
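<p>The bidding rounds described above can be sketched on a single machine in a few lines of Python. This is only an illustration, not the Giraph implementation in the toolkit: the function name auction_match, the dictionary input format, the default epsilon, and the sample weights are all assumptions made for this example. It also assumes integer weights and that every case can be matched at the same time.</p>

```python
from collections import deque


def auction_match(weights, epsilon=None, max_bids=1_000_000):
    """Illustrative forward auction (hypothetical helper, not the toolkit's API).

    weights maps each case to a dict of {control: integer match weight};
    a missing control means there is no edge between the two subjects.
    Returns a dict mapping each case to the control it ends up holding.
    """
    cases = list(weights)
    if epsilon is None:
        # Assumption: a bid increment below 1 / (number of cases).
        epsilon = 1.0 / (len(cases) + 1)
    prices = {}   # control -> current price (implicitly 0 until first bid)
    owner = {}    # control -> case currently holding it
    unassigned = deque(cases)
    bids = 0
    while unassigned:
        bids += 1
        if bids > max_bids:
            raise RuntimeError("gave up: no complete matching found")
        case = unassigned.popleft()
        # Net value of each candidate control: weight minus current price.
        values = sorted(
            ((w - prices.get(ctrl, 0.0), ctrl)
             for ctrl, w in weights[case].items()),
            reverse=True,
        )
        if not values:
            raise ValueError(f"{case} has no compatible controls")
        best_value, best_ctrl = values[0]
        # With a single candidate there is no runner-up, so bid only epsilon.
        second_value = values[1][0] if len(values) > 1 else best_value
        # Raise the price by the gap to the runner-up plus epsilon.
        prices[best_ctrl] = (
            prices.get(best_ctrl, 0.0) + (best_value - second_value) + epsilon
        )
        # The previous holder, if any, goes back into the bidding queue.
        previous = owner.get(best_ctrl)
        if previous is not None:
            unassigned.append(previous)
        owner[best_ctrl] = case
    return {case: ctrl for ctrl, case in owner.items()}


# Made-up weights: two cases competing for the same best control.
example = {
    "case_a": {"ctrl_1": 10, "ctrl_2": 7},
    "case_b": {"ctrl_1": 9, "ctrl_3": 2},
}
matching = auction_match(example)
print(matching)
```

<p>In this small example case_a first takes ctrl_1, is outbid by case_b, and then settles for ctrl_2, which gives a total weight of 16 rather than the 12 it would get by keeping ctrl_1. Each control ends up with at most one case because a successful bid sends the previous holder back into the queue.</p>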
<h2>Do It Yourself</h2>
<p>Our <a href="https://github.com/cloudera/matching" target="_blank">toolkit for constructing case-control studies</a> is available on Cloudera&#8217;s GitHub repository and is released under the Apache License. To get started, you will need a cluster that has <a href="http://zookeeper.apache.org/" target="_blank">Apache ZooKeeper</a> installed, which is easy to do on local servers using the free edition of <a href="http://www.cloudera.com/products-services/tools/">Cloudera Manager</a>, or in a cloud environment via the version of <a href="https://ccp.cloudera.com/display/CDHDOC/Whirr+Installation" target="_blank">Apache Whirr in CDH3</a>. If you are just getting started with Hadoop and run into any issues, <a href="http://www.cloudera.com/hadoop-support/">Cloudera Support</a> is happy to help.</p>
<p>This work, like a lot of the work we do, started out as a conversation with a Cloudera customer about a challenge they were facing. If you have a data problem, if no one else can help, and if you can provide chicken soup, maybe you can hire the <a href="http://www.youtube.com/watch?v=yrK0rZj6pes" target="_blank">Cloudera Data Science Team</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/constructing-case-control-studies-with-hadoop-healthcare/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sqoop Graduation Meetup</title>
		<link>http://www.cloudera.com/blog/2012/04/sqoop-graduation-meetup/</link>
		<comments>http://www.cloudera.com/blog/2012/04/sqoop-graduation-meetup/#comments</comments>
		<pubDate>Tue, 10 Apr 2012 17:04:27 +0000</pubDate>
		<dc:creator>Kathleen Ting</dc:creator>
				<category><![CDATA[Connector]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[Apache Sqoop]]></category>
		<category><![CDATA[Hadoop connector]]></category>
		<category><![CDATA[Hadoop SQL]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14154</guid>
		<description><![CDATA[This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was originally published on the Apache Blog:<br />
<a href="https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup">https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup</a></em></p>
<p>Cloudera hosted the <a href="http://www.meetup.com/Sqoop-User-Meetup/events/56531992/">Apache Sqoop Meetup</a> last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.</p>
<p>This was not only Sqoop&#8217;s second Meetup but also a celebration of <a href="https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator">Sqoop&#8217;s graduation</a> from the Incubator, cementing its status as a Top-Level Project at the Apache Software Foundation. Sqoop has come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. Fittingly, Aaron gave the first talk of the night, on the project&#8217;s history: &#8220;<a href="https://cwiki.apache.org/confluence/download/attachments/27361435/Aaron_sqoop+meetup+2012-04-03.pdf?version=1&amp;modificationDate=1333763962392">Sqoop: The Early Days</a>.&#8221; From Aaron, we learned that Sqoop&#8217;s original name was &#8220;SQLImport&#8221; and that it grew out of his frustration at being unable to easily query both unstructured and structured data at the same time.</p>
<p>Closing out the evening, Arvind Prabhakar described the &#8220;<a href="https://cwiki.apache.org/confluence/download/attachments/27361435/Sqoop2_wnotes.pdf?version=1&amp;modificationDate=1326152997000">Highlights of Sqoop 2</a>.&#8221; Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details. For instance, by having a web application run Sqoop, Sqoop can be installed once and used from anywhere. Other goals for the project include easier connector development and security enhancements.</p>
<p>With the conclusion of the scheduled talks, the graduation cake was cut, the <a href="http://t.co/8mruRgAC">swag</a> was passed out, and the hallway talks commenced on the anticipated features of Sqoop 2. We encourage you to participate in and contribute to Sqoop 2&#8217;s <a href="https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2">Design</a> and <a title="" href="https://issues.apache.org/jira/browse/SQOOP-365" target="">Implementation</a>.</p>
<p><img class="alignnone" src="https://blogs.apache.org/sqoop/mediaresource/3fc62680-2501-453e-9063-0d4009ced1cf" alt="" width="480" height="320" /></p>
<p><img class="alignnone" src="https://blogs.apache.org/sqoop/mediaresource/e16282d9-5c9e-40b1-b0e9-e70fc74309de" alt="" width="480" height="320" /></p>
<p><img class="alignnone" src="https://blogs.apache.org/sqoop/mediaresource/42f12675-0bba-41b7-a7db-dfcae1323cdf" alt="" width="480" height="320" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/sqoop-graduation-meetup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HBase Hackathon at Cloudera</title>
		<link>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/</link>
		<comments>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/#comments</comments>
		<pubDate>Fri, 06 Apr 2012 23:32:08 +0000</pubDate>
		<dc:creator>David S. Wang</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Apache HBase]]></category>
		<category><![CDATA[HBase Hackathon]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14127</guid>
		<description><![CDATA[Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012.  The overall theme of the event will be 0.96 stabilization.  If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon.  This is a great [...]]]></description>
			<content:encoded><![CDATA[<p>Cloudera will be hosting an Apache HBase <a title="HBase hackathon Meetup page " href="http://www.meetup.com/hackathon/events/58953522/" target="_blank">hackathon</a> on May 23rd, 2012, the day after <a title="HBaseCon 2012" href="http://hbasecon.com" target="_blank">HBaseCon 2012</a>.  The overall theme of the event will be 0.96 stabilization.  If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon.  This is a great opportunity to contribute some code towards the project and hang out with other HBasers.</p>
<p>More details are on the hackathon&#8217;s <a title="HBase hackathon Meetup page" href="http://www.meetup.com/hackathon/events/58953522/" target="_blank">Meetup</a> page.  Please RSVP so we can better plan lunch, room size, and other logistics for the event.  See you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/hbase-hackathon-at-cloudera/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Bigtop 0.3.0 (incubating) has been released</title>
		<link>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/</link>
		<comments>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/#comments</comments>
		<pubDate>Tue, 03 Apr 2012 17:58:56 +0000</pubDate>
		<dc:creator>Roman Shaposhnik</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[apache bigtop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14011</guid>
		<description><![CDATA[Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested: Apache Hadoop 1.0.1 [...]]]></description>
			<content:encoded><![CDATA[<p>Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:</p>
<ul style="padding-left:20px">
<li>Apache Hadoop 1.0.1</li>
<li>Apache ZooKeeper 3.4.3</li>
<li>Apache HBase 0.92.0</li>
<li>Apache Hive 0.8.1</li>
<li>Apache Pig 0.9.2</li>
<li>Apache Mahout 0.6.1</li>
<li>Apache Oozie 3.1.3</li>
<li>Apache Sqoop 1.4.1</li>
<li>Apache Flume 1.0.0</li>
<li>Apache Whirr 0.7.0</li>
</ul>
<p>The list of supported Linux platforms has expanded to:</p>
<ul style="padding-left:20px">
<li>Fedora 15 and 16</li>
<li>CentOS and Red Hat Enterprise Linux 5 and 6</li>
<li>SuSE Linux Enterprise 11</li>
<li>Ubuntu 10.04 LTS</li>
<li>Mageia 1</li>
</ul>
<p>This, we hope, will make running Apache Hadoop with Bigtop the most seamless experience to date for our user community: just follow our <a title="Installation Guide" href="https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop" target="_blank">Installation Guide</a> and you will have your first pseudo-distributed Hadoop PI or Hive query running in no time.</p>
<p>If you&#8217;re thinking about deploying Bigtop to a fully distributed cluster, you might find our improved <a title="Puppet" href="http://puppetlabs.com/" target="_blank">Puppet</a> code to be of assistance. There is some <a title="brief documentation" href="https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.3/bigtop-deploy/puppet/README.md">brief documentation</a> on how to run our Puppet recipes in a masterless Puppet configuration, but they should work just fine in a typical Puppet master setup as well.</p>
<p>Whatever you do, don&#8217;t forget to check us out at <a title="Apache" href="http://incubator.apache.org/bigtop/" target="_blank">Apache</a> and consider getting involved. Bigtop is a community-driven effort and we need your help. Of course, above all we need you to use Bigtop and give us your feedback.</p>
<p>Happy Big Data discoveries,<br />Your faithful and tireless Bigtop development team!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/04/apache-bigtop-0-3-0-incubating-has-been-released/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

