Cloudera Blog · General Posts
Apache HBase 0.94 is now released

Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. The main focuses of the HBase 0.94.0 release were performance enhancements and new features, along with several major bug fixes.

Performance-Related JIRAs

Below are a few of the important performance-related JIRAs:

Meet the Presenter: Todd Lipcon

Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit.

Question: Tell us about your current role and how you interact with Apache Hadoop?

Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.

Question: Tell us about your Hadoop Summit presentation?

Todd: At this year’s summit, I will be presenting about the internals of MapReduce and how you can tune your MapReduce jobs for optimal performance. A lot of developers see MapReduce as a black box, but looking inside that box can help you understand where you might have bottlenecks or easy opportunities to improve performance by changing a few configuration parameters.

Question: What do you expect will be the key takeaway for folks attending your session?

Cloudera Manager 4.0 Beta released

We’re happy to announce the Beta release of Cloudera Manager 4.0. 

This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition.

Please try it out and send your comments to beta@cloudera.com. As always, we look forward to your feedback. 

CDH3 update 4 is now available

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruption. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

Introducing CDH4 Beta 2

I’m pleased to inform our users and customers that today we released the second and final beta of Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4). We received great feedback from the community on the first beta, and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

HBaseCon 2012: A Glimpse into the Development Track

HBaseCon 2012 is nearly a month away, and if the conference agenda and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss.

Apache HBase is an open source software project that gives users real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the members of the HBaseCon Program Committee, who constructed the HBaseCon 2012 agenda, are all committers and PMC members of the Apache HBase project.

Constructing Case-Control Studies With Hadoop

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). Case-control studies are useful for exploratory analysis because they are relatively cheap to perform, and they have led to many important discoveries, most famously the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check that we have not assigned a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. Either way, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.
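
To make the constraint concrete, here is a minimal single-machine sketch in Python. It is not the approach developed in this post, only an illustration of the rule that each control may be matched with at most one case. The function names (match_controls, is_feasible), the record fields (id, age_band, sex), and the choice to treat subjects as similar only when their keys are exactly equal are all assumptions made for the example. Real studies typically use looser similarity criteria, and a greedy pass like this one is not guaranteed to be optimal under them.

```python
from collections import defaultdict


def match_controls(cases, controls, key, controls_per_case):
    """Greedily assign controls to cases that share the same matching key.

    Each control is removed from the pool as soon as it is assigned, so it
    can be matched with at most one case. (Hypothetical helper, not from the
    original post.)
    """
    # Group the unassigned controls by their matching key.
    pool = defaultdict(list)
    for control in controls:
        pool[key(control)].append(control)

    matches = {}
    for case in cases:
        candidates = pool[key(case)]
        take = min(controls_per_case, len(candidates))
        # pop() removes the control from the pool, enforcing the constraint.
        chosen = [candidates.pop() for _ in range(take)]
        matches[case["id"]] = [c["id"] for c in chosen]
    return matches


def is_feasible(matches, controls_per_case):
    """Return True if every case received the required number of controls."""
    return all(len(ids) >= controls_per_case for ids in matches.values())


def demo_key(subject):
    # Assumed matching criteria: same age band and same sex.
    return (subject["age_band"], subject["sex"])


demo_cases = [
    {"id": "case-1", "age_band": "40-49", "sex": "F"},
    {"id": "case-2", "age_band": "40-49", "sex": "F"},
]
demo_controls = [
    {"id": "ctrl-1", "age_band": "40-49", "sex": "F"},
    {"id": "ctrl-2", "age_band": "40-49", "sex": "F"},
    {"id": "ctrl-3", "age_band": "40-49", "sex": "F"},
]
demo_matches = match_controls(demo_cases, demo_controls, demo_key, 2)
demo_ok = is_feasible(demo_matches, 2)
```

With three eligible controls and two cases that each need two, one case necessarily comes up short, so the feasibility check fails for this toy data. The difficulty described above comes from doing the same bookkeeping when the cases or the controls are split across many machines.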

Sqoop Graduation Meetup

This blog post was originally published on the Apache Blog:
https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption. 

This was not only Sqoop’s second Meetup but also a celebration of Sqoop’s graduation from the Incubator, cementing its status as a Top-Level Project at the Apache Software Foundation. Sqoop has come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. It was therefore fitting that Aaron gave the first talk of the night, discussing its history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration with the inability to easily query both unstructured and structured data at the same time.

HBase Hackathon at Cloudera

Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.

More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!

Apache Bigtop 0.3.0 (incubating) has been released

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested: