Developer Center
Cloudera Blog · CDH Posts
CDH3 update 4 is now available

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

How Treato Analyzes Health-related Social Media Big Data with Hadoop and HBase

This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.

Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.

Before the Hadoop era

When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:

Introducing CDH4 Beta 2

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

High Availability for the Hadoop Distributed File System (HDFS)

Background

Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.

HDFS has long been considered a highly reliable file system.  An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.

Despite this very high level of reliability, HDFS has always had a well-known single point of failure which impacts HDFS’s availability: the system relies on a single Name Node to coordinate access to the file system data. In clusters which are used exclusively for ETL or batch-processing workflows, a brief HDFS outage may not have immediate business impact on an organization; however, in the past few years we have seen HDFS begin to be used for more interactive workloads or, in the case of HBase, used to directly serve customer requests in real time. In cases such as this, an HDFS outage will immediately impact the productivity of internal users, and perhaps result in downtime visible to external users. For these reasons, adding high availability (HA) to the HDFS Name Node became one of the top priorities for the HDFS community.

Thoughts on Cloudera and Cisco UCS reference architecture for Hadoop

Cloudera and Cisco jointly announced a reference architecture for running Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager on Cisco’s Unified Computing System (UCS) last November. It was the first Apache Hadoop reference architecture assembled by Cisco, and is proudly certified by Cloudera.

I bring a different perspective on the Cloudera-Cisco relationship, as I worked for over five years in Cisco on the software powering the Nexus 5000 series switches and the Cisco Virtual Interface Card. I now work at Cloudera on the HBase team, and can fully appreciate the synergies that the Cloudera and Cisco reference architecture brings to the table.

I know that Cisco is actively investing in development and support for the rack servers and switches that comprise UCS. Working at Cloudera now, I see the tireless effort that goes into each CDH and Cloudera Manager release from leaders in the Apache Hadoop community.

Indexing Files via Solr and Java MapReduce

Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.

What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to Sematext for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.

First things first

The way that I got started was by instantiating a new CentOS 6 Virtual Machine. You can pick a different flavor of Linux if that suits you; Hadoop should work fine on any (though advocated distros are SuSE, Ubuntu/Debian, RedHat/CentOS).

MapReduce 2.0 in Hadoop 0.23

In Building and Deploying MR2 we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design. 

Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.

MapReduce 2.0 (a.k.a. MRv2 or YARN):

The new architecture divides the two major functions of the JobTracker – resource management and job life-cycle management – into separate components:

Introducing CDH4

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

Cloudera Connector for Tableau Has Been Released

Earlier today, Cloudera proudly released the Cloudera Connector for Tableau. The availability of this connector serves both Tableau users who are looking to expand the volume of datasets they manipulate and Hadoop users who want to enable analysts like Tableau users to make the data within Hadoop more meaningful. Enterprises can now extract the full value of big data and allow a new class of power users to interact with Hadoop data in ways they priorly could not.

The Cloudera Connector for Tableau is a free ODBC Driver that enables Tableau Desktop 7.0 to connect to Apache Hive. Tableau users can thus leverage Hive, Hadoop’s data warehouse system, as a data source for all the maps, charts, dashboards and other artifacts typically generated within Tableau.

Hive itself is a powerful query engine that is optimized for analytic workloads, and that’s where this Connector is sure to work best. Tableau also, however, lets users ingest result sets from Hive into its in-memory analytical engine so that results returning from Hadoop can be analyzed much more quickly.

CDH3, update 3 now available

Keeping with our release policy for Cloudera’s Distribution Including Apache Hadoop (CDH) I’m pleased to announce the availability of update 3 for CDH3.  As a reminder, we ship updates for our most recent GA distribution every 3 months.  Updates primarily include bug fixes but when possible we will also include features from our mid-term roadmap.  We’ll only include new features when they do not introduce instability or break compatibility.  As always, users have the option to skip updates without incurring any future upgrade cost.

Update 3 contains a number of new improvements.  Several improvements positively impact performance.  Enhancements were made to HDFS and to HBase which will result in 15-150% improvements in performance compared to CDH3 update 2 depending on the workload.  Users should see performance gains in a wide range of workloads from MapReduce over HDFS style workloads to HBase scan style workloads to HBase random read / write workloads.  Todd Lipcon’s talk at Hadoop World on performance outlines a number of these improvements that have made it to update 3.  Some of these performance improvements require users to select specific configuration settings so please consult the documentation.

A number of other improvements have been made that will help system stability and recover-ability. Enhancements were made to MapReduce to better work around disk failures without impacting task locality.  We also backported the Apache HBase distributed log splitting feature that will make recovery from region failure much faster than it has been previously.