Cloudera customers usually have two major sources of data: log files, which can be imported to Hadoop via Flume, and relational databases. Throughout the previous releases of CDH2 and CDH3, Cloudera has included a package we’ve developed called Sqoop. Sqoop can perform batch imports and exports between relational databases and Hadoop, storing data in HDFS and creating Hive tables to hold results. We described its motivation and some use cases in a previous blog post a while ago. In CDH3b2, we’ve included a greatly-expanded version of Sqoop which has had a major overhaul since previous releases. This version is important enough that we’re deeming it the “1.0″ release of Sqoop. In this blog post we’ll cover the highlights of the new features available in Sqoop.
New Interface
The biggest change you’ll notice is that the Sqoop command-line interface has completely changed. Users who have been embedding Sqoop in scripts may be frustrated by this incompatible change, but we think that given the amount of functionality available in Sqoop now, some refactoring is necessary, and this is the correct opportunity to do it. Sqoop is now arranged as a set of tools. If you type sqoop help, you’ll see the list of tools available. Most of the original funtionality is contained in a tool called import; running sqoop help import will list the options available to this tool.
Improved Export Performance
In CDH3b1 we provided basic support for exports: the ability to take results from HDFS and insert them back into a database. CDH3b2 features a completely rewritten export pipeline which demonstrates considerably greater throughput and scalability. You can now export gigabytes of data with high performance. For MySQL users, we’ve added a separate “direct mode” channel that uses mysqlimport to perform this job even faster.
Large Object Support
The amount one can learn from data is often immense, and CDH makes it easier for analysts, researchers, programmers, and organizations to extract meaningful information from their data. We’d like to invite you to come build, play, and experiment with our newest release of CDH, CDH3, at our hackathon on July 27th. The CDH hackathon will be an opportunity to team up with Cloudera employees and other community members to use CDH3 to solve fun and interesting problems. Each attendee is encouraged to come to the hackathon with an idea about what to build. Then, at the beginning of the day, we’ll find groups to work with and start hacking. Some example projects might be to collect Twitter data and perform social graph analysis. Or maybe to analyze the Enron email data to find friend circles and categorize them into fraudulent or innocent. At the end of the day we’ll do an anonymous poll and vote on the best project. There will be a prize for the winning group!
We’ll get started at 9:30am on July 27th at Cloudera Headquarters. We’ll finish sometime around 7:30pm. Lunch will be provided by our investor Accel.
In summary:
What: Build applications using CDH to do cool things; collaborate with the community and Cloudera employees
Where: Cloudera HQ — 210 Portage Ave., Palo Alto, CA 94306
When: Tuesday July 27, 2010, 9:30am
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available. While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long running jobs and interfacing with the distributed file system require extensive work to port general workflow systems to the Hadoop environment. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly light-weight, requires minimal configuration, and scales linearly – thus offering a sustainable approach to building workflows in the Hadoop environment.
CDH3 beta 2 includes Pig 0.7.0, the latest and greatest version of the popular dataflow programming environment for Hadoop. In this post I’ll review some of the bigger changes that went into Pig 0.7.0, describe the motivations behind these changes, and explain how they affect users. Readers in search of a canonical list of changes in this new version of Pig should consult the Pig 0.7.0 Release Notes as well as the list of backward incompatible changes.
Load-Store Redesign
The biggest change to appear in Pig 0.7.0 is the complete redesign of the LoadFunc and StoreFunc interfaces. The Load-Store interfaces were first introduced in version 0.1.0 and have remained largely unchanged up to this point. Pig uses a concrete instance of the LoadFunc interface to read Pig records from the underlying storage layer, and similarly uses an instance of the StoreFunc interface when it needs to write a record. Pig provides different LoadFunc and StoreFunc implementations in order to support different storage formats, and since this is a public interface users may provide their own implementations as well.
The primary motivation for redesigning these interfaces is to bring them into closer alignment with Hadoop’s InputFormat and OutputFormat interfaces, with the goal of making it much easier to write new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes. At the same time the new interfaces were also made a lot more powerful by providing direct access to configurations as well as the ability to selectively read individual columns.
As part of our series of announcements at the recent Hadoop Summit, Cloudera released two of its previously internal projects into open source. One of those was the HUE user interface environment, which we’ll be saying a bit more about later this week. The other was our data movement platform Flume. We’ve been working on Flume for many months, and it’s really exciting to be able to share the details of what we’ve been doing. In this blog post I’d like to introduce Flume to the world, and say a little about how it might help problems with data collection that you might be facing right now.
What is Flume?
We’ve seen our customers have great success using Hadoop for processing their data, but the question of how to get the data there to process in the first place was often significantly more challenging. Many customers had produced ad-hoc solutions with complicated shell scripts and periodically running batch copies. Such solutions, while minimally effective, don’t allow the user any insight into how they were running, whether or not they were succeeding and whether or not any data were being lost. Changing or reconfiguring the scripts to collect more or different data was hard. They were, at best, ‘hit and hope’ solutions.
At the same time, we observed that much more data are being produced than most organisations have the software infrastructure to collect. We are very keen to allow our users to take advantage of all the data that their cluster is generating. Looking around, we saw no solutions that supported all the features that we wanted to provide to our customers, incuding reliable delivery of data and an easy configuration system that didn’t involve logging in to a hundred machines to restart a process, as well a powerful extensibility solution for easy integration with a wide variety of data sources.
- by Patrick Hunt
- July 12, 2010
- no comments
CDH3 beta 2 is the first version of CDH to incorporate Apache ZooKeeper. ZooKeeper is a highly reliable and available coordination service for distributed processes. It is a proven technology and a well established open source project at Apache (sub-project of Hadoop).
ZooKeeper is distributed coordination
Often distributed applications need some way to coordinate across processes; locking resources, managing queues of events, electing a “leader” process, configuration, etc… Coordination operations such as these are notoriously hard to get right. ZooKeeper provides a relatively simple API which allows clients to correctly implement these and many other coordination mechanisms.
ZooKeeper is itself a replicated service based on a quorum algorithm. One or more ZooKeeper servers form what’s called an “ensemble”, which are in constant communication. As the size of the ensemble increases the reliability of the service itself increases – as long as a majority of the configured ensemble servers are available the service is available. As an example, say you have an ensemble of size three (three ZooKeeper servers), if one of the three fail the service is still “up”. If two of the three fail the service is down. One could run with five servers, in which case if two servers fail the service as a whole would still be available. Seven server ensembles can survive three failures, and so on.
Over the last two years, Cloudera has helped a great number of customers achieve their business objectives on the Hadoop platform. In doing so, we’ve confirmed time and time again that HDFS and MapReduce provide an incredible platform for batch data analysis and other throughput-oriented workloads. However, these systems don’t inherently provide subsecond response times or allow random-access updates to existing datasets. HBase, one of the new components in CDH3b2, addresses these limitations by providing a real time access layer to data on the Hadoop platform.
HBase is a distributed database modeled after BigTable, an architecture that has been in use for hundreds of production applications at Google since 2006. The primary goal of HBase is simple: HBase provides an immensely scalable data store with soft real-time random reads and writes, a flexible data model suitable for structured or complex data, and strong consistency semantics for each row. As HBase is built on top of the proven HDFS distributed storage system, it can easily scale to many TBs of data, transparently handling replicated storage, failure recovery, and data integrity.
In this post I’d like to highlight two common use cases for which we think HBase will be particularly effective:
Analysis of continuously updated data
In this post I’ll cover some of the larger or more significant changes that have gone into core Hadoop in CDH3 beta 2.
The Hadoop in CDH3 is based on the latest Apache Hadoop core release – version 0.20.2 – which was released February 26th, 2010. Details of what changed in the Apache Hadoop dot release can be found in the release notes and change log. We’ve included hundreds of additional bug fixes, improvements and features atop the base Apache release. You can see these in the CDH3 beta 2 release notes and change log. The version (hadoop-0.20.2+320) indicates the base Apache Hadoop release version and the number of additional patches that have been applied to this release. The changes in CDH3 beta 2 are primarily focused on improving Hadoop’s internals as opposed to adding user-facing APIs.
The biggest addition to CDH3 beta 2 is the incorporation of the 0.20 append branch to enable HBase. The 0.20 append branch is a version of Hadoop 0.20 that supports the sync method to provide durability for the HBase edits log. See the this page on the HBase wiki and look out for an upcoming post by Todd Lipcon for more detail.
Today’s a big day for us at Cloudera. We’re announcing, as part of our activity at Hadoop Summit, two major new releases that we believe substantially advance Apache Hadoop for both the open source community and our enterprise customers.
First, we’re announcing a new release of Cloudera’s Distribution for Hadoop – CDH3 Beta 2. This release, built on more than a year and a half of extensive engagement with real customers in the market, is the most comprehensive, capable and usable package available. Of course the Apache Hadoop project is the heart of our distribution, but we’ve added eight other open source packages that provide critical infrastructure and tools that are required to use Hadoop effectively in production.
The additional packages include HBase, the popular distributed columnar storage system with fast read-write access to data managed by HDFS, Hive and Pig for query access to data stored in a Hadoop cluster, Apache Zookeeper for distributed process coordination and Sqoop for moving data between Hadoop and relational database systems. We’ve adopted the outstanding workflow engine out of Yahoo!, Oozie, and have made contributions of our own to adapt it for widespread use by general enterprise customers. We’ve also released – this is a big deal, and I’m really pleased to announce it – our continuous data loading system, Flume, and our Hadoop User Environment software (formerly Cloudera Desktop, and henceforth “Hue”) under the Apache Software License, version 2.
Hadoop was created by