The past month has been exciting and productive for the community using and developing Cloudera’s Flume! This young system is a core part of Cloudera’s Distribution for Hadoop (CDH) that is responsible for streaming data ingest. There has been a great influx of interest and many contributions, and in this post we will provide a quick summary of this month’s new developments. First, we’re happy to announce the availability of Flume v0.9.1 and we will describe some of its updates. Second, we’ll talk about some of the exciting new integration features coming down the pipeline. Finally, we will briefly mention some community growth statistics, as well as some recent and upcoming talks about Flume.
Flume v0.9.1
Flume v0.9.1 is now available in both tarball and packaged forms. This version resolves 63 issues and contains several key improvements and bug fixes. Much of this release is focused on improving the stability of Flume’s internals to help users quickly get Flume up and running and to help developers build extensions to Flume.
In anticipation of Hadoop World 2010 in New York – October 12th, we continue our Q&A series with Hadoop World presenters to provide a taste of what attendees can expect. We’re excited about the 36 presentations that are planned (see agenda), including talks from eBay, Twitter, GE, Bank of America, Facebook, Digg, HP and more. Tim O’Reilly, founder of O’Reilly Media, is keynoting, which should be inspiring as well as thought-provoking. Everyone who registers for Hadoop World will receive a free copy of the second edition of Tom White’s Hadoop: The Definitive Guide.
Hadoop World 2010 presenter Saptarshi Guha works in the Department of Statistics at Purdue University. His presentation for Hadoop World is titled “Using R and Hadoop to Analyze VoIP Network Data for QoS.” Guha has been developing with Hadoop and R for over a year.
Q: What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
The quality of VoIP calls is susceptible to the queuing effects introduced by the network gateways. The jitter between two consecutive packets is the deviation of the real inter-arrival time from the theoretical one.
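As a rough illustration of the jitter definition above, here is a minimal Python sketch. The talk itself uses R and Hadoop; the function name, the sample arrival times, and the 20 ms nominal packet interval are assumptions made for this example, not details from the presentation.

```python
from typing import List


def packet_jitter(arrival_times_ms: List[float], nominal_interval_ms: float) -> List[float]:
    """Return the jitter for each pair of consecutive packets.

    Jitter is taken here as the real inter-arrival time minus the
    theoretical (nominal) inter-arrival time, as defined in the text.
    """
    # Real inter-arrival time between each packet and the one before it.
    gaps = [later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:])]
    # Deviation of each real gap from the theoretical gap.
    return [gap - nominal_interval_ms for gap in gaps]


if __name__ == "__main__":
    # Hypothetical arrival timestamps in milliseconds.
    arrivals = [0.0, 20.0, 45.0, 60.0]
    print(packet_jitter(arrivals, nominal_interval_ms=20.0))
```

A positive value means the packet arrived later than expected relative to its predecessor, and a negative value means it arrived earlier.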
Migrating to CDH – August 2
You will learn everything you need to know about migrating to CDH3b2, from why to migrate to how to test.
Flume community update - August 3
In this blog post we address Flume issues and talk about new features and improvements to the platform.
Hadoop World: early-bird rate ends on August 11 – August 9
The early-bird registration window may have passed, but it is not too late to register for Hadoop World. Register Now!
Cloudera’s Henry Robinson to speak at Hadoop Day in Seattle - August 10
Excitement is building as Hadoop World nears and we are sitting down with some of our presenters to ask them a few questions regarding their presentations and how they are using Hadoop within their organization. Here we speak with Philip Kromer, President of Infochimps, who answers questions regarding his presentation, how Hadoop is used in his business, and what he aims to get out of Hadoop World. Philip’s presentation at Hadoop World is about the development of a data marketplace and commoditization, and their chimpanzee-style approach to data processing. Attend Hadoop World October 12th in New York to hear more from Philip and to talk with him.
What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
We’re now able to quantify aspects of human behavior never before accessible. Twitter, the News stream, the Smart Grid, are exquisite lab instruments for measuring ‘Conversation’, ‘Interest’, ‘Activity’. What’s more, with enough data, machine-learning algorithms and big data tools let us expose insight using only the *structure*, not the content of the data. The massive quantity and connectivity required demand industrial-strength tools such as Hadoop.
We do *all* our data processing in high level tools (chiefly Pig and Wukong) — “black boxes with flexible glue”. We use ‘programmer fun’ + ‘programmer time’ as our primary development metrics. Together, writing simple loosely coupled scripts lets us run the fast experiment-driven design cycles that a lean startup demands. It has also let us grow our own talent and recruit outside CS (physicists, in particular, dream in map reduce). I think this approach should have strong appeal to small- and medium-sized businesses, or anyone looking for low barrier-to-adoption of Hadoop.
Do you have Hadoop in production use today?
That’s right, sign up for any of the training courses surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.
If you are a manager trying to decide whether Hadoop is an appropriate technology for your organization, Hadoop Essentials for Managers will answer your questions. We will show you when using Hadoop is appropriate, what Hadoop is being used for in a range of industries, how Hadoop fits into your existing environment and what you need to know in order to deploy it within your organization.
Why not turn your Hadoop World trip into a multiple day Hadoop learning extravaganza by attending one of our two-day sessions? Both the developer and administrator training courses culminate in an exam which, when passed, confers Cloudera Certified Hadoop Developer or Administrator status.
Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises. Below are three examples of leading enterprises that will present how Hadoop has impacted their businesses.
GE Product Manager Linden Hillenbrand will be talking about how Hadoop has improved GE’s Marketing & Communications functions. One capability GE has implemented is assessing the external perception of GE (positive, neutral, or negative) through various marketing campaigns.
Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it’s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn’t suitable for production use.
With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can’t hope to cover everything, so please consult the release notes (Common, HDFS, MapReduce) and the change logs (Common, HDFS, MapReduce) for the full details. Also, please let us know in the comments about any features, improvements, or bug fixes that you are excited about.
You can download Hadoop 0.21.0 from an Apache Mirror. Thanks to everyone who contributed to this release!
Cloudera’s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.
Enter the code “london_10pct” when registering and receive a 10% discount!
This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide. You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines with hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of un-structured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running Cloudera’s Distribution for Hadoop (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short period of time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.
It’s easy to get started with Hadoop administration because Linux system administration is a pretty well-known beast, and because systems administrators are used to administering all kinds of existing complex applications. However, there are many common missteps we’re seeing that make us believe there’s a need for some guidance in Hadoop administration. Most of these mistakes come from a lack of understanding about how Hadoop works. Here are just a few of the common issues we find:
Lack of configuration management
It makes sense to start with a small cluster and then to scale out over time as you find initial success and your needs grow. Without a centralized configuration management framework, you end up with a number of issues that can cascade just as your usage picks up. For example, ssh-ing and scp-ing files around by hand is a fine way to manage a small handful of machines, but as soon as your cluster gets to 5 or more nodes (let alone tens or hundreds, where Hadoop really shines), it becomes very cumbersome to manage and confusing to keep track of which files go (and have gone) where. Also, as the cluster evolves and becomes more heterogeneous, you have different versions of config files to manage, with each version changing over time. This adds a version control requirement to your configuration management. A configuration management solution might include a parallel shell and other established routines for starting and stopping cluster processes, copying files around the cluster, and making sure cluster configurations are kept in sync.
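To make the "kept in sync" idea concrete, here is a minimal Python sketch of a drift check that compares a checksum of each node's config file and reports the nodes that differ from the majority. It is an illustration only: the host names and file contents are hypothetical, and it assumes the file contents have already been collected from the nodes (for example by a parallel shell or a configuration management tool), which the sketch does not do.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_hosts_by_config(configs: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group host names by the SHA-256 digest of their config file contents."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, content in sorted(configs.items()):
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(host)
    return dict(groups)


def find_drifted_hosts(configs: Dict[str, bytes]) -> List[str]:
    """Return hosts whose config differs from the most common version."""
    groups = group_hosts_by_config(configs)
    if len(groups) <= 1:
        return []  # every host has an identical file (or there are no hosts)
    # Treat the version shared by the most hosts as the reference copy.
    majority = max(groups.values(), key=len)
    return sorted(host
                  for hosts in groups.values() if hosts is not majority
                  for host in hosts)


if __name__ == "__main__":
    # Hypothetical contents of the same config file on three nodes.
    collected = {
        "node01": b"<property><name>dfs.replication</name><value>3</value></property>",
        "node02": b"<property><name>dfs.replication</name><value>3</value></property>",
        "node03": b"<property><name>dfs.replication</name><value>2</value></property>",
    }
    print(find_drifted_hosts(collected))
```

Using the majority version as the reference is a simplification. With version-controlled configs, each node's checksum would more likely be compared against the committed copy for that node's role.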

Hadoop was created by