Developer Center
Cloudera Blog · Guest Posts

Improving Hotel Search: Hadoop @ Orbitz Worldwide

This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide . You can hear more from Jonathan at Hadoop World October 12th in NYC.

Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub, Additionally, the company operates business-to-business service: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.

Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of un-structured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running Cloudera’s Distribution for Hadoop (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short period of time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.

Integrating Hive and HBase

This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.

As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse.  Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems.  Hive storage is based on Hadoop’s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs).  However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.

Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables.  Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions.  This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.

That’s where HBase comes in.  HBase is a scaleout table store which can support a very high rate of row-level updates over massive amounts of data.  It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes.  Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables.  As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables.  Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store

This post was contributed by Kurt Rohloff, a researcher in the Information and Knowledge Technologies group of Raytheon BBN Technologies, a wholly owned subsidiary of Raytheon Company.

Using Hadoop to Build a Scalable, Distributed Triple Store

The driving idea behind Semantic Web is to provide a web-scale information sharing model and platform.  One of the singular advancements over the past several years in the Semantic Web domain has been the explosion of data available in semantic formats.  Unfortunately, Semantic Web data processing technologies are deployed on a single (or a small number of) machine(s) at a time.  This is fine when data is small, but current methodologies create horrible data processing and analysis bottlenecks.  These scalability constraints are the biggest barriers in achieving the fundamentally web-scale Semantic Web vision of Tim Berners-Lee.  More importantly, these limitations have hindered the broader adoption of Semantic Web technologies.

Tracking Trends with Hadoop and Hive on EC2


At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe

Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website.  This post will give an overview of how trendingtopics.org was put together and show some basic approaches for finding trends in log data with Hive.  The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.

The trendingtopics Rails application identifies recent trends on the web by periodically launching an EC2 cluster running Cloudera’s Distribution for Hadoop to process Wikipedia log files.  The cluster runs a Hive batch job that analyzes hourly pageview statistics for millions of Wikipedia articles, and then loads the resulting trend parameters into the application’s MySQL database.

Running the Cloudera Training VM in VirtualBox

Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on Virtual Box, and has written a step-by-step guide for the community. Thanks Thomas! – Christophe

I was quite pleased when I discovered that Cloudera had created a virtual machine image that could be used while working through their training material. It would make the process simpler, and it looked like a potentially useful environment for general Hadoop experimentation. However, their VM is built for VMware, which I stopped using a while back. However, as a heavy VirtualBox user, I knew that it would not be hard to get it running in my preferred desktop virtualization environment.

Here’s a step-by-step guide for getting Cloudera’s virtual machine image up and running. I’ll include screenshots for most of the steps to make it as clear as possible. I’ll assume you already have at least some familiarity with running VirtualBox (if not, there are plenty of good tutorials and references available online) and some experience with Ubuntu or some other fairly modern Linux desktop system.

Hadoop HA Configuration

One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe

Here at ContextWeb, our Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single-point-of-failure issue that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval were of greater importance. We’ve leveraged some existing, well tested components that are available and commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution, which addresses limitations that currently exist.

While one could conceivably expand the use of these two projects to much deeper levels of protection, the goal of this post is to provide a basic working configuration as a starting point for further experimentation and tuning. There may be variations with regards to what works with your distribution or which requirements your organization has for SLAs and HA standards. These instructions are most relevant to CentOS 5.3 combined with Cloudera’s Distribution for Hadoop, since that’s what we run in our production environment.

Hadoop Environment

Hadoop Graphing with Cacti

An important part of making sure Hadoop works well for all users is developing and maintaining strong relationships with the folks who run Hadoop day in and day out. Edward Capriolo keeps About.com’s Hadoop cluster happy, and we frequently chew the fat with Ed on issues ranging from administrative best practices to monitoring. Ed’s been an invaluable resource as we beta test our distribution and chase down bugs before our official releases. Today’s article looks at some of Ed’s tricks for monitoring Hadoop with Cacti through JMX. -Christophe

You may have already read Philip’s Hadoop Metrics post, which provides a general overview of the Hadoop Metrics system. Here, we’ll examine Hadoop monitoring with Cacti through JMX.

What is Cacti?

Cacti is an RRD front end. You can learn more about it on the Cacti website.

Rackspace Upgrades to Cloudera’s Distribution for Hadoop
Parallel LZO: Splittable Compression for Hadoop


Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe

So at Digg, we have been working our own Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how can we split our large compressed data and run them in parallel on Hadoop? One of the biggest drawbacks from compression algorithms like Gzip is that you can’t split them into multiple mappers. This is where LZO comes in.

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.

The Smart Grid: Hadoop at the Tennessee Valley Authority (TVA)