Hadoop and the Cloudera Data Platform.
When we announced Cloudera’s Distribution for Hadoop last month, we asked the community to give us feedback on what features they liked best and what new development was most important to them. Almost immediately, Debian and Ubuntu packages for Hadoop emerged as the most popular request. A lot of customers prefer Debian derivatives over Red Hat, and installing RPMs on top of Debian, while possible with tools like alien, is a pain to say the least.
After some weeks of development and testing, we are happy to announce the Cloudera APT Repository. APT is the standard package distribution mechanism for Ubuntu and Debian, and by simply pointing your machines at our repository, you can have Hadoop installed within minutes.
Our Debian packages are comprised of the same components as our RPM based distribution, including:
Today I did a web search for “pig training” using my favorite search engine. I was wildly entertained by the results, and have embedded my favorite for your viewing pleasure.
However, when I stopped laughing, I realized that this probably isn’t what most people reading this blog would have hoped to find. To that end, I am happy to announce that Cloudera’s Online Hadoop Training now includes two sessions on Apache Pig.
Welcome to the first guest post on the Cloudera blog. The other day, we saw Toby from Swingly tweeting about using Hadoop to process millions of other tweeters’ tweets. We were curious, and Toby put together a great writeup about how they use Hadoop to crunch data. We have a few other guest posts in the pipeline, but if you are doing something really fun with Hadoop and want to share, we’d love to hear from you. Get in touch! -Christophe
How can you run hundreds of memory intensive annotation tasks across billions of web documents to build a sweet semantic search engine before the sun goes nova? Use Hadoop.
Now the slightly longer version: I’m a software engineer working for a brand new start-up called Swingly. We are based in Northern Dallas and are working on creating a new semantic search engine that given a question can find answers on the open web (that includes both tons of standard web documents as well as large structured data sources!)
Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the river Thames. We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are.
There were around 55 attendees from a variety of organisations, both academic and professional. Tom White and I were there representing Cloudera, and there were attendees from Microsoft, HP, the Apache Software Foundation and the incredibly fashionable guys from Last.fm.
The slides and talks have been made available by the organisers here – they’re well worth checking out if you want to get a cross-section of some current activity around Hadoop. I’ve written up some notes on the talks which you can find below.
Practical MapReduce
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Hadoop jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully-configured example.) However, when you want to dig deeper to explore—and modify—Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated ant build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of its unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.
In the process of working on a few things here I wanted to add some links to launch Hive and the Hadoop Jobtracker. At first I considered just adding the links but I found myself wanting a button of some sort; an icon for them. I didn’t want to just use the (awesomely cute) Hadoop logo elephant because these things are related to and part of Hadoop, but they aren’t Hadoop itself… What to do?
Well, I grabbed Illustrator and spent a bit of time putting together these icons. What do you think? We’ve opened up a ticket with the Hadoop project to contribute these to the project.
A few weeks ago we announced Cloudera’s Distribution for Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.
Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts. RPMs are the standard way of installing software on a Red Hat Linux distribution (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install, and they install libraries, binaries, init scripts, log files, man pages, and configuration files in places where Linux users expect them, typically /usr/lib, /usr/bin, /etc/init.d, /var/log, /usr/share/man, and /etc, respectively. RPMs are also very easy to uninstall and upgrade.
Init scripts are the standard way to start, stop, and restart daemon processes on a Linux system. They allow sysadmins to start and stop daemons with the /sbin/service script, and they use a standard parameter interface, namely start, stop, or restart (e.g., sudo /sbin/service hadoop-datanode start). Init scripts also make sure that the daemon runs as the correct user, which in Hadoop’s case is the hadoop user. Lastly, init scripts are used to start daemons at boot time, allowing daemons to survive reboots.
As Hadoop clusters grow in size and data volume, it becomes more and more useful to share them between multiple users and to isolate these users. If User 1 is running a ten-hour machine learning job for example, this should not impair a User 2 from running a 2-minute Hive query. In November, I blogged about how Hadoop 0.19 supports pluggable job schedulers, and how we worked with Facebook to implement a Fair Scheduler for Hadoop using this new functionality. The Fair Scheduler gives each user a configurable share of the cluster when he/she has running jobs, but assigns these resources to other users when the user is inactive. Since last fall, the Fair Scheduler has been picked up by Hadoop users outside Facebook, including the Google/IBM academic Hadoop cluster. It’s also received extensive testing and patches from Yahoo!. Furthermore, we’ve included the Fair Scheduler in Cloudera’s Distribution for Hadoop, where it is integrated right into the JobTracker management UI. Through production experiences, testing, and feedback from users, we’ve made a lot of improvements to the Fair Scheduler, some of which are available now and others which will come out in the next major version, which I’m calling “Fair Scheduler 2.0″. Here is a summary of the upcoming functionality:
- Fair sharing has changed from giving equal shares to each job to giving equal shares to each user. This means that users that submitted many jobs don’t get an advantage over users running a few jobs. It’s also possible to give different weights to different users.
- The fair scheduler now supports killing tasks from other users’ jobs if they are not giving them up. For each pool (by default there is one pool per user, but one can also have specially named pools), there’s a configurable timeout after which it can kill other jobs’ tasks to start running. This means that it’s possible to provide “service guarantees” for production jobs that are sharing a cluster with experimental queries.
- The scheduler can now assign multiple tasks per heartbeat, which is important for maintaining high utilization in large clusters.
- A technique called delay scheduling increases data locality for small jobs, improving performance in a data warehouse workload with many small jobs such as Facebook’s.
- The internal logic has been simplified so that the scheduler can support different scheduling policies within each pool, and in particular we plan to support FIFO pools. Many users have requested FIFO pools because they want to be able to queue up batch workflows on the same cluster that’s running more interactive jobs.
- Many bug fixes and performance improvements were contributed or suggested by a team stress-testing the scheduler at Yahoo!.
- The same team has also contributed Forrest web-based documentation for the fair scheduler (to be available in Hadoop 0.20).
As a grad student and the original developer of the Fair Scheduler, I’ve had a great experience interacting with the Hadoop community to improve the scheduler. The fact that production experience at Facebook, large-scale testing at Yahoo!, and wishes from other users are being combined into this single piece of software is a testament to the strength of Hadoop’s open-source model. The next release of the Fair Scheduler (likely in Hadoop 0.21, although we will also release back-ports to older Hadoop versions) will make it easier to manage multi-user clusters, give FIFO scheduling to users who desire it, improve performance and reduce the need for manual intervention with misbehaving jobs. You can also be sure that we’ll continue supporting the scheduler in Cloudera’s Distribution for Hadoop.

Hadoop was created by