Hadoop and the Cloudera Data Platform.
An important part of making sure Hadoop works well for all users is developing and maintaining strong relationships with the folks who run Hadoop day in and day out. Edward Capriolo keeps About.com’s Hadoop cluster happy, and we frequently chew the fat with Ed on issues ranging from administrative best practices to monitoring. Ed’s been an invaluable resource as we beta test our distribution and chase down bugs before our official releases. Today’s article looks at some of Ed’s tricks for monitoring Hadoop with Cacti through JMX. -Christophe
You may have already read Philip’s Hadoop Metrics post, which provides a general overview of the Hadoop Metrics system. Here, we’ll examine Hadoop monitoring with Cacti through JMX.
What is Cacti?
Cacti is an RRD front end. You can learn more about it on the Cacti website.
A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.
In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.
There are many software packages that can do this kind of analysis automatically for you on average-sized log files, of course. However, many organizations log so much data and require such custom analytics that these ordinary approaches cease to work. Hadoop provides a reliable method for scaling storage and computation; PigLatin provides an expressive and flexible language for data analysis.
In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Hadoop jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully-configured example.) However, when you want to dig deeper to explore—and modify—Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated ant build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of its unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.
We’ve been talking to enterprise users of Hadoop about existing and new projects, and lots of them are asking questions about reliability and data integrity. So we wrote up a short paper entitled HDFS Reliability to summarize the state of the art and provide advice. We’d like to get your feedback, too, so please leave a comment.
As promised in my post about installing Scribe for log collection, I’m going to cover how to configure and use Scribe for the purpose of collecting Hadoop logs. In this post I’ll describe how to create the Scribe Thrift client for use in Java, add a new log4j Appender to Hadoop, configure Scribe, and collect logs from each node in a Hadoop cluster. At the end of the post, I will link to all source and configuration files mentioned in this guide.
What’s the advantage of collecting Hadoop’s logs?
Collecting Hadoop’s logs in to one single log file is very convenient for debugging and monitoring purposes. Essentially, what Scribe lets you do is tail -f your entire cluster, which is really cool to see
. Having one single log file on one, or even just a few, machines makes log analysis insanely simpler. Making log analysis simpler means the creation of monitoring and reporting tools just got a lot easier as well. Let’s get started …
Create the Scribe Thrift Client for Java
In order to use Scribe with Java, you need to use Thrift to create a Scribe client stub, which is a collection of generated Java files to be used by our custom log4j Appender, which will be covered later. For now, cd in to your Scribe distribution directory (I used trunk), and then cd in to the ‘if’ directory. Now you’re in $SCRIBE_DIST/if. Edit scribe.thrift by changing the ‘include’ line that includes fb303.thrift to the following:
Scribe is a newly released log collection tool that dumps log files from various nodes in a cluster to Scribe servers, where the logs are stored for further use. Facebook describes their usage of Scribe by saying, “[Scribe] runs on thousands of machines and reliably delivers tens of billions of messages a day.” It turns out that Scribe is rather difficult to install, so the hope of this post is to help those of you attempting to install Scribe. The first step is to get dependencies installed.
Dependencies
Scribe has many dependencies that must be installed in order for Scribe to be built properly. They are listed here:
Hadoop was created by