Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11”?
At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.
The rest of this post discusses why it’s a bad idea to just set all the limits as high as they’ll go, and gives you some pointers to get started on finding a happy medium.
Why can’t you just set all the limits to 1,000,000?
One of the recurring themes we have heard while working with our customers and the community is that Hadoop configuration and deployment is a pain. Oftentimes, Hadoop is the first truly distributed system that administrators encounter, and the problem is made worse by the lack of standardized packages and deployment tools. And some releases are buggy. And upgrades are hard. And the list goes on.
In order for Hadoop to truly disrupt the enterprise, it needs to be just as easy to configure, deploy and manage as any other piece of software.
We’d like to take a step in that direction and share our distribution with the community. We developed our distribution to improve reliability and operations for our support customers, and while they will always be the first to receive updates and hot fixes, the community will never be far behind.
Exciting news: We’re providing our basic Hadoop training for free online. We’ll still host basic courses live, but for folks who can’t make it to the Bay Area, or want to attend a more advanced training course first, we hope this proves useful.
There are 6 lectures, 2 hands-on activities, and 1 tutorial. We provide a virtual machine for the activities and the tutorial so new users can get up and running right away. Topics include:
Hadoop’s NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons all expose runtime metrics. These are handy for monitoring and ad-hoc exploration of the system and provide a goldmine of historical data when debugging.
In this post, I’ll first talk about saving metrics to a file. Then we’ll walk through some of the metrics data. Finally, I’ll show you how to configure sending metrics to other systems and explore them with jconsole.
Dumping metrics to a file
The simplest way to configure Hadoop metrics is to funnel them into a user-configurable file on the machine running the daemon. Metrics are organized into “contexts” (Hadoop currently uses “jvm”, “dfs”, “mapred”, and “rpc”), and each context is independently configured. Set up your conf/hadoop-metrics.properties to use FileContext like so:
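As a rough sketch of what such a FileContext setup might look like, the fragment below configures three of the contexts named above. The output paths under /tmp and the 10-second period are assumptions chosen for illustration, not recommended values; adjust them for your own machines.

```properties
# conf/hadoop-metrics.properties (illustrative sketch)
# Each context is configured on its own: <context>.class picks the
# implementation, <context>.period is the emit interval in seconds,
# and <context>.fileName is where FileContext appends the records.

# HDFS daemon metrics (file path is an assumed example)
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log

# MapReduce daemon metrics (file path is an assumed example)
mapred.class=org.apache.hadoop.metrics.file.FileContext
mapred.period=10
mapred.fileName=/tmp/mrmetrics.log

# JVM metrics for each daemon (file path is an assumed example)
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.period=10
jvm.fileName=/tmp/jvmmetrics.log
```

The daemons read this file at startup, so a restart is typically needed before new settings take effect. The “rpc” context can presumably be configured with the same three keys.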
Hadoop’s strength is that it enables ad-hoc analysis of unstructured or semi-structured data. Relational databases, by contrast, allow for fast queries of very structured data sources. A point of frustration has been the inability to easily query both of these sources at the same time. The DBInputFormat component provided in Hadoop 0.19 finally allows easy import and export of data between Hadoop and many relational databases, allowing relational data to be more easily incorporated into your data processing pipeline.
This blog post explains how the DBInputFormat works and provides an example of using DBInputFormat to import data into HDFS.
DBInputFormat and JDBC
First we’ll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. Links to popular drivers are listed in the resources section at the end of this post.
Hadoop was created by