Hadoop and the Cloudera Data Platform.
When Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users. The benefits of sharing are tremendous: with all the data in one place, users can run queries that they may never have been able to execute otherwise, and costs go down because system utilization is higher than building a separate Hadoop cluster for each group. However, sharing requires support from the Hadoop job scheduler to provide guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users.
This July, the scheduler in Hadoop became a pluggable component and opened the door for innovation in this space. The result was two schedulers for multi-user workloads: the Fair Scheduler, developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
We’re happy to announce a new tool we have been developing here at Cloudera: Hadoop Development Status. Hadoop Development Status aims to help the Hadoop community understand its direction, health, and participants. The project currently monitors the most active contributors according to mailing list traffic, the most watched JIRA tickets, and aggregate traffic volumes on the Hadoop mailing lists.
The graph of messages per month on the Hadoop Core lists shows a sustained growth in traffic. During this time, new sub-projects have been added to the Hadoop Top Level Project (HBase, ZooKeeper, Pig, Hive), but we haven’t created graphs for them yet. In fact, HBase was in core as a contrib module until February this year, when it became a sub-project. The growth in traffic on the core lists makes them difficult to follow, and this is one of the reasons for the planned partitioning of Core into Core, HDFS and MapReduce sub-projects, and the promotion of Hive into a sub-project.
Contributions to Hadoop take many forms, including writing code, answering the questions of other users, and creating documentation. The number of messages sent to the mailing list is just one measure of how active a contributor is: you can see such a graph here.
It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back.
How is the DistributedCache accessed by a MapReduce job?
As promised in my post about installing Scribe for log collection, I’m going to cover how to configure and use Scribe for the purpose of collecting Hadoop logs. In this post I’ll describe how to create the Scribe Thrift client for use in Java, add a new log4j Appender to Hadoop, configure Scribe, and collect logs from each node in a Hadoop cluster. At the end of the post, I will link to all source and configuration files mentioned in this guide.
What’s the advantage of collecting Hadoop’s logs?
Collecting Hadoop’s logs in to one single log file is very convenient for debugging and monitoring purposes. Essentially, what Scribe lets you do is tail -f your entire cluster, which is really cool to see
. Having one single log file on one, or even just a few, machines makes log analysis insanely simpler. Making log analysis simpler means the creation of monitoring and reporting tools just got a lot easier as well. Let’s get started …
Create the Scribe Thrift Client for Java
In order to use Scribe with Java, you need to use Thrift to create a Scribe client stub, which is a collection of generated Java files to be used by our custom log4j Appender, which will be covered later. For now, cd in to your Scribe distribution directory (I used trunk), and then cd in to the ‘if’ directory. Now you’re in $SCRIBE_DIST/if. Edit scribe.thrift by changing the ‘include’ line that includes fb303.thrift to the following:
Hadoop was created by