Cloudera Hadoop & Big Data Blog
http://www.cloudera.com/blog

Debugging MapReduce Programs With MRUnit
http://www.cloudera.com/blog/2009/07/03/debugging-mapreduce-programs-with-mrunit/
Fri, 03 Jul 2009 17:26:31 +0000, by aaron

The distributed nature of MapReduce programs makes debugging a challenge. Attaching a debugger to a remote process is cumbersome, and the lack of a single console makes it difficult to inspect what is occurring when several distributed copies of a mapper or reducer are running concurrently. Furthermore, operations that work on small amounts of input (e.g., saving the inputs to a reducer in an array) fail when running at scale, causing out-of-memory exceptions or other unintended effects.

A full discussion of how to debug MapReduce programs is beyond the scope of a single blog post, but I’d like to introduce you to a tool we designed at Cloudera to assist you with MapReduce debugging: MRUnit.

MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.

While this doesn’t solve the problem of distributed debugging, many common bugs in MapReduce programs can be caught and debugged locally. For this purpose, developers often try to use JUnit to test their MapReduce programs. The current state of the art often involves writing a set of tests that each create a JobConf object, which is configured to use a mapper and reducer, and then set to use the LocalJobRunner (via JobConf.set("mapred.job.tracker", "local")). A MapReduce job will then run in a single thread, reading its input from test files stored on the local filesystem and writing its output to another local directory.

This process provides a solid mechanism for end-to-end testing, but has several drawbacks. Developing new tests requires adding test inputs to files that are stored alongside one’s program. Validating correct output also requires filesystem access and parsing of the emitted data files. This involves writing a great deal of test harness code, which itself may contain subtle bugs. Finally, this process is slow. Each test requires several seconds to run. Users often find themselves aggregating several unrelated inputs into a single test (violating a unit testing principle of isolating unrelated tests) or performing less exhaustive testing due to the high barriers to test authorship.

The easiest way to test MapReduce programs is to include as little Hadoop-specific code as possible in one’s application. Parsers can operate on instances of String instead of Text, and mappers should instantiate instances of MySpecificParser to tokenize input data rather than embed parsing code in the body of MyMapper.map(). Your MySpecificParser implementation can then be tested with ordinary JUnit tests. Another class or method could then be used to perform processing on parsed lines.
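The separation described above can be pictured with a minimal sketch. It is written in Python rather than Java purely for brevity, and every name in it (Record, the tab-separated record layout, my_map) is hypothetical and not taken from any real job. The only point is that the parser takes and returns plain values, so it can be tested without any framework types.

```python
from typing import Iterator, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    """Hypothetical parsed form of one tab-separated input line."""
    user: str
    bytes_sent: int


class MySpecificParser:
    """Framework-free parser: plain str in, plain Record (or None) out."""

    def parse(self, line: str) -> Optional[Record]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not fields[1].isdigit():
            return None  # malformed line; the caller decides what to do
        return Record(user=fields[0], bytes_sent=int(fields[1]))


def my_map(key: int, value: str, parser: MySpecificParser) -> Iterator[Tuple[str, int]]:
    """Stand-in for MyMapper.map(): delegates parsing, then emits (key, value)."""
    record = parser.parse(value)
    if record is not None:
        yield record.user, record.bytes_sent
```

Because MySpecificParser knows nothing about the surrounding framework, its edge cases (empty lines, extra fields, non-numeric values) can each get their own small test.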

But even with those components separately tested, your map() and reduce() calls should still be tested individually, as the composition of separate classes may cause unintended bugs to surface. MRUnit provides test drivers that accept programmatically specified inputs and outputs, which validate the correct behavior of mappers and reducers in isolation, as well as when composed in a MapReduce job. For instance, the following code checks whether the IdentityMapper emits the same (key, value) pair as output that it receives as input:

import junit.framework.TestCase;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestExample extends TestCase {

  private Mapper<Text, Text, Text, Text> mapper;
  private MapDriver<Text, Text, Text, Text> driver;

  @Before
  public void setUp() {
    // Wrap the mapper under test in a driver that supplies a mock OutputCollector.
    mapper = new IdentityMapper<Text, Text>();
    driver = new MapDriver<Text, Text, Text, Text>(mapper);
  }

  @Test
  public void testIdentityMapper() {
    // One input record in; the same record is expected out.
    driver.withInput(new Text("foo"), new Text("bar"))
            .withOutput(new Text("foo"), new Text("bar"))
            .runTest();
  }
}

The MapDriver orchestrates the test process, feeding the input record (“foo”, “bar”) to the IdentityMapper when its runTest() method is called. It also passes a mock OutputCollector implementation to the mapper. The driver then validates the output received by the OutputCollector against the expected output record (“foo”, “bar”). If the actual and expected outputs do not match, a JUnit assertion failure is raised, informing the developer of the error. More test drivers exist for testing individual reducers, as well as mapper/reducer compositions.
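To make the mechanism concrete, here is a small Python sketch of what such a driver does internally: it collects whatever the mapper emits and compares it with the expected records. This is an illustration of the idea only. SketchMapDriver and its method names are hypothetical and are not the MRUnit API.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Collect = Callable[[str, str], None]
# In this sketch a mapper is any callable taking (key, value, collect).
Mapper = Callable[[str, str, Collect], None]


def identity_mapper(key: str, value: str, collect: Collect) -> None:
    """Emits its input unchanged, in the spirit of IdentityMapper."""
    collect(key, value)


class SketchMapDriver:
    """Illustrative stand-in for a map test driver (not the MRUnit API)."""

    def __init__(self, mapper: Mapper) -> None:
        self.mapper = mapper
        self.inputs: List[Pair] = []
        self.expected: List[Pair] = []

    def with_input(self, key: str, value: str) -> "SketchMapDriver":
        self.inputs.append((key, value))
        return self

    def with_output(self, key: str, value: str) -> "SketchMapDriver":
        self.expected.append((key, value))
        return self

    def run_test(self) -> None:
        collected: List[Pair] = []  # plays the role of the mock OutputCollector
        for key, value in self.inputs:
            self.mapper(key, value, lambda k, v: collected.append((k, v)))
        assert collected == self.expected, (
            f"expected {self.expected!r}, got {collected!r}"
        )
```

A test then reads much like the Java example: SketchMapDriver(identity_mapper).with_input("foo", "bar").with_output("foo", "bar").run_test().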

End-to-end tests involving JobConf configuration code, InputFormat and OutputFormat implementations, filesystem access, and larger scale testing are still necessary. But many errors can be quickly identified with small tests involving a single, well-chosen input record, and a suite of regression tests allows correct behavior to be assured in the face of ongoing changes to your data processing pipeline. We hope MRUnit helps your organization test code, find bugs, and improve its use of Hadoop and big data by facilitating faster and more thorough test cycles.

MRUnit is open source and is included in Cloudera’s Distribution for Hadoop. For more information about MRUnit, including where to get it and how to use its API, see the MRUnit documentation page.

Rackspace Upgrades to Cloudera’s Distribution for Hadoop
http://www.cloudera.com/blog/2009/06/30/rackspace-upgrades-to-clouderas-distribution-for-hadoop/
Tue, 30 Jun 2009 16:17:55 +0000, by Christophe Bisciglia

Hadoop moves fast. Users often find that they need to upgrade after just a few months. Upgrading can be a daunting task, especially if you are several versions behind. We’ve been working with Rackspace for a while now, and they recently embarked on an upgrade from Hadoop 0.15.3 to Cloudera’s Distribution for Hadoop based on 0.18.3. Stu Hood, Search Team Technical Lead at Rackspace, was kind enough to document their experience, and we’re happy to share it with you here. -Christophe

Upgrading to the Cloudera Distribution

Hadoop plays an integral part in the email analytics performed at Rackspace Email and Apps, and our installation of Apache Hadoop 0.15.3 ran smoothly for 18 months after we deployed it in January 2008. By the time we decided to upgrade to Cloudera’s Distribution for Hadoop in June 2009, our production cluster had performed almost 600,000 MapReduce jobs.

In the past, we have deployed Hadoop along with our primary MapReduce application by checking the entire Hadoop distribution and our configuration into version control. Deploying a new slave for the cluster involved running custom scripts to create users, directories, and install dependencies.

There were a few important reasons to upgrade a cluster as trusty as ours to Cloudera’s Distribution for Hadoop (version 0.18.3):

  • Hadoop improves rapidly (since version 0.15.3 was released, over 1500 JIRA issues have been resolved).
  • The Cloudera Distribution contains backported patches that are considered stable, but have not been applied to previous versions by the Apache project, such as the FairScheduler. Some of these patches fix critical bugs, add new features, or improve performance.
  • Cloudera’s configuration RPMs maintain the optimal settings for the installed version of Hadoop. Tweaking these settings manually would involve far more research than we can afford.
  • Standardizing on a Red Hat deployment infrastructure like RPM and YUM makes it much easier to track the latest stable version of Hadoop.

Steps

Configure Hadoop

In order to take advantage of Cloudera’s recommended configuration values, we decided to use Cloudera’s Configurator for Hadoop to generate the configuration that we would be using on the upgraded cluster.

We started by following the steps at https://my.cloudera.com/, using parameters that matched our current configuration. Since we were upgrading an existing cluster, it was important that the data directories matched up in our new configuration. The following table maps the entries made in the GUI to the corresponding properties in the generated configuration files:

Configurator step   GUI entry                               Configuration property
Step 2              NameNode Metadata Path(s)               dfs.name.dir
Step 3              Secondary NameNode Metadata Path(s)     fs.checkpoint.dir
Step 5              TaskTracker Intermediate Data Path(s)   mapred.local.dir
Step 5              HDFS Data Path(s)                       dfs.data.dir

Note that the configurator does not support the type of variable expansion that Hadoop’s configuration files sometimes do. One such example is ${hadoop.tmp.dir} expanding to the Hadoop temporary directory.

If one of your previous configuration values used variable expansion for ${username}, you would need to replace ${username} with the name of the user that you had previously used to run the Hadoop daemons. In our case, we needed to replace instances of ${username} in the dfs.data.dir and dfs.name.dir values with user “hadoopuser.”

When we reached the end of the configurator, we downloaded the generated hadoop-site.xml* files and the Cloudera Repository RPM, and then recorded our repository ID. To double check that our data directories were configured properly, we compared the values (from the table above) in the new hadoop-site.xml* files against our previous configuration. If you see any mismatches at this step, you will probably want to restart the configurator until the resulting files are consistent.
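The double check described above can also be scripted. The following Python sketch is one possible way to do it, not something from the original upgrade: it reads the name/value pairs out of two site files and reports any data-directory property whose value differs, after expanding ${username} in the old values. The function names are hypothetical, and a property that is absent from both files is treated as matching.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Properties from the table above whose values must carry over unchanged.
DATA_DIR_KEYS = ("dfs.name.dir", "fs.checkpoint.dir", "mapred.local.dir", "dfs.data.dir")


def read_properties(xml_text: str) -> Dict[str, str]:
    """Return {name: value} for each <property> element in a site file."""
    root = ET.fromstring(xml_text)
    props: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_mismatches(old: Dict[str, str], new: Dict[str, str],
                    username: str) -> Dict[str, Tuple[str, str]]:
    """Compare data-directory settings, expanding ${username} in the old values."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for key in DATA_DIR_KEYS:
        old_value = old.get(key, "").replace("${username}", username)
        new_value = new.get(key, "")
        if old_value != new_value:
            mismatches[key] = (old_value, new_value)
    return mismatches
```

In practice the two XML strings would come from reading the previous and the newly generated configuration files. An empty result means the directories line up.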

Upgrade

At this point, it was time to jump into the upgrade. We installed the Configurator RPM, which we had downloaded earlier, on all machines in our cluster by walking through the steps in the config guide. After listing the available configuration packages with yum search hadoop-conf, we installed the matching packages for each class of machine in the cluster using yum install $packagename. The new version of Hadoop was now installed, but not running.

In order to swap out the running version of Hadoop, and create a backup of the current filesystem, we needed to follow the steps leading up to the “Install New Version” step from the Hadoop Wiki upgrade page. After walking through those preparation instructions and successfully shutting down the cluster, it was time to make the switch.

Cloudera’s Distribution for Hadoop creates a “hadoop” user, and this user runs all of the necessary services and daemons for the cluster. If your cluster had previously been running under a different username (ours was running as “hadoopuser”), you will need to give the new user ownership of several directories. We ran…

# chown -R hadoop $directory

…for each of the following configured directories:

  • dfs.data.dir
  • dfs.name.dir
  • fs.checkpoint.dir
  • mapred.local.dir
  • hadoop.tmp.dir
  • /var/log/hadoop (the log directory, which appears to be hardcoded)
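With many directories, it can help to generate the chown commands from the configuration instead of typing them by hand. The Python sketch below only builds the command strings and does not run anything. The function name is hypothetical, and it assumes that a property may hold a comma-separated list of paths.

```python
import shlex
from typing import Dict, List

OWNED_KEYS = ("dfs.data.dir", "dfs.name.dir", "fs.checkpoint.dir",
              "mapred.local.dir", "hadoop.tmp.dir")


def chown_commands(props: Dict[str, str], user: str = "hadoop",
                   log_dir: str = "/var/log/hadoop") -> List[str]:
    """Build one `chown -R` command per configured directory (does not run them)."""
    directories: List[str] = []
    for key in OWNED_KEYS:
        # Assumption: a property value may list several paths separated by commas.
        for path in props.get(key, "").split(","):
            path = path.strip()
            if path and path not in directories:
                directories.append(path)
    directories.append(log_dir)
    return [f"chown -R {shlex.quote(user)} {shlex.quote(d)}" for d in directories]
```

Printing the commands first and reviewing them before running them as root is a cheap safeguard against a mistyped path.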

Once the “hadoop” user had access to the necessary directories, we were ready to upgrade the Namenode. We ran the following command from our Namenode machine, so that the process would start in the background and begin upgrading its checkpoint:

$ sudo -u hadoop /usr/lib/hadoop/bin/hadoop-daemon.sh --config "/etc/hadoop/conf" start namenode -upgrade

We watched the “Upgrades” section of the DFS status page at http://$namenode:50070/ while waiting for the Namenode upgrade to complete, and then we started up the remaining Hadoop services on their respective machines using the instructions from the “Managing Hadoop Services” section of the config guide.

Finalize

Code Changes

Once our cluster was upgraded, we needed to port our Hadoop jobs to the Hadoop 0.18.3 API. There were actually only minor changes in the MapReduce and FileSystem APIs between 0.15.3 and 0.18.3:

  • Our OutputFormats needed to extend FileOutputFormat, rather than OutputFormatBase.
  • FileSystem.listPaths() was removed, in favor of .globPaths().

Finalizing the Upgrade

After verifying that our newly updated jobs were running correctly against the cluster, we were ready to make the changes permanent. The dfsadmin -finalizeUpgrade command runs in the background and cleans up the outdated copies of blocks left behind by the upgrade, freeing disk space.

$ sudo -u hadoop hadoop dfsadmin -finalizeUpgrade

Future

Now that we’ve upgraded to the Cloudera distribution using the configurator, it will be much easier to stay at the bleeding edge of Hadoop development (or the cutting edge, if we choose stability over features). We can also add the Cloudera repository RPM to our base server image and add a single command to pull down the entire distribution from Yum. Finally, we can conveniently install the packages for Pig and Hive to give our developers more options for their processing jobs.

Parallel LZO: Splittable Compression for Hadoop
http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
Wed, 24 Jun 2009 16:52:37 +0000, by Christophe Bisciglia


Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe

At Digg, we have been running our own Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how to split our large compressed files so that they can be processed in parallel on Hadoop. One of the biggest drawbacks of compression formats like Gzip is that a compressed file can’t be split across multiple mappers. This is where LZO comes in.

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.

The LZO library implements a number of algorithms with the following features:

  • Compression is comparable in speed to deflate compression.
  • On modern architectures, decompression is very fast; in non-trivial cases able to exceed the speed of a straight memory-to-memory copy due to the reduced memory-reads.
  • Requires an additional buffer during compression (of size 8 kB or 64 kB, depending on compression level).
  • Requires no additional memory for decompression other than the source and destination buffers.
  • Allows the user to adjust the balance between compression quality and compression speed, without affecting the speed of decompression.

This is great until you actually try to get LZO working on Hadoop. First off, it gets really confusing because LZO has been removed from Hadoop 0.20+ due to GPL restrictions.

I first came across a blog post by Johan Oskarsson that discussed this. Unfortunately, when you dive into HADOOP-4640 you find that the patch is against 0.20, while Cloudera’s distribution uses a modified version of 0.18.3. The patch from HADOOP-4640 applies fairly cleanly, apart from a few things. On top of this, you need HADOOP-2664, which enables the LZOP codec. You need this because the compressor on most Linux systems is `lzop`, and that differs from the traditional LzoCodec bundled in 0.18.

So how do we get all of this working? First off grab both modified patches from my Github account.

Once you have those, apply the patches to your Cloudera distribution. Then be sure to rebuild. After that’s done and you have redeployed to your clients and production cluster you need to modify your hadoop-site.xml on the client side.

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzopCodec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>

Once that is completed, go ahead and upload your large LZO file to your Hadoop cluster.

So let’s say you uploaded the file:

$ hadoop fs -put large_file.lzo /tmp/large_file.lzo

The next step is to index your LZO file, so that Hadoop knows how to split the file across multiple mappers.
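Conceptually, an index makes a compressed file splittable by recording where each compressed block starts, so that a reader handed an arbitrary byte range can move to the nearest block boundary. The Python sketch below illustrates only that idea. The offsets, the function name, and the boundary rule are assumptions made for illustration, and do not describe the actual index file format or the input format's real logic.

```python
from bisect import bisect_left
from typing import List, Tuple


def align_split(block_offsets: List[int], start: int, end: int,
                file_size: int) -> Tuple[int, int]:
    """Snap a raw byte range to the first block boundary at or after each end.

    block_offsets must be sorted; it lists the byte offset at which each
    compressed block begins (hypothetical values in this sketch).
    """
    def next_boundary(pos: int) -> int:
        i = bisect_left(block_offsets, pos)
        return block_offsets[i] if i < len(block_offsets) else file_size

    return next_boundary(start), next_boundary(end)
```

With block starts at 0, 100, 250 and 400 in a 500-byte file, the raw ranges [0, 200) and [200, 500) become [0, 250) and [250, 500): every block is read exactly once, and no reader has to begin in the middle of a block.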

The Indexer.jar in my Github account is used for this process. Run Indexer.jar and tell it which file to generate an index for.

$ hadoop jar Indexer.jar /tmp/large_file.lzo

After that’s completed, you’re almost there! The index file will be created in /tmp. Now all you need to do is run a map/reduce job and you’re set! Don’t forget to set the -inputformat parameter. Here is a code snippet using the wordcount example:

#!/bin/sh
HADOOP_HOME=/usr/lib/hadoop
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-7-streaming.jar \
-input /tmp/large_file.lzo \
-output wc_test \
-inputformat org.apache.hadoop.mapred.LzoTextInputFormat \
-mapper 'cat' \
-reducer 'wc -l'
A Great Week for Hadoop: Summit Roundup
http://www.cloudera.com/blog/2009/06/22/a-great-week-for-hadoop-summit-west-roundup/
Mon, 22 Jun 2009 15:43:19 +0000, by Christophe Bisciglia

On June 10th, more than 750 people from around the world descended on the Santa Clara Marriott to share their love for a little stuffed elephant named Hadoop. It was a good week to be part of this exploding community, and I want to extend Cloudera’s heartfelt thanks to everyone who made it possible, especially our friends at Yahoo! who organized this Summit. Most importantly, I want to thank all of you who were able to participate. I know many of you couldn’t make it to California this time, so I hope to see you at the Hadoop Summit East in October.

For those of you who couldn’t join us, I thought I would post my notes on a few of the highlights.

Hadoop Goes Mainstream:
About 300 developers attended last year’s summit, primarily from web companies and research labs. They were joined by a few forward-thinking venture capitalists. This year’s audience was both larger and different. In addition to the vibrant developer community, there was a flood of users of Hadoop. Though the audience was still dominated by web companies, attendees included traditional enterprise users with applications ranging from finance to biotech. There were technology previews from IBM and Sun. Major companies like Amazon joined our commercial efforts around Hadoop. VCs had also stepped up to sponsor status. Take-away? You ain’t seen nothing yet.

Hadoop In Print:
Yahoo! Developer Network gave away 500 copies of Tom White’s book, “Hadoop: The Definitive Guide,” published by O’Reilly. If you missed your copy, I’ve heard that when they aren’t busy developing AWS, Amazon has been known to sell a few books here and there.

Cloudera Presentation Slides:
Several Cloudera employees spoke at the Summit, and we have posted slides from those talks on the Hadoop Wiki. If you spoke, please put your slides up as well. Here are direct links to the Cloudera talks:

Cloudera Announces New Distribution Features:
We see an increasing number of users moving data between Hadoop and more traditional database products, and more and more usage moving to the cloud - especially Amazon. To that end, we’ve released two new features, and a collection of new packages, that make Hadoop easier to use.

  • Sqoop: Database Import for Hadoop. Brainchild of Aaron Kimball, Sqoop is an extensible command-line tool that copies data from a relational database into Hadoop. Sqoop uses JDBC to inspect the database schema, and automatically generates all of the code necessary to move the data. It can import data from any database over JDBC, and includes an extension to allow better performance in MySQL by using the mysqldump command.
  • EBS Integration for Hadoop on AWS: Tom White had a busy month. Besides finishing his book, he spent some time thinking about how Hadoop runs on Amazon Web Services, and came up with new code to make that better. Hadoop clusters on EC2 have always needed to copy data from S3 when they started up, and write results back to S3 before they powered down. While Amazon’s Elastic MapReduce makes this round-trip much easier operationally, EMR doesn’t support tools like Pig and Hive. Using Tom’s work, Cloudera is able to store data blocks on EBS volumes, and to connect them to EC2 nodes running Hadoop as needed. This delivers better throughput and more disks per node at lower cost, since EBS is cheaper than S3. Since no copies are required at startup and shutdown, your EC2 instances run for less time, saving CPU costs. Best of all, these changes to Hadoop work with Hive, Pig, Sqoop, and the rest of the Hadoop family. You can now load data, run jobs in your favorite language, turn your cluster off, and pick up exactly where you left off later. All your data survives.
  • Preview Release of 0.20 Packages: Matt Massie and Todd Lipcon doubled down to get our testing release packaged so that those of you who crave the bleeding edge can start experimenting with version 0.20 of Hadoop today. Over the next few weeks, we’ll be bringing in changes from other leading Hadoop developers, upgrading our customers, and releasing stable packages to the community.

Hadoop Developer Offsite
With so many Hadoop developers in the Bay Area, we decided to invite the Hadoop committers and some active developers to Cloudera’s offices. We wanted to collaborate without the assistance of email lists, JIRA, hudson, or any other technology designed to make our lives easier. We used sticky notes to identify issues in parallel, identified consensus with clusters, and broke off into smaller teams to explore solutions. Out of this, we identified five things we love and hate about Hadoop, the biggest upcoming challenges for the project and a wish list for the future. We broke into sub-projects to make concrete plans to address these issues, and we posted the meeting notes online. We’ll continue to host such meetings, and to work with other leaders in the development community. Bottom line, as Hadoop grows up, we need to grow with it, and meetings like this are a great way to coordinate development efforts with the needs of the community.

Yahoo! Distribution of Hadoop:
Long known for their leadership in the Hadoop development community, Yahoo! stepped it up again by releasing the source code that they run on their alpha clusters to the community at large. There are some things you can only learn about Hadoop from running at Y!’s scale, and while this is not a stable production distribution, their source-only (available via github) release provides 17 patches slated for inclusion in later versions of Hadoop. Cloudera is working closely with the team at Yahoo! to fold these patches into the next release of our distribution, along with dozens of patches we have developed to support customer workloads, and a half dozen or so from our friends at Amazon to improve performance on AWS. As big players like Yahoo! and Amazon continue to open their development processes, Cloudera can deliver more stable, better tested, and ultimately, more trusted code to our enterprise customers and the community at large in the packages you know and love (RPMs, Debian Packages, AMIs, etc). It’s not always easy for big companies to be open, so we’d like to thank and congratulate everyone involved.

HBase, Wow:
HBase has endured its share of criticism over the last year, but based on last week’s presentation, many of those problems have been addressed. HBase has made incredible strides in terms of reliability, availability and performance. Version 0.20 is the first-ever “performance” release, and is focused on improving random access, scan and insert times. Check out these slides for details. We’re looking at an order of magnitude performance improvement, with random reads on par with traditional RDBMS. The other major improvement involves ZooKeeper integration, and eliminates the single point of failure in the master node. This strengthens the case for including HBase with the Cloudera Distribution for Hadoop. Please let us know if you want HBase support.

In Summary:
We had a great time at the summit – we learned a lot and got to talk to a lot of smart people. We’re looking forward to October’s Hadoop Summit East in New York City!

Analyzing Apache logs with Pig
http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
Wed, 17 Jun 2009 22:22:32 +0000, by dvryaboy

A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.

In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as apt-get install pig or yum install hadoop-pig.

There are many software packages that can do this kind of analysis automatically for you on average-sized log files, of course. However, many organizations log so much data and require such custom analytics that these ordinary approaches cease to work. Hadoop provides a reliable method for scaling storage and computation; PigLatin provides an expressive and flexible language for data analysis.

Our log files are in Apache’s standard CombinedLogFormat. It’s a tad more complicated to parse than tab- or comma-delimited files, so we can’t just use the built-in PigStorage() loader. Luckily, there is already a custom loader in the PiggyBank built specifically for parsing these kinds of logs.

First, we need to get the PiggyBank from Apache. The PiggyBank is a collection of useful add-ons (UDFs) for Pig, contributed by the Pig user community. There are instructions on the Pig website for downloading and compiling the PiggyBank. Note that you will need to make sure to add pig.jar to your CLASSPATH environment variable before running ant.

Now, we can start our PigLatin script by registering the piggybank jarfile and defining references to methods we will be using.

register /home/dvryaboy/src/pig/trunk/piggybank.jar;
DEFINE LogLoader
org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE DayExtractor
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

By the way — the PiggyBank contains another useful loader, called MyRegExLoader, which can be instantiated with any regular expression when you declare it with a DEFINE statement. Useful in a pinch.

While we are working on our script, it may be useful to run in local mode, only reading a small sample data set (a few hundred lines). In production we will want to run on a different file. Moreover, if we like the reports enough to automate them, we may wish to run the report every day, as new logs come in. This means we need to parameterize the source data location. We will also be using a database that maps IPs to geographic locations, and we probably want to parameterize that as well.

%default LOGS 'access_log.small'
%default GEO 'GeoLiteCity.dat'

To specify a different value for a parameter, we can use the -param flag when launching the pig script:

# pig -x mapreduce -f scripts/blogparse.pig -param LOGS='/mirror.cloudera.com/logs/access_log.*'
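The interplay between %default and -param can be pictured as a simple text substitution in which command-line values take precedence over defaults. The Python sketch below models only that precedence rule. It is not Pig's preprocessor, and the function name and error handling are assumptions.

```python
import re
from typing import Dict


def substitute(script: str, defaults: Dict[str, str],
               overrides: Dict[str, str]) -> str:
    """Replace $NAME placeholders; command-line overrides beat default values."""
    params = {**defaults, **overrides}

    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined parameter: {name}")
        return params[name]

    return re.sub(r"\$([A-Za-z_][A-Za-z0-9_]*)", lookup, script)
```

With no overrides, the line LOAD '$LOGS' picks up the sample file from the defaults. Passing an override for LOGS swaps in the production path without editing the script.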

For mapping IPs to geographic locations, we use a third-party database from MaxMind.  This database maps IP ranges to countries, regions, and cities.  Since the data from MaxMind lists IP ranges, and our logs list specific IPs, a regular join won’t work for our purposes. Instead, we will write a simple script that takes a parsed log as input, looks up the geo information using MaxMind’s Perl module, and outputs the log with geo data prepended.
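To see why an equality join does not fit, note that the lookup has to find the range containing an address, not a row equal to it. The Python sketch below shows one common way to do this with a sorted list and binary search. The sample ranges and country codes are made up, and this is not a description of how MaxMind's module works internally.

```python
import ipaddress
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical sample rows: (first IP, last IP, country code), non-overlapping.
RANGES = [
    ("10.0.0.0", "10.0.0.255", "AA"),
    ("10.0.2.0", "10.0.3.255", "BB"),
]


def build_index(rows) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Sort ranges by start address and convert dotted quads to integers."""
    parsed = sorted(
        (int(ipaddress.IPv4Address(lo)), int(ipaddress.IPv4Address(hi)), code)
        for lo, hi, code in rows
    )
    starts = [lo for lo, _, _ in parsed]
    ends_and_codes = [(hi, code) for _, hi, code in parsed]
    return starts, ends_and_codes


def lookup(ip: str, starts: List[int],
           ends_and_codes: List[Tuple[int, str]]) -> Optional[str]:
    """Find the range with the greatest start <= ip, then check its end."""
    value = int(ipaddress.IPv4Address(ip))
    i = bisect_right(starts, value) - 1
    if i >= 0 and value <= ends_and_codes[i][0]:
        return ends_and_codes[i][1]
    return None
```

An address that falls in a gap between ranges, or below the first range, simply yields None, which corresponds to the empty geo fields the streaming script prints for unknown IPs.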

The script itself is simple: it reads in a tuple representing a parsed log record, checks the first field (the IP) against the database, and prints the data back to STDOUT:

#!/usr/bin/env perl
use warnings;
use strict;
use Geo::IP::PurePerl;

my ($path)=shift;
my $gi = Geo::IP::PurePerl->new($path);

while (<>) {
	chomp;
	if (/([^\t]*)\t(.*)/) {
		my ($ip, $rest) = ($1, $2);
		my ($country_code, undef, $country_name, $region, $city)
			= $gi->get_city_record($ip);
		print join("\t", $country_code||'', $country_name||'',
			$region||'', $city||'', $ip, $rest), "\n";
	}
}

Getting this script into Pig is a bit more interesting. The Pig Streaming interface provides us with a simple way to ship scripts that will process data, and cache any necessary objects (such as the GeoLiteCity.dat file we downloaded from MaxMind).  However, when the scripts are shipped, they are simply dropped into the current working directory. It is our responsibility to ensure that all dependencies—such as the Geo::IP::PurePerl module—are satisfied. We could install the module on all the nodes of our cluster; however, this may not be an attractive option. We can ship the module with our script—but in Perl, packages are represented by directories, so just dropping the .pm file into cwd will not be sufficient, and Pig doesn’t let us ship directory hierarchies.  We solve this problem by packing the directory into a tarball, and writing a small Bash script called “ipwrapper.sh” that will set up our Perl environment when invoked:

#!/usr/bin/env bash
tar -xzf geo-pack.tgz
PERL5LIB=$PERL5LIB:$(pwd) ./geostream.pl $1

The geo-pack.tgz tarball simply contains geostream.pl and Geo/IP/PurePerl.pm.

We also want to make the GeoLiteCity.dat file available to all of our nodes. It would be inefficient to simply drop the file in HDFS and reference it directly from every mapper, as this would cause unnecessary network traffic.  Instead, we can instruct Pig to cache a file from HDFS locally, and use the local copy.

We can relate all of the above to Pig in a single instruction:

DEFINE iplookup `ipwrapper.sh $GEO`
	ship ('ipwrapper.sh')
	cache('/home/dvryaboy/tmp/$GEO#$GEO');

We can now write our main Pig script. The objective here is to load the logs, filter out obviously non-human traffic, and using the rest, calculate the distribution of downloads by country and by Apache project.

Load the logs:

logs = LOAD '$LOGS' USING LogLoader as
	(remoteAddr, remoteLogname, user, time, method,
	 uri, proto, status, bytes, referer, userAgent);

Filter out records that represent non-humans (Googlebot and such), aren’t Apache-related, or just check the headers and do not download contents.

logs = FILTER logs BY bytes != '-' AND  uri matches '/apache.*';

-- project just the columns we will need
logs = FOREACH logs GENERATE
	remoteAddr,
	DayExtractor(time) as day, uri, bytes, userAgent;

-- The filtering function is not actually in the PiggyBank.
-- We plan on contributing it soon.
notbots = FILTER logs BY (NOT
	org.apache.pig.piggybank.filtering.IsBotUA(userAgent));

Get country information, group by country code, aggregate.

-- stream through the iplookup command DEFINEd above,
-- so that its ship and cache clauses take effect
with_country = STREAM notbots THROUGH iplookup
		AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_groups = GROUP with_country BY country_code;

geo_uri_group_counts = FOREACH geo_uri_groups GENERATE
		group,
		COUNT(with_country) AS cnt,
		SUM(with_country.bytes) AS total_bytes;

geo_uri_group_counts = ORDER geo_uri_group_counts BY cnt DESC;

STORE geo_uri_group_counts INTO 'by_country.tsv';

The first few rows look like:

Country Hits Bytes
USA 8906 2.0458781232E10
India 3930 1.5742887409E10
China 3628 1.6991798253E10
Mexico 595 1.220121453E9
Colombia 259 5.36596853E8

At this point, the data is small enough to plug into your favorite visualization tools. We wrote a quick-and-dirty Python script to take logarithms and use the Google Chart API to draw this map:

Bytes by Country
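
The charting script itself is not shown in this post. As a rough sketch of the idea, the following Python takes base-10 logarithms of the byte totals, rescales them to 0..100, and builds a map-chart URL. It assumes the map-chart parameters of the Google Chart image API (cht=t, chtm=world, chld, chd, chco) and two-letter ISO country codes; the function names, colors, and country codes are illustrative and do not come from the original script.

```python
import math
from urllib.parse import urlencode


def log_scale(values, top=100.0):
    """Map positive values onto 0..top using their base-10 logarithms."""
    logs = [math.log10(v) for v in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [top for _ in logs]
    return [round(top * (x - lo) / (hi - lo), 1) for x in logs]


def map_chart_url(bytes_by_country):
    """Build a world-map chart URL from {ISO country code: total bytes}."""
    codes = sorted(bytes_by_country)
    scaled = log_scale([bytes_by_country[c] for c in codes])
    params = {
        "cht": "t",                      # map chart (assumed parameter)
        "chtm": "world",                 # region to draw
        "chs": "440x220",                # image size in pixels
        "chld": "".join(codes),          # concatenated country codes
        "chd": "t:" + ",".join(str(s) for s in scaled),
        "chco": "ffffff,ccccff,0000ff",  # default color, then gradient
    }
    return "http://chart.apis.google.com/chart?" + urlencode(params)


# Byte totals from the table above; the ISO codes are our assumption.
example = {
    "US": 2.0458781232e10,
    "IN": 1.5742887409e10,
    "CN": 1.6991798253e10,
    "MX": 1.220121453e9,
    "CO": 5.36596853e8,
}
url = map_chart_url(example)
```

Scaling on the logarithm keeps countries with small totals visible next to the United States, which is why the color intensity on the maps is described as log-scale.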

This is pretty interesting. Let’s do a breakdown by US states.

Note that with the upcoming Pig 0.3 release, you will be able to have multiple stores in the same script, allowing you to re-use the loading and filtering results from earlier steps. With Pig 0.2, this needs to go in a separate script, with all the required DEFINEs, LOADs, etc.

us_only = FILTER with_country BY country_code == 'US';

by_state = GROUP us_only BY state;

by_state_cnt = FOREACH by_state GENERATE
	     group,
	     COUNT(us_only.state) AS cnt,
	     SUM(us_only.bytes) AS total_bytes;

by_state_cnt = ORDER by_state_cnt BY cnt DESC;

store by_state_cnt into 'by_state.tsv';

Theoretically, Apache selects an appropriate server based on the visitor’s location, so our logs should show a heavy skew towards California. Indeed, they do (recall that the intensity of the blue color is based on a log-scale).

Bytes by US State

Now, let’s get a breakdown by project. To get a rough mapping of URI to Project, we simply get the directory name after /apache in the URI. This is somewhat inaccurate, but good for quick prototyping. This time around, we won’t even bother writing a separate script — this is a simple awk job, after all! Using streaming, we can process data the same way we would with basic Unix utilities connected by pipes.

uris = FOREACH notbots GENERATE uri;

-- note that we have to escape the dollar sign for $3,
-- otherwise Pig will attempt to interpret this as a Pig variable.
project_map = STREAM uris
			THROUGH `awk -F '/' '{print \$3;}'` AS (project);

project_groups = GROUP project_map BY project;

project_count = FOREACH project_groups GENERATE
			group,
			COUNT(project_map.project) AS cnt;

project_count = ORDER project_count BY cnt DESC;

STORE project_count INTO 'by_project.tsv';
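
For readers less familiar with awk, here is a small Python sketch of what the streaming step and the following GROUP/COUNT/ORDER compute. With '/' as the field separator, awk's $3 is the third field, which for a URI such as /apache/tomcat/... is the directory after /apache. The function names and sample URIs are hypothetical.

```python
from collections import Counter


def project_of(uri):
    """Return the third '/'-separated field, like awk -F '/' '{print $3}'."""
    fields = uri.split("/")
    # awk prints an empty string when the field does not exist
    return fields[2] if len(fields) > 2 else ""


def count_projects(uris):
    """Count downloads per project, most frequent first."""
    return Counter(project_of(u) for u in uris).most_common()


sample = [
    "/apache/tomcat/tomcat-6/bin/a.tar.gz",
    "/apache/httpd/httpd-2.2.tar.gz",
    "/apache/tomcat/tomcat-5/bin/b.zip",
]
counts = count_projects(sample)
```

The leading slash produces an empty first field, which is why the project name is field 3 rather than field 2.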

We can now take the by_project.tsv file and plot the results (in this case, we plotted the top 18 projects, by number of downloads).
Downloads by Project

We can see that Tomcat and Httpd dwarf the rest of the projects in terms of file downloads, and the distribution appears to follow a power-law.

We’d love to hear how folks are using Pig to analyze their data. Drop us a line, or comment below!

]]>
http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/feed/
The Smart Grid and Big Data: Hadoop at the Tennessee Valley Authority (TVA) http://www.cloudera.com/blog/2009/06/02/smart-grid-big-data-hadoop-tennessee-valley-authority-tva/ http://www.cloudera.com/blog/2009/06/02/smart-grid-big-data-hadoop-tennessee-valley-authority-tva/#comments Tue, 02 Jun 2009 17:00:00 +0000 Christophe Bisciglia http://www.cloudera.com/blog/?p=763

For the last few months, we’ve been working with the TVA to help them manage hundreds of TB of data from America’s power grids. As the Obama administration investigates ways to improve our energy infrastructure, the TVA is doing everything they can to keep up with the volumes of data generated by the “smart grid.” But as you know, storing that data is only half the battle. In this guest blog post, the TVA’s Josh Patterson goes into detail about how Hadoop enables them to conduct deeper analysis over larger data sets at considerably lower costs than existing solutions. -Christophe

The Smart Grid and Big Data

At the Tennessee Valley Authority (TVA) we collect phasor measurement unit (PMU) data on behalf of the North American Electric Reliability Corporation (NERC) to help ensure the reliability of the bulk power system in North America. TVA is a federally owned corporation in the United States created by congressional charter in May 1933 to provide flood control, electricity generation, and economic development in the Tennessee Valley. NERC is a self-regulatory organization, subject to oversight by the U.S. Federal Energy Regulatory Commission and governmental authorities in Canada. TVA has been selected by NERC as the repository for PMU data nationwide. PMU data is considered part of the measurement data for the generation and transmission portion of the so-called “smart grid”.

PMU Data Collection

There are currently 103 active PMU devices placed around the Eastern United States that actively send data to TVA, and new PMU devices come online regularly. PMU devices sample high-voltage electric system busses and transmission lines at a substation several thousand times a second, and the results are then reported for collection and aggregation. PMU data is a GPS time-stamped stream of those power grid measurements, transmitted 30 times a second, with each measurement consisting of a timestamp and a floating-point value. The types of information a PMU point can contain are:

  • Voltage (A, B, C phase in positive, negative, or zero sequence) magnitude and angle
  • Current (A, B, C phase in positive, negative, or zero sequence) magnitude and angle
  • Frequency
  • dF/dt (change in frequency over time)
  • Digitals
  • Status flags

Commonly just positive sequence voltages and currents are transmitted but there is the possibility for all three phases. There can be several measured voltage and current phasors per PMU (each phasor having a magnitude and an angle value), a variable number of digitals (typically 1 or 2), and one of each of the remaining 3 types of data; on average there will be around 16 total measurements sent per PMU. Should a company wish to send all three phases or a combination of positive, negative, or zero sequence data, then the number of measurements obviously increases.

The amount of this time-series data created by even a regional area of PMU devices provides a unique architectural demand on the TVA infrastructure. The flow of data from measurement device to TVA is as follows:

  1. A measurement device located at the substation (the PMU) samples various data values, timestamps them via a GPS clock, and sends them over fiber or other suitable lines to a central location.
  2. For some participant companies this may be a local concentrator or it may be a direct connection to TVA itself. Communication between TVA and these participants is commonly a VPN tunnel over a LAN-to-LAN connection but several partners utilize a MPLS connection for more remote regions.
  3. After a few network hops, the data is sent to a TVA-developed data concentrator termed the Super Phasor Data Concentrator (SPDC), which accepts input from these PMUs and orders it into the correct time-aligned sequence, compensating for any missing data or delay introduced by network congestion or latency.
  4. Once organized by the SPDC, its modular architecture allows this data to be operated on by third party algorithms via a simple plug-in layer.
  5. The entire stream is passed to one of three servers running an archiving application, which writes the data to disk as a size-optimized, fixed-length binary file. The stream currently involves 19 companies, 10 different manufacturers of PMU devices, and 103 PMUs, each reporting an average of 16 measured values at a rate of 30 samples a second, with a possibility of 9 different encodings (and this only from the Eastern United States).
  6. A real-time data stream is simultaneously forwarded to a server program hosted by TVA which passes the conditioned data in a standard phasor data protocol (IEEE C37.118-2005) to client visualization tools for use at participant companies.
  7. An agent moves PMU archive files into the Hadoop cluster via an FTP interface.
  8. Alternatively, regulators such as NERC or approved researchers can directly request this data over secure VPN tunnels for operation at their remote location.
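
The time-alignment step in the flow above can be illustrated with a short Python sketch. This is not the concentrator's actual implementation; it only shows the idea of grouping out-of-order measurements by timestamp and marking missing values. The tuple layout (pmu_id, timestamp, value) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def time_align(measurements, pmu_ids):
    """Group (pmu_id, timestamp, value) tuples into time-ordered frames.

    Each frame holds one value per PMU, in the order given by pmu_ids;
    a PMU that reported nothing for a timestamp is represented by None.
    """
    frames = defaultdict(dict)
    for pmu_id, timestamp, value in measurements:
        frames[timestamp][pmu_id] = value
    return [
        (ts, [frames[ts].get(pmu) for pmu in pmu_ids])
        for ts in sorted(frames)
    ]


# Measurements arriving out of order, with PMU "B" missing at time 1.
incoming = [("B", 2, 60.1), ("A", 1, 59.9), ("A", 2, 60.0)]
aligned = time_align(incoming, ["A", "B"])
```

A real concentrator also has to decide how long to wait for late data before emitting a frame, which this sketch leaves out.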

TVA currently has around 1.5 trillion points of time-series data in 15TB of PMU archive files. The rate of incoming PMU data is growing very quickly with more and more PMU devices coming online regularly. We expect to have around 40TB of PMU data by the end of 2010 with 5 years worth of PMU data estimated to be at half a petabyte (500TB).
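
The figures quoted above can be cross-checked with some quick arithmetic. The sketch below uses only numbers from this post (103 PMUs, about 16 measurements each, 30 samples a second, 1.5 trillion points in 15TB) and assumes decimal terabytes.

```python
PMUS = 103
MEASUREMENTS_PER_PMU = 16      # average quoted in the post
SAMPLES_PER_SECOND = 30
SECONDS_PER_DAY = 24 * 60 * 60

# Incoming rate of individual time-series points
points_per_second = PMUS * MEASUREMENTS_PER_PMU * SAMPLES_PER_SECOND  # 49,440
points_per_day = points_per_second * SECONDS_PER_DAY                  # 4,271,616,000

# Average archive cost per point, assuming 1 TB = 10**12 bytes
ARCHIVE_BYTES = 15 * 10**12
ARCHIVE_POINTS = 15 * 10**11   # 1.5 trillion
bytes_per_point = ARCHIVE_BYTES / ARCHIVE_POINTS                      # 10.0

bytes_per_day = points_per_day * bytes_per_point

print(points_per_second, points_per_day, bytes_per_point, bytes_per_day)
```

At roughly 10 bytes per point, the current set of devices alone produces on the order of 43GB a day, or about 15.6TB a year, before any new PMUs come online.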

The Case For Hadoop At TVA

Our initial problem was how to store PMU data reliably and keep it available at all times. There are many brand name solutions in the storage world that come with a high price tag and the assumption of reliable hardware. With large amounts of data spanning many disks, a system will experience hardware failures quite frequently, even when each component has a high mean time to failure (MTTF). We liked the idea of being able to lose whole physical machines and still have an operational file system due to Hadoop’s aggressive replication scheme. The more we talked with other groups using HDFS, the more we came away with the impression that HDFS worked as advertised and shone even with amounts of data that the “reliable hardware” struggled with. Our discussions and findings also indicated that HDFS was quite good at moving data and included multiple ways to interface with it out of the box. In the end, Hadoop is a good fit for this project in that it allows us to employ commodity hardware and open source software at a fraction of the price of proprietary systems to achieve a much more manageable expenditure curve as our repository grows.

The other side of the equation is that eventually NERC and its designated research institutions are to be able to access the data and run operations on it. The concept of “moving computation to the data” with map-reduce made Hadoop an even more attractive choice, especially given its price point. The proposed uses of our PMU data range from simple pattern scans to complex data mining operations. The types of analysis and algorithms that we want to run aren’t well suited to SQL. It became obvious that we were more in the market for a batch processing system such as map-reduce than for a large relational database system. We were also impressed with the very robust open source ecosystem that Hadoop enjoys. Many projects built on Hadoop are actively being developed, such as:

  • Hive
  • HBase
  • Pig

This thriving community was very interesting to us as it gives TVA a wealth of quality tools with which to analyze PMU data using analysis techniques that are native to “big data”. After reviewing the factors above, we concluded that employing Hadoop at TVA kills two birds with one stone: it solves our storage issues with HDFS and provides a robust computing platform with map-reduce for researchers around North America.

PMU Data Analysis at TVA

Currently our analysis needs and wants are evolving with our nascent ideas on how best to use PMU data. Current techniques and algorithms on the board or in beta include

We are currently writing map-reduce applications to be able to crunch far greater amounts of power grid information than has been previously possible. Using traditional techniques to calculate something as simple as an average frequency over time can be an extremely tedious process because of the need to traverse terabytes of information; map-reduce allows us to not only parallelize the operation but also get much higher disk read speeds by moving the computation to the data. As we evolve our analysis techniques, we plan to expand our range of indexing techniques from simple scans to more complex data mining techniques to better understand how the power grid reacts to fluctuations and how anomalies previously thought to be discrete may, in fact, be interconnected.
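
To make the average-frequency example concrete, here is a minimal single-process Python sketch of how such a job decomposes into map and reduce steps. It is not TVA's code: the (timestamp, frequency) record layout, the hourly buckets, and the function names are all assumptions. The map step emits a partial (sum, count) per time bucket so that partial results can be combined in any order.

```python
from collections import defaultdict


def map_sample(timestamp, frequency_hz, bucket_seconds=3600):
    """Emit (bucket start, (partial sum, partial count)) for one sample."""
    bucket = int(timestamp // bucket_seconds) * bucket_seconds
    return bucket, (frequency_hz, 1)


def reduce_bucket(partials):
    """Combine (sum, count) pairs into the average for one bucket."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def average_frequency(samples, bucket_seconds=3600):
    """Simulate the shuffle locally: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for timestamp, hz in samples:
        key, partial = map_sample(timestamp, hz, bucket_seconds)
        grouped[key].append(partial)
    return {key: reduce_bucket(grouped[key]) for key in sorted(grouped)}


averages = average_frequency([(0, 60.0), (10, 60.5), (3600, 59.5)])
```

Carrying a sum and a count, rather than an average, is what lets the work be split across many machines and merged afterwards without losing accuracy.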

Additionally, we are also adding other devices such as Frequency Disturbance Recorders (FDRs, a.k.a. F-NET devices which are developed by Virginia Tech) to our network. Although these devices send samples at a third of the rate of PMU devices with a reduced measurement set, there exists the potential for many hundreds of these less expensive meters to come online which would effectively double our storage requirements. This FDR data would be interesting in that the extra data would allow us to create a more complete picture of the power grid and its behavior. Hadoop would allow us to continue scaling up to meet the extra demand not only for storage but for processing with map reduce as well. Hadoop gives us the flexibility and scalability to meet future demands that can be placed upon the project with respect to data scale, processing complexity, and processing speed.

Looking Forward With Hadoop

As we move forward using Hadoop, there are a few areas we’d like to see improved. Security is a big deal in our field, especially given the nature of the data and agencies involved. We would like to see security continue to be improved by the Hadoop community as a whole as time goes on. Security internally and externally is a big part of what we do, so we are always examining our production environment to make sure we fulfill our requirements. We also are looking at ways to allow multiple research projects to coexist on the same system, such that they share the same infrastructure but can queue up their own jobs and download the results from their own private account area while only having access to the data that their project allows. Research can be a competitive business and we are looking for unique ways to allow researchers to work with the same types of data while feeling comfortable about their specific work remaining private; additionally we are required to maintain the privacy of all the data providers - researchers will only be allowed to access a filtered set of measurements as allowed by the data providers or as deemed available for research by the NERC.

In our first discussions about whether or not we would explore cloud computing as an option for processing our PMU data, we wanted to know if there was a “Redhat-like” entity in the space that could answer questions and provide support for Hadoop. Cloudera has definitely stepped up to the plate to fulfill this role for Hadoop. Cloudera provides exceptional support in a very dynamic space, a space in which many companies have no experience and many consulting firms can provide no solid advice. Cloudera was quick to make sure that Hadoop was right for us and then provided extremely detailed answers to all of our questions and what-if scenarios. Their whole team was exceptionally adept in getting back to us on a myriad of details most sales or “front line support” teams would be stymied by. Cloudera’s distribution for Hadoop and guidance on hardware acquisition helped in saving us money and getting our evaluation of Hadoop off the ground in a very short amount of time.

]]>
http://www.cloudera.com/blog/2009/06/02/smart-grid-big-data-hadoop-tennessee-valley-authority-tva/feed/
Introducing Sqoop http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/ http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/#comments Mon, 01 Jun 2009 17:00:37 +0000 aaron http://www.cloudera.com/blog/?p=759 In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse

After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.

Motivation

Hadoop MapReduce is a powerful tool; its flexibility in parsing unstructured or semi-structured data means that there is a lot of potential for creative applications. But your analyses are only as useful as the data which they process. In many organizations, large volumes of useful information are locked away in disparate databases across the enterprise. HDFS, Hadoop’s distributed file system, represents a great place to bring this data together, but actually doing so is a cumbersome task.

Consider the task of processing access logs and analysing user behavior on your web site. Users may present your site with a cookie that identifies who they are. You can log the cookies in conjunction with the pages they visit. This lets you correlate users with their actions. But actually matching their behavior against their profiles or their previously recorded history requires that you look up information in a database. If several MapReduce programs needed to do similar joins, the database server would experience very high load, in addition to a large number of concurrent connections, while MapReduce programs were running, possibly causing performance of your interactive web site to suffer.

The solution: periodically dump the contents of the users database and the action history database to HDFS, and let your MapReduce programs join against the data stored there. Going one step further, you could take the in-HDFS copy of the users database and import it into Hive, allowing you to perform ad-hoc SQL queries against the entire database without working on the production database.

Sqoop makes all of the above possible with a single command-line.
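
To illustrate the kind of join described above, here is a small Python sketch of a task joining log records against a dumped copy of the users table instead of querying the live database. The comma-delimited layouts, field choices, and function names are assumptions made for the example; they are not Sqoop's output format.

```python
def load_users(lines, delimiter=","):
    """Build an in-memory lookup table from a dumped users file."""
    users = {}
    for line in lines:
        user_id, first_name, last_name, state = line.rstrip("\n").split(delimiter)
        users[user_id] = {
            "first_name": first_name,
            "last_name": last_name,
            "state": state,
        }
    return users


def join_actions(action_lines, users, delimiter=","):
    """Yield (user_id, state, uri) for each logged action by a known user."""
    for line in action_lines:
        user_id, uri = line.rstrip("\n").split(delimiter, 1)
        profile = users.get(user_id)
        if profile is not None:
            yield user_id, profile["state"], uri


users = load_users(["1,Ada,Lovelace,CA\n", "2,Alan,Turing,NY\n"])
joined = list(join_actions(["1,/index.html\n", "3,/about\n", "2,/faq\n"], users))
```

Loading the smaller table into memory works when it fits on each node; for larger tables a job would typically join in the reduce phase instead.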

Example Usage

Continuing the example above, let’s say that our front end servers connected to a MySQL database named website, stored on db.example.com. The website database has several tables, but the one we are most interested in is one named USERS.

This table has several columns; it might have been created from a SQL statement like:

CREATE TABLE USERS (
  user_id INTEGER NOT NULL PRIMARY KEY,
  first_name VARCHAR(32) NOT NULL,
  last_name VARCHAR(32) NOT NULL,
  join_date DATE NOT NULL,
  zip INTEGER,
  state CHAR(2),
  email VARCHAR(128),
  password_hash CHAR(64));

Importing this table into HDFS could be done with the command:

you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --local --hive-import

This would connect to the MySQL database on this server and import the USERS table into HDFS. The --local option instructs Sqoop to take advantage of a local MySQL connection which performs very well. The --hive-import option means that after reading the data into HDFS, Sqoop will connect to the Hive metastore, create a table named USERS with the same columns and types (translated into their closest analogues in Hive), and load the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).

Suppose you wanted to work with this data in MapReduce and weren’t concerned with Hive. When storing this table in HDFS, you might want to take advantage of compression, so you’d like to be able to store the data in SequenceFiles. In this case, you might want to import the data with the command:

you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --as-sequencefile

Sqoop will also emit a Java class named USERS with getter methods for each of the columns of the table. The generated classes support the majority of SQL’s types, including values that may be null. The data will be loaded into HDFS as a set of SequenceFiles; you can use the USERS.java class to work with the data in your MapReduce analyses.

Sqoop can also connect to other databases besides MySQL; anything with a JDBC driver should work. If you are running locally on a MySQL server the import will be especially high-performance, but a MapReduce-based import mechanism allows remote database connections as well. Authenticated connections with usernames and passwords are also supported. Several other options allow you to control which columns of a table are imported, and other aspects of the import process. The full reference manual is available at www.cloudera.com/hadoop-sqoop.

A Closer Look

In this section I’ll briefly outline how Sqoop works under the hood.

In an earlier blog post, I described the DBInputFormat, a connector that allows Hadoop MapReduce programs to read rows from SQL databases. DBInputFormat allows Hadoop to read input from JDBC: a Java interface to databases that most popular database vendors (Oracle, MySQL, Postgresql, etc.) implement.

In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious, and entirely algorithmic. Sqoop auto-generates class definitions to deserialize the data from the database. These classes can also be used to store the results in Hadoop’s SequenceFile format, which allows you to take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate in your own data processing pipeline later. The class definition is created by taking advantage of JDBC’s ability to read metadata about databases and tables.

When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
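
The metadata-driven code generation described above can be illustrated with a toy Python version. This is not how Sqoop is implemented: Sqoop reads JDBC metadata and writes Java source, whereas this sketch uses Python's built-in sqlite3 module as a stand-in database and generates a Python class. The function name and the shortened USERS table are assumptions for the example.

```python
import sqlite3


def generate_record_class(conn, table):
    """Return Python source for a class with one getter per table column."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    names = [row[1] for row in columns]
    lines = [f"class {table}:"]
    lines.append(f"    def __init__(self, {', '.join(names)}):")
    for name in names:
        lines.append(f"        self._{name} = {name}")
    for name in names:
        lines.append(f"    def get_{name}(self):")
        lines.append(f"        return self._{name}")
    return "\n".join(lines) + "\n"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE USERS ("
    "user_id INTEGER NOT NULL PRIMARY KEY, "
    "first_name VARCHAR(32) NOT NULL, "
    "state CHAR(2))"
)
conn.execute("INSERT INTO USERS VALUES (1, 'Ada', 'CA')")

source = generate_record_class(conn, "USERS")
namespace = {}
exec(source, namespace)          # compile the generated class
USERS = namespace["USERS"]
record = USERS(*conn.execute("SELECT * FROM USERS").fetchone())
```

The point of the sketch is that everything needed to write the record class is already available from the database's own metadata, so no hand-written deserialization code is required.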

Hadoop users know that moving large volumes of data can be a time-intensive operation. While it provides a reliable implementation-independent mechanism to read database tables, using a MapReduce JDBC job to import data from a remote database is often inefficient. Database vendors usually provide an export tool that exports data in a more high-performance manner. Sqoop is capable of using alternate import strategies as well. By examining the connect string URL that tells Sqoop which database to connect to, Sqoop will choose alternate import strategies as appropriate to the database. We’ve already implemented the ability to take advantage of MySQL’s export tool called mysqldump. We’ll add support for other systems as soon as we can.
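
As a rough illustration of choosing a strategy from the connect string, the Python sketch below inspects the vendor part of a JDBC URL and falls back to the generic MapReduce import when no faster path is known. The function name and strategy labels are hypothetical and do not reflect Sqoop's internal class names.

```python
def choose_import_strategy(connect_string):
    """Pick an import strategy from a JDBC connect string such as
    jdbc:mysql://db.example.com/website."""
    prefix = "jdbc:"
    if not connect_string.startswith(prefix):
        raise ValueError("not a JDBC connect string: " + connect_string)
    # The vendor name sits between "jdbc:" and the next colon.
    vendor = connect_string[len(prefix):].split(":", 1)[0].lower()
    fast_paths = {"mysql": "mysqldump"}
    return fast_paths.get(vendor, "mapreduce-jdbc")


mysql_choice = choose_import_strategy("jdbc:mysql://db.example.com/website")
other_choice = choose_import_strategy("jdbc:postgresql://db.example.com/website")
```

Keeping the generic JDBC path as the default means any database with a driver still works, while vendors with a dedicated export tool get the faster route.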

Getting Sqoop

The first beta release of Sqoop is available today as part of Cloudera’s Distribution for Hadoop. It installs as part of the same RPM (or Debian package) that contains Hadoop itself.

Hadoop users who aren’t using our distribution can apply the patch that is contributed to Apache Hadoop as issue HADOOP-5815, and compile it themselves, but Sqoop won’t be part of the standard Hadoop release for some time (at least until version 0.21.0). mysqldump support is added in HADOOP-5844, and Hive integration is provided in HADOOP-5887.

You can read the documentation for Sqoop at http://www.cloudera.com/hadoop-sqoop. You can also get some basic usage information from Sqoop itself by running sqoop --help after it’s installed.

We also did a preview of this tool at the May Bay Area Hadoop User Group meet-up; you can catch the presentation here:


View on Vimeo.

We hope you find this tool useful—please check it out! Then let us know your feedback on GetSatisfaction. Bug reports and feature requests especially welcome.

]]>
http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/feed/
Common Questions and Requests From Our Users http://www.cloudera.com/blog/2009/05/29/common-questions-and-requests-from-our-users/ http://www.cloudera.com/blog/2009/05/29/common-questions-and-requests-from-our-users/#comments Fri, 29 May 2009 16:00:44 +0000 alex http://www.cloudera.com/blog/?p=731 A few months ago we announced the Cloudera Distribution for Hadoop.  We’re happy to report that lots of people have started using our distribution, and our GetSatisfaction product (which is essentially a message board about our products) has seen lots of good Hadoop questions and answers.  We thought it would be worthwhile to share some of the interesting questions and requests we’ve seen from our users.

Question: How do I back up my name node metadata?
The name node (NN) stores all of the HDFS metadata, which includes file names, directory structures, and block locations.  This metadata is stored in memory for fast lookup, but the NN also maintains two on-disk data structures to ensure that metadata is persisted.  The first structure stored is a snapshot of the in-memory metadata, and the second structure stored is an edit log of changes that have been made since the snapshot was last taken.  The secondary name node (2NN) is in charge of fetching the snapshot and edit log from the NN and merging the two into a new snapshot, which is then sent back to the NN.  Once the NN gets the new snapshot, it clears its edit log, and the process repeats.  Take a look at our other blog post about multi-host secondary name nodes for more information about configuring the 2NN.

There are two types of metadata backups that one should implement, and each type solves a different problem.  I will talk about each of these backup strategies separately.  The first backup strategy is used to ensure that no metadata is lost in the event of a NN failure, whether that failure be disks dying, power supplies catching fire, or some other unforeseen loss of the NN or its local data.  The way to avoid losing NN metadata in the event of a crash is to configure dfs.name.dir such that it writes to several local disks and at least one NFS mount.  dfs.name.dir takes a comma-separated list of local filesystem paths, so an example configuration might look like “/hdd1/hadoop/dfs/name,/hdd2/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name”.  The purpose of storing data on several local hard drives is to avoid data loss in the case of a single drive failing.  The purpose of storing data on a NFS mount is to avoid data loss in the case of the NN machine going down entirely.  With at least two local drives and one NFS mount storing the same NN metadata, you should be well protected from losing any data from a crash.  To be fair, NFS isn’t the only solution for mounting a remote file system, but it’s the de facto standard for Hadoop.

The second backup strategy is used to allow recovery from accidental data loss due to user error (such as a careless hadoop fs -rmr /*).  As mentioned earlier, the 2NN is able to fetch the NN’s metadata snapshot and edits log over HTTP.  That said, if you’d like to perform hourly or nightly backups of the NN metadata, you can do so by querying the following URLs:

  • Snapshot: http://nn.domain.com:50070/getimage?getimage=1
  • Edits log: http://nn.domain.com:50070/getimage?getedit=1
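
A periodic backup built on these two URLs could look like the Python sketch below. It assumes the NN web interface is reachable on port 50070 as in the URLs above; the function names, the output file names, and the destination directory are hypothetical. Nothing is downloaded until backup_metadata is called.

```python
import shutil
import urllib.request
from pathlib import Path


def backup_urls(namenode, port=50070):
    """Return the snapshot and edits log URLs for a name node host."""
    base = f"http://{namenode}:{port}/getimage"
    return {
        "fsimage": f"{base}?getimage=1",
        "edits": f"{base}?getedit=1",
    }


def backup_metadata(namenode, dest_dir, opener=urllib.request.urlopen):
    """Download both files into dest_dir and return the written paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for name, url in backup_urls(namenode).items():
        target = dest / name
        with opener(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        written.append(target)
    return written


# Example (run from cron, hourly or nightly):
#   backup_metadata("nn.domain.com", "/backups/nn/2009-05-29")
```

Writing each run into its own dated directory keeps older copies around, which is what makes recovery from a user error possible.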

Note that using LVM snapshots to back up the snapshot and edit log is also a good idea; LVM snapshots allow for more reliable backups.

To recover from a NN failure, or to restore from a backup, just take the edits log and snapshot — either from the NFS server or from your backup archive — and place them in the following places:

  • Snapshot: dfs.name.dir/current/fsimage
  • Edits log: dfs.name.dir/current/edits

Note that the NN daemon should not be running when you change its snapshot and edits log.

Bonus link: learn more about protecting data node metadata.

Request: I want Cloudera’s Distribution for Hadoop on Mac OS X.
We distribute our Distribution for Hadoop by providing RPMs, DEBs, and AMIs.  RPMs are installable on Redhat-based Linux distributions such as CentOS, RHEL, and Fedora. DEBs are installable on Debian-based Linux distributions such as Debian and Ubuntu. AMIs are machine images used for running our distribution in EC2. We’ve had several people request packaging for Mac OS X.  Our near-term solution for getting our distribution on Macs is to provide tarballs similar to the tarballs you download for vanilla Hadoop.  Perhaps at some later time we’ll provide self-installing DMGs, but we don’t have them on the road map.

Question: My MapReduce jobs throw Exceptions saying that I’ve run out of disk space, but I have plenty of disk space.  What’s up?
Most users who run into this have misconfigured hadoop.tmp.dir. If you’re using vanilla Hadoop, then hadoop.tmp.dir defaults to a directory under /tmp (/tmp/hadoop-${user.name}).  This is problematic, because many Linux installations limit the space available in /tmp (for example, with a quota or a small partition), making Hadoop think it’s out of disk space when it tries to write temporary data.  Be sure to configure hadoop.tmp.dir to a directory that has plenty of space; it’s fine for hadoop.tmp.dir to write to the same partitions (not directories, though) as dfs.data.dir, dfs.name.dir, etc.  As long as each of these parameters points to a different directory, Hadoop will manage disk space in a reasonable way.

Also worth noting is that mapred.local.dir should be configured to write to multiple disks, rather than relying on hadoop.tmp.dir.

Request: We want Pig 0.2.0!
Yes, we know :).  We’ve had several requests for Pig 0.2.0 to be included in our distribution.  It’s coming soon!  Stay tuned for an announcement.

Question: I have a big NFS server at my disposal; how can I use it for my Hadoop cluster?
One of Hadoop’s design goals was to avoid having data be stored in a single place such as a NFS server.  Hadoop is able to analyze and compute lots of data because data is distributed across many nodes, allowing for spatial locality when computing, and also allowing for files to be read from several data nodes in parallel.  If a NFS mount were used to store HDFS data, then the NFS server would certainly become a bottleneck and would slow down your entire cluster, because several task trackers would request data from the NFS server at once, probably lighting the NFS server on fire and bringing your job to a crawl.  Though a NFS server can’t really help you compute more data faster, it can make your operational tasks easier.  An ops engineer can use NFS to ensure that each node has the same Hadoop code, dependent files (e.g., if you’re using Hadoop Streaming), and consistent home directories.  As long as Hadoop is not reading or writing HDFS data to or from a NFS mount, the NFS server should not be a bottleneck.  A NFS server could also be used to collect Hadoop logs from all machines in a small cluster, though using Scribe for log collection is a much more scalable solution.

I have more questions!
Do you have more questions about Hadoop, Pig, Hive, or our Distribution for Hadoop?  There are several ways in which you can get your questions answered.  If you have a general Hadoop, Hive, or Pig question, then you are most likely to get the best response on the user lists: pig-user, hive-user, hadoop-user.  Cloudera engineers participate in these lists a lot as well.  If you have questions about Cloudera’s Distribution for Hadoop, then post a message to our GetSatisfaction message board.  We’re always happy to help out.

]]>
http://www.cloudera.com/blog/2009/05/29/common-questions-and-requests-from-our-users/feed/
Building a distributed concurrent queue with Apache ZooKeeper http://www.cloudera.com/blog/2009/05/28/building-a-distributed-concurrent-queue-with-apache-zookeeper/ http://www.cloudera.com/blog/2009/05/28/building-a-distributed-concurrent-queue-with-apache-zookeeper/#comments Thu, 28 May 2009 22:30:15 +0000 henry http://www.cloudera.com/blog/?p=695 In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution - a kind of distributed barrier - is surprisingly difficult to do correctly. ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes - locks are just another kind of coordination primitive - in the form of small files that ZooKeeper tracks.

In order to be useful, ZooKeeper must be both highly reliable and available as systems will rely upon it as a critical component. For example, if locks cannot be taken, processes cannot make progress and the whole system will grind to a halt. ZooKeeper is built on a suite of reliable distributed systems techniques and protocols, and is typically run on a cluster of machines so that if some should fail, the remaining ones can continue to provide service. Under the hood, ZooKeeper is responsible for ordering calls made by clients so that each request is processed atomically and in a fixed and firm order.

One of my first contributions to the project was a set of bindings to allow programs written in the Python language to act as clients to a ZooKeeper cluster. ZooKeeper was natively written in Java, and there are already C and Perl bindings. Adding Python bindings increases the number of people that can use the system, and brings the strengths of Python, such as rapid prototyping, to bear when designing distributed systems.

The Python ZooKeeper bindings are available from the ZooKeeper SVN repository and should be part of the 3.2 release, planned for the next couple of weeks. To use the bindings now, you can either check out the latest version of the code from the SVN repository, or download a tarball containing a recent snapshot here. The zookeeper module exposes the ZooKeeper API to Python, so to get started all you need do is add import zookeeper to your Python script once the module is installed. Instructions on getting up and running are at the end of this post.

To illustrate some of the ZooKeeper API, I’ve written a distributed FIFO queue in Python - the source code is here - which I wanted to share. The combination of Python and ZooKeeper meant that I was able to write the queue in just over 60 lines of code, and most of that deals with local coordination issues between two threads rather than any tricky issues trying to make remote processes behave correctly. I can only give a taste here of how programming with Python and ZooKeeper works. I hope there’s enough here to convince you that ZooKeeper might make a useful component for distributed systems that need a little herding.

ZooKeeper

ZooKeeper provides a tree abstraction where every node in that tree (or znode, in ZooKeeper parlance) is a file on which a variety of simple operations can be performed. ZooKeeper orders operations on znodes so that they occur atomically. Therefore there is no need to use complex locking protocols to ensure that only one process can access a znode at a time. The tree represents a hierarchical namespace, so that many distinct distributed systems can use a single ZooKeeper instance without worrying about their files having the same name.

Each znode has some associated data - up to a megabyte in current builds - that can be updated atomically. Every update to a znode increases its version number, which allows clients to perform compare-and-swap operations by reading the version and then updating a znode only if the version is still the one that was read.
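
The compare-and-swap idea can be sketched without a server. The following is a minimal in-memory model, not the zookeeper module's API: the class VersionedNode and its methods get and set_if_version are hypothetical names chosen for illustration.

```python
import threading


class VersionedNode(object):
    """In-memory stand-in for a znode: data plus a version counter.

    Hypothetical class for illustration; not part of the zookeeper module.
    """

    def __init__(self, data):
        self._lock = threading.Lock()
        self._data = data
        self._version = 0

    def get(self):
        # Return the data together with the version it was read at.
        with self._lock:
            return self._data, self._version

    def set_if_version(self, data, expected_version):
        # Apply the update only if nobody has written since the read.
        with self._lock:
            if self._version != expected_version:
                return False
            self._data = data
            self._version += 1
            return True


def increment(node):
    # Compare-and-swap loop: read, compute, retry if the version moved.
    while True:
        value, version = node.get()
        if node.set_if_version(value + 1, version):
            return value + 1


counter = VersionedNode(0)
_, stale_version = counter.get()
increment(counter)  # the version moves from 0 to 1
# A writer still holding the old version is rejected.
stale_write_ok = counter.set_if_version(99, stale_version)
```

The rejected writer would normally re-read the node and try again, which is what the loop in increment does.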

As a notification mechanism, ZooKeeper provides watches, which are callback methods that are called asynchronously when an event of interest occurs. Watches are attached, typically, to an individual znode. When that znode changes, any watcher on the znode will be fired asynchronously on the client. Many methods of the ZooKeeper API have an optional watch argument. Some languages have to work hard to provide callable objects as parameters, but Python makes this easy as callables are first-class language constructs. Simply pass any callable, like a method or a lambda expression, to the zookeeper module and when an event of interest occurs, the callable will be executed.

This call comes from a separate thread of execution, so great care must be taken to ensure that unexpected things do not happen due to your watcher being fired at an arbitrary point in the execution of your script. Normally you will use watchers to notify another thread of a state change. It will often be the case that the main thread will be waiting for the watcher to fire before it can continue. An example of this is in the __init__ method of our ZooKeeperQueue when we try to connect to the server. Compared to the time a script takes to execute, connections can take a long time to run. So it’s useful that the ZooKeeper API allows us to connect asynchronously, in case there were any work that we wanted to get done while we were waiting for the connection to be established. However, in our case, we just want to wait until the connection is successful, and so we need a mechanism to wait for the watcher to notify us.

A useful tool for this inter-thread communication is the Condition object in Python, which represents a condition variable, a well-known concurrent programming abstraction. Condition objects may be acquired and released just like locks, but they also expose an API to wait for a notification from another thread and to fire that notification. While a thread is waiting on a Condition it goes to sleep, leaving the operating system with some free CPU to dedicate to other processes. Once a Condition is notified, a thread that is waiting on it is woken up and allowed to continue execution once the notifying thread has released the Condition.
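
Here is a minimal sketch of that wait-and-notify behaviour using only the standard threading module, with no ZooKeeper involved. The names notifier, state and woke_with_flag_set are made up for the example, and the plain thread stands in for whatever would fire a watcher.

```python
import threading

cv = threading.Condition()
state = {"ready": False}


def notifier():
    # Runs on another thread, much as a watcher callback would.
    with cv:
        state["ready"] = True
        cv.notify()


with cv:
    # Start the other thread while holding the condition, so that it
    # cannot send its notification before we are waiting for it.
    worker = threading.Thread(target=notifier)
    worker.start()
    if not state["ready"]:
        cv.wait(5.0)  # releases the condition while this thread sleeps
    woke_with_flag_set = state["ready"]
worker.join()
```

Checking a flag after waking, rather than relying on the wake-up alone, lets the waiter tell a real notification from a timeout.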

This leads to a simple pattern for communicating between watchers and the main thread. Here’s an excerpt from the connection code:

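def watcher(handle,type,state,path):
    # Runs on ZooKeeper's callback thread when the connection event arrives.
    print "Connected"
    self.cv.acquire()
    self.connected = True
    self.cv.notify()
    self.cv.release()

# Acquire the condition before connecting so the notification cannot be missed.
self.cv.acquire()
self.handle = zookeeper.init("localhost:2181", watcher, 10000, 0)
# Sleep until the watcher notifies us, or give up after ten seconds.
self.cv.wait(10.0)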

First we define our watcher which takes four parameters (if you want to provide more parameters or local state to a watcher, one way to do it is to wrap a function call in a local lambda which captures the state). The next line acquires an exclusive lock on a condition variable cv. Why do this now? Once we set our watcher in place, it could be fired at any time - even before the main thread makes progress to the next line of code. If we don’t prevent it from sending a notification on the condition variable before we’re ready to look for it, the notification could get lost and we could wait forever. Notifications aren’t buffered - if no one is waiting on a condition variable, no one gets woken up.

Then the code initialises ZooKeeper. The zookeeper module gives us an integer handle which we can use to refer to our connection in the future (we can open many connections per client). The next line tells us to wait until we receive a notification on the condition variable that the connection has succeeded. The parameter is a timeout in seconds, after which if we are still not connected we presume that something is wrong and abort.
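
The earlier remark about wrapping a watcher in a lambda to capture extra state can be sketched as follows. The helper make_watcher, the function on_event, the queue name and the argument values are all hypothetical. The final call only imitates what the zookeeper module would do when it fires the watcher.

```python
def make_watcher(queue_name, log):
    # The real work needs two extra pieces of state beyond the four
    # arguments that a watcher receives.
    def on_event(queue_name, log, handle, type, state, path):
        log.append((queue_name, type, state, path))

    # The lambda has the four-parameter shape a watcher must have, and
    # forwards the captured state alongside the arguments it is given.
    return lambda handle, type, state, path: on_event(
        queue_name, log, handle, type, state, path)


events = []
watcher = make_watcher("/myqueue", events)

# Imitate the callback; these argument values are invented for the example.
watcher(0, 1, 3, "/myqueue")
```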

The ZooKeeper queue

A FIFO queue is a simple data structure where producers put items in, and consumers retrieve them in the order they were put in. There are only two operations on a basic queue: enqueue adds an item and dequeue removes it. Despite their simplicity, queues crop up very often in distributed systems - for example, in job submission systems where clients submit requests to a set of workers which serve the requests on a first-come, first-served basis.

The ZooKeeper queue is structured very simply. All items are stored as znodes under a single top-level znode which represents a queue instance. Consumers can retrieve items by getting and then deleting a child of the top-level znode. The code creates a queue by calling a single create command. If the queue already exists, the Python module will throw an exception which we catch. This is a design decision that is still in review - future versions of the bindings might return integer error codes, and rely on the user to throw an exception if required.

zookeeper.create(self.handle,self.queuename,"queue top level", [ZOO_OPEN_ACL_UNSAFE],0)

The first two arguments to this call identify the connection to the ZooKeeper service and the name of the znode. The third is the data the znode contains. We won’t be accessing the data so we write some placeholder text.

The fourth argument is an access control list of permissions that controls who can access the znode in the future. ZooKeeper provides fairly fine-grained control over access, but the subject is beyond the scope of this post. What we have done here is to create the queue znode so that any client can read or write to it.

Adding and deleting items from the queue

Although I explained how consumers retrieve items from the queue, I said nothing about how they make sure they are retrieving items in FIFO order. What we would like is a way of naming each item such that later items are ordered lexicographically after earlier ones. If we can retrieve items in the same order, we’ll have our queue. Thankfully, ZooKeeper provides a very handy flag for the create call that helps us out. Specifying the zookeeper.CREATE_SEQUENCE flag appends to each znode name a sequence number suffix that increases monotonically with each new znode that is created. ZooKeeper ensures that the sequence numbers are applied in order and are not reused.

Enqueuing an item is therefore a simple one-liner. We don’t have to take out any locks to ensure that access to the queue znode is serialised. Items may be queued concurrently, and ZooKeeper takes care of assigning sequence numbers to them in the order they were received.
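
To see why a plain lexicographic sort recovers creation order, it helps to look at the shape of the names. ZooKeeper's documentation describes the suffix as a counter zero-padded to ten digits. The sketch below imitates that format locally, and the item- prefix and the helper sequential_name are assumptions made for the example.

```python
def sequential_name(prefix, counter):
    # Imitates a zero-padded ten-digit sequence suffix.
    return "%s%010d" % (prefix, counter)


creation_counters = (9, 10, 2, 100)
padded = [sequential_name("item-", n) for n in creation_counters]
unpadded = ["item-%d" % n for n in creation_counters]

# With padding, string order and numeric order agree.
padded_order = sorted(padded)
# Without padding, "item-10" would sort before "item-2".
unpadded_order = sorted(unpadded)
```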

Dequeuing an item is also straightforward, but a bit more involved. First we retrieve a list of all the items waiting to be dequeued from ZooKeeper with the get_children procedure call. Then, after sorting the list of items on the client, we get the contents of the first znode (i.e. the item’s data) and then try to delete it.

It is possible that this deletion will fail because some other consumer has managed to successfully retrieve the item beforehand. We could ensure that this would never happen by arranging for a queue-wide lock - this is easily implemented in ZooKeeper (although left as an exercise for the reader). However, this would severely impact performance by only allowing a single consumer to access the queue at one time. Instead, the client simply deals with the failed delete - again, indicated via an exception - and moves on to the next child znode in the list. If the client reaches the end of the list without successfully deleting an item, it should issue another get_children call to make sure that no items were added while the original list was being scanned. Once the get_children call returns an empty list, the dequeue procedure gives up and returns None.
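
The control flow of that dequeue loop can be tried out against an in-memory substitute. This is a sketch under stated assumptions, not the queue's actual source. FakeZooKeeper, NoNodeError and create_sequential are hypothetical stand-ins, and the fake combines the get and the delete into one step for brevity.

```python
import threading


class NoNodeError(Exception):
    """Raised when an item no longer exists (hypothetical stand-in)."""


class FakeZooKeeper(object):
    """Tiny in-memory substitute for a queue znode and its children."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}
        self._counter = 0

    def create_sequential(self, data):
        with self._lock:
            name = "item-%010d" % self._counter
            self._counter += 1
            self._items[name] = data
            return name

    def get_children(self):
        with self._lock:
            return list(self._items)

    def get_and_delete(self, name):
        # Fails if another consumer already removed the item.
        with self._lock:
            if name not in self._items:
                raise NoNodeError(name)
            return self._items.pop(name)


def dequeue(zk):
    while True:
        children = sorted(zk.get_children())
        if not children:
            return None  # the queue is empty, so give up
        for child in children:
            try:
                return zk.get_and_delete(child)
            except NoNodeError:
                continue  # lost the race for this item; try the next
        # Every listed item was taken by someone else, so list again.


zk = FakeZooKeeper()
for word in ("first", "second", "third"):
    zk.create_sequential(word)
results = [dequeue(zk) for _ in range(4)]
```

The fourth call finds an empty list and returns None, matching the give-up rule described above.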

Blocking reads

Sometimes we might want to block until an item is available to retrieve. It would be inefficient to copy exactly the non-blocking approach and simply loop, issuing get_children requests until an item was found. Instead, we can leverage ZooKeeper’s watcher mechanism to provide an asynchronous notification when a new znode is created as a child of the queue znode. The code to accomplish this is a combination of the patterns we’ve seen already in the dequeue and connection code.

def block_dequeue(self):
    def queue_watcher(handle,event,state,path):
        # Wake the waiting client when something changes under the queue znode.
        self.cv.acquire()
        self.cv.notify()
        self.cv.release()
    while True:
        self.cv.acquire()
        children = sorted(zookeeper.get_children(self.handle, self.queuename, queue_watcher))
        for child in children:
            # Try each item in order; None means another consumer took it first.
            data = self.get_and_delete(self.queuename+"/"+child)
            if data is not None:
                self.cv.release()
                return data
        self.cv.wait()
        self.cv.release()

First the client acquires a lock to prevent the watcher sending a notification when the client is unready. Then, as in the dequeue method, the client retrieves a list of items, but here a watcher parameter is specified. The watcher will fire whenever any event is seen that is relevant to the queue znode. The watcher acquires the lock - blocking until the client has given it up - and then notifies the client that there may be more items available.

The client only waits for this notification if all the children returned from get_children have already been consumed by others - otherwise it will successfully retrieve an item and return it. Once all possible items have been exhausted, the client waits on the condition variable. After being woken up, it repeats the same list-read-delete-wait loop.

Failure modes

ZooKeeper operations can fail in a number of ways. In order to keep this example simple, most errors are raised as exceptions and the queue aborts. A more robust implementation should catch errors at every ZooKeeper invocation, as many can be recovered from with a little effort.

The zookeeper.CONNECTIONLOSS error condition is particularly worth noting. ZooKeeper may drop a client connection at any time, due to physical link loss, network congestion or another connection problem. This can cause ZooKeeper API invocations to abort before the ZooKeeper cluster is able to inform the client of the operation’s success. This is problematic for our queue, as enqueue operations may or may not have succeeded when we receive a CONNECTIONLOSS error.

There are several approaches we can take to this problem. The first is to blindly retry enqueue when a connection is lost. This could result in an item being queued several times, but for some systems this is not a significant problem. For example, if a web page is crawled twice, apart from the time cost there will be no hardship caused to an indexing engine.

For some applications, duplication of enqueue operations is problematic. The obvious ‘solution’ is to check whether an item is in the queue after it has been queued. However, it is possible that a consumer will have retrieved and deleted the item between the connection loss event and the subsequent reconnection and existence check. Instead, a two-phase protocol is necessary where a producer marks an item as ‘consumable’ only when it is sure it is in the queue, by atomically updating its associated data with a flag. Consumers may only retrieve items for which the flag is set. If a connection loss occurs during the setting of this flag, recovery is easier as the set call may be reissued - if the item is no longer present in the queue, the only possible explanation is that the original flag update succeeded and the item has been consumed. This is not built into the example code, but a production system should implement a similar form of connection loss recovery.
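
The second phase of that protocol can be sketched against an in-memory store that loses a reply on demand. Everything here is hypothetical: FlakyStore, ConnectionLoss, NoNode, mark_consumable and take are names invented for the example, and none of them come from the zookeeper module. A lost reply during the first phase is left unhandled.

```python
class ConnectionLoss(Exception):
    """Stand-in for a lost connection; the operation may have applied."""


class NoNode(Exception):
    """Stand-in for operating on an item that no longer exists."""


class FlakyStore(object):
    """In-memory stand-in for a queue whose replies can get lost."""

    def __init__(self):
        self.items = {}           # name -> [data, consumable flag]
        self.counter = 0
        self.lose_reply_to = None  # name of the operation whose reply is dropped

    def _reply(self, op, value):
        # The operation has already been applied; only the reply is lost.
        if self.lose_reply_to == op:
            self.lose_reply_to = None
            raise ConnectionLoss()
        return value

    def create(self, data):
        name = "item-%010d" % self.counter
        self.counter += 1
        self.items[name] = [data, False]
        return self._reply("create", name)

    def mark_consumable(self, name):
        if name not in self.items:
            raise NoNode(name)
        self.items[name][1] = True
        return self._reply("mark_consumable", None)

    def take(self):
        # Consumers skip items whose flag has not been set yet.
        for name in sorted(self.items):
            data, consumable = self.items[name]
            if consumable:
                del self.items[name]
                return data
        return None


def enqueue(store, data):
    name = store.create(data)  # phase one: the item exists but is hidden
    while True:
        try:
            store.mark_consumable(name)  # phase two: publish the item
            return
        except ConnectionLoss:
            continue  # safe to reissue, since setting the flag twice is harmless
        except NoNode:
            return  # already consumed, so the earlier update succeeded


store = FlakyStore()
store.lose_reply_to = "mark_consumable"
enqueue(store, "job")
taken = [store.take(), store.take()]
```

Although the producer retried after the lost reply, the item is consumed only once.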

Taking care of failure modes like this one often comprises most of the work of building a distributed system. The key is to understand every exception that API calls can throw, and to know what your code does in every circumstance.

Using the queue

To use the queue, you must first make sure you have built and installed both the C client libraries and the zookeeper Python module. There are two prerequisite packages: the cppunit development package and the Python development package. On yum-based systems, these are named cppunit-devel and python-devel. Both packages are available through standard platform package managers like yum, apt and Darwin ports.

As a prerequisite to building the C client libraries, the Java-based server must be built. This auto-generates some header files that the C libraries rely on. From the root directory of the downloaded distribution:

ant

The C client libraries for ZooKeeper must be installed as the Python module makes use of them to actually communicate with a ZooKeeper cluster. It’s easiest to build these from source. From the src/c directory, type the following:

autoreconf -if
./configure
make && sudo make install

The downloadable package contains the source code for the Python module. To build and install, one command should do the trick from the src/contrib/zkpython directory:

ant install

To test the installation, start a Python shell and type import zookeeper. If you don’t see any errors or warnings, the module has been built and installed successfully. The bindings have been tested with Python 2.3, 2.4, 2.5 and 2.6, and are known not to work with 2.2 and earlier. We haven’t yet tested them against Python 3.x - we’d love to hear your feedback about your experiences with the latest versions of Python.

To run the queue example, you must have a ZooKeeper server running on the local machine at port 2181 (to change the location of the server, edit the string passed to zookeeper.init). The Java-based server will have been built when you ran ant from the root directory of the distribution earlier. Before the server can run, it needs a configuration file to read:

cat >> conf/zoo.cfg <<EOF
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181
EOF

Now you can run bin/zkServer.sh start to start a standalone server on the local machine. To stop the server in the future, run bin/zkServer.sh stop.

You’re finally ready to run the queue example:

python queue.py

The example is very simple. It queues three items, and then dequeues them.

Wrapping up

I hope that I’ve shown you that ZooKeeper is a very useful system, with powerful primitives that make writing tricky distributed concurrent programs easier. There are many applications that ZooKeeper could help you build - lock servers, name services, metadata stores and even a unique kind of filesystem can be built in a straightforward way using the ZooKeeper API. The project is active and always looking for volunteers. ZooKeeper integration is already being built into HBase, and there are moves to bring greater reliability to Hadoop and HDFS by delegating some server functionality to ZooKeeper. As far as the Python bindings go, the next version will include better documentation, some more Python niceties such as default parameters and docstrings, and a more Pythonic wrapper object to wrap up some of the bookkeeping that ZooKeeper requires.

]]>
http://www.cloudera.com/blog/2009/05/28/building-a-distributed-concurrent-queue-with-apache-zookeeper/feed/
Announcing Cloudera Certification for Hadoop http://www.cloudera.com/blog/2009/05/28/cloudera-certification-for-hadoop/ http://www.cloudera.com/blog/2009/05/28/cloudera-certification-for-hadoop/#comments Thu, 28 May 2009 13:00:06 +0000 Christophe Bisciglia http://www.cloudera.com/blog/?p=705 As Hadoop continues to turn heads at startups and big enterprises alike, Cloudera has received several requests to offer certification in addition to our popular training programs.

Certification is a critical component of any software ecosystem, and especially so for open source projects with quickly expanding user bases. Certification allows developers to ensure their skills are up to date, and allows employers and customers to confidently identify individuals that are up for the challenge of solving big data problems with Hadoop.

To that end, we are happy to announce Cloudera Certification for Hadoop.

Starting next month, participants looking to document their experience from our Hadoop Training programs can register for our Cloudera Certified Hadoop Professional (CCHP) exam. You’ll receive a paper certificate, but more importantly, we’ll record your test date so we can verify your certification to third parties like employers and potential clients.

Our first scheduled certification exam is on June 23rd in Washington DC. Register here.

]]>
http://www.cloudera.com/blog/2009/05/28/cloudera-certification-for-hadoop/feed/