Hack Getting Started Documentation

The Hadoop hack is being run by Cloudera on a dedicated small cluster of machines running Hadoop 0.18.1. If you have not used Hadoop before, you can read more about it on the Hadoop website.

During the hack you will be able to run your own MapReduce programs against the data on the cluster -- hopefully these programs will produce something cool or of interest! There are prizes for the best hacks.

Getting started

  1. Get an account if you haven't done so already by registering. Follow the steps outlined in the registration process. Note that account creation can take a few minutes, so if you cannot run jobs immediately, then be patient. If you can't run jobs within 10 minutes, then please email Hadoop hack support.
  2. Install Hadoop 0.18.1 on your local machine. Download the distribution from Apache, unpack it, and set the following environment variables: JAVA_HOME to point to a JDK 6 installation, HADOOP_HOME to point to the top-level of the Hadoop distribution, and finally add $HADOOP_HOME/bin to your PATH.
  3. Set up hadoop-site.xml. To run against the hack cluster, download this copy of hadoop-site.xml (note: you may have to click File > Save As; some browsers may try to parse the XML) and drop it into $HADOOP_HOME/conf. Then open it in an editor and replace the two occurrences of YOUR_USER_NAME with your username, as shown in the top left-hand corner of this page.
  4. Log in to the gateway machine and change your password. ssh to username@gateway.cloudera.com using the username you signed up with here. Your (one-time) password is the same as your username. You will then need to change this password to something else of your choosing. Do so. Your connection to the server will be terminated immediately afterward.
  5. Set up a SOCKS proxy. Use an ssh client to set up a SOCKS proxy to the gateway node by running: ssh -D 2600 username@gateway.cloudera.com. If you are on Windows, you can use OpenSSH in Cygwin for this, or the PuTTY ssh client. If you're using PuTTY, you should add a "Dynamic" tunnel on port 2600. Enter the password you just set for yourself. Then minimize this ssh connection and leave it running; the steps below depend on it.
  6. Configure FoxyProxy so you can access the web interface. If you want to see the JobTracker and NameNode status pages, you'll need to configure a web browser to use your SOCKS proxy. This is most straightforward with Firefox:
    • Download FoxyProxy from http://foxyproxy.mozdev.org/downloads.html. Install the extension and restart Firefox.
    • If it prompts you to configure FoxyProxy, click "yes"; if not, go to Tools > FoxyProxy > Options.
    • In the proxies tab, click "add new proxy."
      • Proxy name: Hadoop hack
    • Under "proxy details," select Automatic proxy configuration URL, and set it to http://www.cloudera.com/sites/default/files/proxy.pac
    • Click "OK"
    • Under Mode select Use proxy "Hadoop hack" for all URLs
    • Click "Close" to exit all the options.
    • You should now be able to surf the web regularly, while still redirecting the appropriate traffic through the SOCKS proxy to access cluster information.
  7. Test FoxyProxy by navigating to http://server1.cloudera.com:50030/. This should show you a MapReduce status page. If not, check the configuration steps above. For this step (and all steps below), the ssh tunnel with -D 2600 must be open.
  8. Check your user directory in HDFS has been set up. When you registered for the hack, a user directory should have been created for you by Cloudera support. Check it's there by typing the following (substituting your username):
    hadoop fs -ls /user/username
    

    If you get the message No such file or directory then your directory has not been created: please email hack-support@cloudera.com asking to have it set up before proceeding.

  9. Check you can run a MapReduce program. It's worth trying to run a simple MapReduce program to check everything is working smoothly. Run the following job:
    hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount /shared/enron-emails/small/a* sanity-check
    

    You can track the job using the job tracker page at http://server1.cloudera.com:50030/. It should run without errors.

  10. Write a hack. Look at the datasets we have made available below, then create a page that describes your hack. You can update this page with notes as you work on the hack. You can also attach files to the page, and have discussions using the commenting feature. You can see all the hacks currently being worked on here.
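
Steps 2 and 3 above can be scripted. The following is only a sketch under stated assumptions: the two default install paths are hypothetical placeholders, and set_hack_user is an illustrative helper name, not something shipped with Hadoop or provided by the hack organizers.

```shell
# Step 2: environment variables. Both default paths are placeholders;
# point them at your own JDK 6 and your unpacked Hadoop 0.18.1 directory.
: "${JAVA_HOME:=/usr/lib/jvm/java-6-sun}"   # hypothetical JDK 6 location
: "${HADOOP_HOME:=$HOME/hadoop-0.18.1}"     # hypothetical unpack location
export JAVA_HOME HADOOP_HOME

# Add $HADOOP_HOME/bin to PATH only if it is not already there.
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) ;;
  *) PATH="$HADOOP_HOME/bin:$PATH"; export PATH ;;
esac

# Step 3: replace every YOUR_USER_NAME in a config file with your username.
# Usage: set_hack_user <username> <path-to-hadoop-site.xml>
# Assumes the username contains no "/" or "&" (both are special to sed).
set_hack_user() {
  user="$1"
  conf="$2"
  # Write to a temporary file first so a failed sed cannot truncate the config.
  sed "s/YOUR_USER_NAME/$user/g" "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
}
```

After downloading hadoop-site.xml into $HADOOP_HOME/conf, you would call something like: set_hack_user yourname "$HADOOP_HOME/conf/hadoop-site.xml". Editing the file by hand works just as well.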

The Datasets

For this hack we have the following datasets available. Please read the documentation for each dataset before using it, to find out more about the format of the data, and some ideas for what you might do with it.

Dataset | What is it? | Approximate size | HDFS Path | Notes
--- | --- | --- | --- | ---
Enron email | Emails from 150 employees of Enron, mostly executives | 1.3GB | /shared/enron-emails | Documentation
NCDC weather records | 100 years of global weather station readings | 138GB | /shared/ncdc-weather | Documentation
Spinn3r web crawl | A large chunk of the spinn3r blog corpus. | 100GB | /shared/spinn3r | Documentation
TIGER | US topological data | 0.5GB | /shared/tiger | Documentation
Wikipedia page contents | Current versions of each page on wikipedia | 18GB | /shared/wiki-articles | Documentation
Wikipedia page contents with history | All revision history for each page on wikipedia | 1TB | /shared/wiki-history | Documentation

Writing Your Hack

Each dataset also comes in smaller, cut-down versions ("small" and "micro") to facilitate testing. You are encouraged to download the "micro" version of each so you can unit test your code on your own machine. When the tests pass, try running against the "small" version on the cluster. When you are happy with those results, try running against the full dataset. Remember that the cluster is a shared resource, so keep the number of full runs to the minimum necessary.

To run against a full dataset, you need to submit your job to a special queue (this is so we can monitor who is running jobs against full datasets). To do this you need to specify the Hadoop property queue.name to be large-datasets. For example:

hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount -D queue.name=large-datasets /shared/enron-emails/full enron-wc

If you don't set the queue, then the job will be submitted to the default queue and it will be rejected:

hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount /shared/enron-emails/full enron-wc
08/10/27 12:49:57 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/27 12:49:57 INFO mapred.FileInputFormat: Total input paths to process : 1
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot
access path hdfs://hadoophack.cloudera.com:9000/shared/enron-emails/full
from the default pool.
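
To avoid forgetting the queue option, the rule above can be wrapped in a small helper. This is a sketch only: queue_opt_for and wordcount_cmd are illustrative names, and the helper assumes that full datasets always sit under a path component named "full", which the Enron example suggests but which should be checked in each dataset's documentation. It prints the command rather than running it.

```shell
# Print the extra option needed for a given HDFS input path.
# Assumption: full datasets live under a path component named "full".
queue_opt_for() {
  case "$1" in
    */full|*/full/*) echo "-D queue.name=large-datasets" ;;
    *) echo "" ;;
  esac
}

# Print (not run) the wordcount command for an input path and output directory.
wordcount_cmd() {
  opt=$(queue_opt_for "$1")
  echo "hadoop jar \$HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount ${opt:+$opt }$1 $2"
}

# Small input: no queue option. Full input: large-datasets queue.
wordcount_cmd '/shared/enron-emails/small/a*' sanity-check
wordcount_cmd /shared/enron-emails/full enron-wc
```

Once the printed command looks right, paste it into your shell to submit the job.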

Tools

You can use any Hadoop tools to write your hack, for example the Java MapReduce API, Streaming, Pig, or Hive. Here are some notes on these APIs as they pertain to this cluster.

Pig

There is a pre-built version of Pig (svn revision 708935) suitable for use with this cluster. The pig.jar binary is stored in HDFS, and you can retrieve it with the following command:

hadoop fs -get /tools/pig/pig.jar pig.jar

Then to run Pig, you can just type:

java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main

and you should see it connect to the cluster.

See the Pig wiki for more.

Status Pages

There are status pages available for MapReduce and HDFS at http://server1.cloudera.com:50030/ and http://server1.cloudera.com:50070/ respectively. These only work if you have configured FoxyProxy correctly, and have your tunnel open to gateway.cloudera.com.
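
If you want to check the tunnel without a browser, something like the following may help. It is a sketch that assumes curl is installed, that the SOCKS proxy opened with ssh -D 2600 is listening on localhost:2600, and that both status pages answer a plain request at their root URL; the function names are illustrative.

```shell
# Assumes the SOCKS proxy from "ssh -D 2600 ..." is listening on localhost:2600.
SOCKS_PROXY="${SOCKS_PROXY:-localhost:2600}"

# Return 0 if the given URL answers through the SOCKS proxy, non-zero otherwise.
# --socks5-hostname makes curl resolve the hostname on the gateway side.
check_status_page() {
  curl --silent --fail --max-time 15 --output /dev/null \
       --socks5-hostname "$SOCKS_PROXY" "$1"
}

# Check both status pages and print one line per page.
check_cluster_pages() {
  for url in http://server1.cloudera.com:50030/ http://server1.cloudera.com:50070/; do
    if check_status_page "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}
```

Run check_cluster_pages after opening the tunnel. A FAIL line most likely means the ssh connection has dropped or is using a different port.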

Getting Help

The email address hack-support@cloudera.com will reach Cloudera engineers who can help you with technical difficulties you encounter. We also have an IRC server running on gateway.cloudera.com:6667. We'll be in #hadoophack, and can help you there via internet chat. You are also all encouraged to idle there and help one another too :)

Also, sign up for our Google Group at http://groups.google.com/group/hadoop-hack.