The Hadoop hack is being run by Cloudera on a small dedicated cluster of machines running Hadoop 0.18.1. If you have not used Hadoop before, you can read more about it on the Hadoop website.
During the hack you will be able to run your own MapReduce programs against the data on the cluster -- hopefully these programs will produce something cool or of interest! There are prizes for the best hacks.
Set JAVA_HOME to point to a JDK 6 installation and HADOOP_HOME to point to the top level of the Hadoop distribution, then add $HADOOP_HOME/bin to your PATH.
Replace YOUR_USER_NAME with your username, as shown in the top left-hand corner of this page.
Check that your directory exists on the cluster:
hadoop fs -ls /user/username
If you get the message No such file or directory then your directory has not been created: please email hack-support@cloudera.com asking to have it set up before proceeding.
Once your directory exists, run a small job as a sanity check:
hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount /shared/enron-emails/small/a* sanity-check
You can track the job on the job tracker page at http://server1.cloudera.com:50030/. It should run without errors.
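The environment setup described above could be captured in a small shell snippet, for example in your shell profile. This is only a sketch: the two install locations used as defaults are hypothetical and must be changed to match your own machine.

```shell
# Hypothetical default locations (assumptions); existing values are kept if already set.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-6-sun}"
export HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-0.18.1}"

# Put $HADOOP_HOME/bin on PATH, but only if it is not there already.
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) ;;
  *) PATH="$HADOOP_HOME/bin:$PATH"; export PATH ;;
esac
```

With these variables in place, the hadoop fs -ls check and the sanity-check job should be runnable from any directory.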
For this hack we have the following datasets available. Please read the documentation for each dataset before using it, to learn about the format of the data and to get some ideas for what you might do with it.
| Dataset | What is it? | Approximate size | HDFS Path | Notes |
|---|---|---|---|---|
| Enron email | Emails from 150 employees of Enron, mostly executives | 1.3GB | /shared/enron-emails | Documentation |
| NCDC weather records | 100 years of global weather station readings | 138GB | /shared/ncdc-weather | Documentation |
| Spinn3r web crawl | A large chunk of the Spinn3r blog corpus | 100GB | /shared/spinn3r | Documentation |
| TIGER | US topological data | 0.5GB | /shared/tiger | Documentation |
| Wikipedia page contents | Current versions of each page on Wikipedia | 18GB | /shared/wiki-articles | Documentation |
| Wikipedia page contents with history | All revision history for each page on Wikipedia | 1TB | /shared/wiki-history | Documentation |
Each dataset also comes in smaller, cut-down versions ("small" and "micro") to facilitate testing. You are encouraged to download the "micro" version of each so you can unit test your code on your own machine. When the tests are passing, try running against the "small" version on the cluster. When you are happy with those results, try running against the full dataset. Remember that the cluster is a shared resource, so keep the number of full runs to the minimum necessary.
To run against a full dataset, you need to submit your job to a special queue (this is so we can monitor who is running jobs against full datasets). To do this you need to specify the Hadoop property queue.name to be large-datasets. For example:
hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount -D queue.name=large-datasets /shared/enron-emails/full enron-wc
If you don't set the queue, then the job will be submitted to the default queue and it will be rejected:
hadoop jar $HADOOP_HOME/hadoop-0.18.1-examples.jar wordcount /shared/enron-emails/full enron-wc
08/10/27 12:49:57 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/27 12:49:57 INFO mapred.FileInputFormat: Total input paths to process : 1
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot access path hdfs://hadoophack.cloudera.com:9000/shared/enron-emails/full from the default pool.
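To avoid forgetting the queue setting, one option is a small helper that builds the wordcount command and adds the queue.name property only for full datasets. The sketch below is illustrative: build_wordcount_cmd is a hypothetical name, and it assumes that full datasets are identified by a path component named "full", as in the Enron example above. It prints the command rather than running it, so you can check it first.

```shell
# build_wordcount_cmd INPUT OUTPUT
# Hypothetical helper (not part of Hadoop): prints the wordcount command line,
# adding the large-datasets queue when INPUT looks like a full dataset.
build_wordcount_cmd() {
  input=$1
  output=$2
  jar="$HADOOP_HOME/hadoop-0.18.1-examples.jar"
  case "$input" in
    # Assumption: a "full" path component marks a full dataset.
    */full|*/full/*) queue_opt="-D queue.name=large-datasets " ;;
    *) queue_opt="" ;;
  esac
  printf 'hadoop jar %s wordcount %s%s %s\n' "$jar" "$queue_opt" "$input" "$output"
}

# Example: prints the command for a full run and for a small run.
build_wordcount_cmd /shared/enron-emails/full enron-wc
build_wordcount_cmd '/shared/enron-emails/small/a*' sanity-check
```

The input path is quoted in the second call so that the local shell does not try to expand the wildcard.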
You can use any Hadoop tools to write your hack, for example the Java MapReduce API, Streaming, Pig, or Hive. Here are some notes on these APIs as they pertain to this cluster.
There is a pre-built version of Pig (svn revision 708935) suitable for use with this cluster. The pig.jar binary is stored in HDFS, and you can retrieve it with the following command:
hadoop fs -get /tools/pig/pig.jar pig.jar
Then to run Pig, you can just type:
java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main
and you should see it connect to the cluster.
See the Pig wiki for more.
There are status pages available for MapReduce and HDFS at http://server1.cloudera.com:50030/ and http://server1.cloudera.com:50070/ respectively. These only work if you have configured FoxyProxy correctly, and have your tunnel open to gateway.cloudera.com.
The email address hack-support@cloudera.com will reach Cloudera engineers who can help you with technical difficulties you encounter. We also have an IRC server running on gateway.cloudera.com:6667. We'll be in #hadoophack, and can help you there via internet chat. You are also all encouraged to idle there and help one another too :)
Also, sign up for our Google Group at http://groups.google.com/group/hadoop-hack.