Hadoop and the Cloudera Data Platform.
The first release (0.19.0) from the 0.19 branch of Hadoop Core was made on November 24. Many changes go into a release like this, and it can be difficult to get a feel for the more significant ones, even with the detailed Jira log, change log, and release notes. (There’s also JDiff documentation, which is a great way to see how the public API changed, via a JavaDoc-like interface.) This post gives a high-level feel for what’s new.
As a developer coming to Hadoop, you need to understand how testing is organized in the project. For the most part it is simple (it’s really just a lot of JUnit tests), but some aspects are not so well known.
Running Hadoop Unit Tests
Let’s have a look at some of the tests in Hadoop Core, and see how to run them. First check out the Hadoop Core source, and from the top-level directory type the following command. (Warning: it will take a few hours to run the whole test suite, so you may not want to do this straight away.)
ant clean test
A few weeks ago we ran a Hadoop hackathon. ApacheCon participants were invited to use our 10-node Hadoop cluster to explore Hadoop and play with some datasets that we had loaded in advance. One challenge we had to face was how to do this securely. Hadoop does not offer much in the way of security. It provides a rudimentary file permission system on its distributed filesystem, HDFS, but it does not authenticate the username you present. (Whatever username you use to start your local Hadoop client process is used as your HDFS username; this account does not need to exist on the machines that host the HDFS NameNode or DataNodes.)
Even more problematically, anyone who can connect to the JobTracker can submit arbitrary code to run with the authority of the account used to start the Hadoop TaskTrackers on each node.
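To make the username problem concrete, here is a minimal sketch in Java. It is a toy model, not Hadoop code: ToyFileSystem, ToyClient, and the file path are hypothetical names invented for illustration. It assumes only what is described above, namely that the filesystem checks permissions against whatever username the client reports and never verifies it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes are Hadoop APIs.
class ToyFileSystem {
    // Maps a path to the username that owns it (owner-only read access).
    private final Map<String, String> owners = new HashMap<>();

    void create(String path, String owner) {
        owners.put(path, owner);
    }

    // Permissions are checked against whatever name the client claims.
    // Nothing here verifies that the caller really is that user.
    boolean canRead(String path, String claimedUser) {
        String owner = owners.get(path);
        return owner != null && owner.equals(claimedUser);
    }
}

class ToyClient {
    private final String user;

    // In the scenario described above, this name comes from the local
    // account that started the client. It is passed in explicitly here
    // so the example is deterministic.
    ToyClient(String localUser) {
        this.user = localUser;
    }

    boolean read(ToyFileSystem fs, String path) {
        return fs.canRead(path, user);
    }
}

class TrustedUsernameDemo {
    public static void main(String[] args) {
        ToyFileSystem fs = new ToyFileSystem();
        fs.create("/user/alice/private.txt", "alice");

        // A client started under a different local account is refused.
        ToyClient mallory = new ToyClient("mallory");
        System.out.println(mallory.read(fs, "/user/alice/private.txt")); // false

        // The same person, after creating a local account named "alice"
        // on a machine they control, is let in.
        ToyClient impostor = new ToyClient("alice");
        System.out.println(impostor.read(fs, "/user/alice/private.txt")); // true
    }
}
```

The permission check itself works. The weakness is that the identity it relies on is chosen by the client, which is why controlling who can reach the cluster at all becomes the practical line of defense.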
While there is not a perfect solution to multitenancy in a Hadoop environment, by using a proxying gateway, you can at least control which users have access to your cluster. The rest of this post describes how to set up such a gateway configuration.
Hadoop was created by