Hadoop and HBase are gaining popularity due to their flexibility and tremendous work that has been done to simplify their installation and use. This blog is to provide guidance in sizing your first Hadoop/HBase cluster. First, there are significant differences in Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum). HBase is much better for real-time read/write/modify access to tabular data. Both applications are designed for high concurrency and large data sizes. For a general discussions about Hadoop/HBase architecture and differences please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George blogs [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We expect a new edition of the Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.
With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.
Why Migrate?
If you’re not familiar with CDH3b2, here’s what you need to know.
All versions of CDH provide:
- by Patrick Hunt
- July 12, 2010
- no comments
CDH3 beta 2 is the first version of CDH to incorporate Apache ZooKeeper. ZooKeeper is a highly reliable and available coordination service for distributed processes. It is a proven technology and a well established open source project at Apache (sub-project of Hadoop).
ZooKeeper is distributed coordination
Often distributed applications need some way to coordinate across processes; locking resources, managing queues of events, electing a “leader” process, configuration, etc… Coordination operations such as these are notoriously hard to get right. ZooKeeper provides a relatively simple API which allows clients to correctly implement these and many other coordination mechanisms.
ZooKeeper is itself a replicated service based on a quorum algorithm. One or more ZooKeeper servers form what’s called an “ensemble”, which are in constant communication. As the size of the ensemble increases the reliability of the service itself increases – as long as a majority of the configured ensemble servers are available the service is available. As an example, say you have an ensemble of size three (three ZooKeeper servers), if one of the three fail the service is still “up”. If two of the three fail the service is down. One could run with five servers, in which case if two servers fail the service as a whole would still be available. Seven server ensembles can survive three failures, and so on.
In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution – a kind of distributed barrier – is surprisingly difficult to do correctly. ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes – locks are just another kind of coordination primitive – in the form of small files that ZooKeeper tracks.
In order to be useful, ZooKeeper must be both highly reliable and available as systems will rely upon it as a critical component. For example, if locks cannot be taken, processes cannot make progress and the whole system will grind to a halt. ZooKeeper is built on a suite of reliable distributed systems techniques and protocols, and is typically run on a cluster of machines so that if some should fail, the remaining ones can continue to provide service. Under the hood, ZooKeeper is responsible for ordering calls made by clients so that each request is processed atomically and in a fixed and firm order.
One of my first contributions to the project was a set of bindings to allow programs written in the Python language to act as clients to a ZooKeeper cluster. ZooKeeper was natively written in Java, and there are already C and Perl bindings. Adding Python bindings increases the number of people that can use the system, and brings the strengths of Python, such as rapid prototyping, to bear when designing distributed systems.
Hadoop was created by