Cloudera Blog · Hadoop Posts

Hadoop World 2010: Speaker Highlights

Hadoop is increasingly being adopted by Fortune 500 enterprises. The speakers featured at Hadoop World this year include leading companies that have created new value for their businesses using Hadoop. The presentations focus on how Hadoop is solving business problems for these enterprises. Below are three examples of leading enterprises that will present how Hadoop has impacted their businesses.

Linden Hillenbrand, Product Manager at GE, will talk about how Hadoop has improved GE’s Marketing & Communications functions. One capability GE has implemented is assessing the external perception of GE (positive, neutral, or negative) through various marketing campaigns.

Using Hadoop for Fraud Detection and Prevention

Fraud has multiple meanings and the term can be easily abused. The definition of fraud has changed many times over the years and is as elusive as fraud itself. The modern legal definition of fraud usually contains a few elements that have to be proven in court, and it depends on the state or country. For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage. A more general definition may contain up to nine elements.

From a statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.

Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding), that it is intentional (there is no “accidental” fraud), and that it causes significant damage to the defrauded party (otherwise why bother). Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, which makes it difficult to build a predictive model, and (b) there is a real human being behind the fraud, which introduces elements of game theory, since the fraudster is often an insider who knows how to game the system.
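To make point (a) concrete, here is a minimal Python sketch of why rare events are hard to model. It scores a trivial "classifier" that labels every transaction as legitimate. The function name and the transaction counts are hypothetical, chosen only for illustration; they do not come from any real data set.

```python
def always_legit_baseline(n_transactions: int, n_fraud: int) -> dict:
    """Score a model that predicts 'not fraud' for every transaction."""
    true_positives = 0                       # it never flags anything
    true_negatives = n_transactions - n_fraud
    accuracy = (true_positives + true_negatives) / n_transactions
    recall = true_positives / n_fraud        # share of real fraud that was caught
    return {"accuracy": accuracy, "recall": recall}


# Hypothetical numbers: 100 fraudulent transactions out of 100,000.
result = always_legit_baseline(n_transactions=100_000, n_fraud=100)
print(result)
```

With these assumed numbers the baseline is right 99.9% of the time and still catches no fraud at all, which is why accuracy alone says little about a fraud model and why measures such as recall matter when the positive class is rare.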

Hadoop World: NYC – Training

Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.

We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or a closer look at related Hadoop projects, we have the training you are looking for.

Hadoop/HBase Capacity Planning

Hadoop and HBase are gaining popularity due to their flexibility and the tremendous work that has been done to simplify their installation and use. This blog post provides guidance on sizing your first Hadoop/HBase cluster. First, there are significant differences between Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool for running analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum). HBase is much better for real-time read/write/modify access to tabular data. Both applications are designed for high concurrency and large data sizes. For a general discussion of Hadoop/HBase architecture and differences, please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George’s blog [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We expect a new edition of Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.

Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig

Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact sales@cloudera.com for more information regarding on-site training, or visit www.cloudera.com/hadoop-training to view our public course schedule.

Cloudera’s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.

Our Hive and Pig course is designed for developers who are skilled with SQL or scripting languages, but who are not Java experts. Hive and Pig are two approaches which allow non-Java programmers to access and manipulate massive amounts of data while abstracting away the complexities of MapReduce. Hive offers an SQL-like interface, while Pig’s scripting language, named Pig Latin, is very easy for developers to learn. This course covers both technologies, and includes multiple hands-on exercises to reinforce key concepts.

More on Cloudera’s Distribution for Hadoop 3
Integrating Hive and HBase

This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.

As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse.  Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems.  Hive storage is based on Hadoop’s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs).  However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.

Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables.  Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions.  This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.

That’s where HBase comes in. HBase is a scale-out table store which can support a very high rate of row-level updates over massive amounts of data. It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.
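The idea of supporting row-level updates on top of write-once files can be illustrated with a small toy model in Python. This is only a sketch of the general technique (buffer recent writes in memory, flush them to new immutable files, and merge files later); the class and method names are hypothetical and this is not HBase's actual API or implementation.

```python
class ToyRowStore:
    """Toy model: a mutable in-memory buffer plus immutable flushed 'files'."""

    def __init__(self, flush_threshold: int = 2):
        self.flush_threshold = flush_threshold
        self.memstore = {}   # recently updated rows, mutable
        self.files = []      # immutable sorted snapshots, oldest first

    def put(self, row_key, value):
        # An update only ever touches memory, never an existing file.
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a brand-new file; old files stay untouched.
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, row_key):
        # The newest data wins: check memory first, then files newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for snapshot in reversed(self.files):
            for key, value in snapshot:
                if key == row_key:
                    return value
        return None

    def compact(self):
        # Merge all files into one new file, keeping the latest value per row.
        merged = {}
        for snapshot in self.files:          # oldest first, so newer overwrite
            merged.update(snapshot)
        self.files = [tuple(sorted(merged.items()))] if merged else []


store = ToyRowStore(flush_threshold=2)
store.put("user1", "status A")
store.put("user2", "status B")   # buffer is full, so it is flushed to a file
store.put("user1", "status C")   # update lands in memory, file is unchanged
print(store.get("user1"))
store.flush()
store.compact()
print(store.files)
```

In this toy model no file is ever modified in place, which is the property an append-only filesystem requires; updates become visible because reads consult the newest data first, and compaction bounds the number of files a read has to scan.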

Get Hadoop Training from Cloudera at the Hadoop Summit

We love getting together with other Hadoop fans and fanatics! We’ve put together new training offerings for this year’s upcoming Hadoop Summit in June, and we’ve worked out a special deal with Yahoo! to waive the conference registration fee for anyone who attends a Cloudera training session at the 2010 Hadoop Summit (you’ll get a discount code for training in your conference registration confirmation). In addition to our developer certification course, we’ll offer an extended version of our Systems Administration course, as well as a new, full-day course on HBase. One particularly exciting new offering is our full-day course on Hive, which opens Hadoop up to anyone who knows SQL.

All of these offerings are driven by direct customer feedback about what their organizations need to be even more successful with Hadoop, and we’re excited to help. We look forward to seeing you there.

Introducing Cloudera Desktop

Today at Hadoop World NYC, we’re announcing the availability of Cloudera Desktop, a unified and extensible graphical user interface for Hadoop. The product is free to download and can be used with either internal clusters or clusters running on public clouds.

At Cloudera, we’re focused on making Hadoop easy to install, configure, manage, and use for all organizations. While many utilities exist for developers who work with Hadoop, Cloudera Desktop targets beginning developers and non-developers in an organization who’d like to get value from the data stored in their Hadoop cluster. By working within a web browser, users avoid the tedious client installation and upgrade cycle, and system administrators avoid custom firewall configurations.

We’ve worked closely with the MooTools community to create a desktop environment inside of a web browser that should be familiar to navigate for most users. The desktop environment has other advantages: it’s extensible to hundreds of applications and allows for data to be shared between applications.

The Project Split

Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?

The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.

My presentation slides cover which mailing lists to subscribe to, where the source repositories are located, and how to compile and run the development version of Hadoop.