Developer Center
Cloudera Blog · MapReduce Posts

Using Hadoop for Fraud Detection and Prevention

Fraud has multiple meanings and the term can be easily abused.  The definition of fraud has undergone multiple changes throughout the years and is elusive as well as fraud itself.  The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state/country.  For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.  A more general definition may contain up to 9 elements.

From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.

Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding citizens), is intentional (there is no “accidental” fraud), as well as imply a significant damage caused to the defrauded party (otherwise why bother).  Fraud detection is difficult from statistical point of view for exactly these reasons: (a) the events are rare and it is difficult to build a predictive model and (b) fraud assumes a real human being behind it and incorporates elements of game theory since the fraudster is often an insider who knows how to game the system.

Hadoop Administrator Training Comes to London

Cloudera’s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.

Enter the code “london_10pct” when registering and receive a 10% discount!

Hadoop World: NYC – Training

Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.

We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects we have the training you are looking for.

Hadoop/HBase Capacity Planning

Hadoop and HBase are gaining popularity due to their flexibility and tremendous work that has been done to simplify their installation and use.  This blog is to provide guidance in sizing your first Hadoop/HBase cluster.  First, there are significant differences in Hadoop and HBase usage.  Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum).  HBase is much better for real-time read/write/modify access to tabular data.  Both applications are designed for high concurrency and large data sizes.  For a general discussions about Hadoop/HBase architecture and differences please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George blogs [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html].  We expect a new edition of the Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.

Migrating to CDH

With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.

Why Migrate?

If you’re not familiar with CDH3b2, here’s what you need to know.

All versions of CDH provide:

What’s New in CDH3b2: Oozie

Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure.  In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.

To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s distribution for Hadoop starting with the beta-2 release.

Why create a new workflow system?

You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available.  While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long running jobs and interfacing with the distributed file system require extensive work to port general workflow systems to the Hadoop environment. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly light-weight, requires minimal configuration, and scales linearly – thus offering a sustainable approach to building workflows in the Hadoop environment.

CDH2 Update 1 Now Available

Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3 we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.

A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.

We appreciate feedback! Get in touch with us on Get Satisfaction, twitter and IRC (#cloudera on freenode.net) and let us know how the update is working for you.

Highlights from the First Hadoop Contributors Meeting

While the vast majority of the Hadoop development discussion takes place on the Apache Jira and various project mailing lists, it’s often useful to meet face to face for high bandwidth discussion. To that end, Facebook hosted the first Hadoop contributors meeting yesterday at their campus in Palo Alto. Cloudera, Facebook, Yahoo! and the HBase team were well-represented. It was great to see a broad cross section of Hadoop developers in one room. Contributor meetings will be held on a monthly basis, at a rotating location. While any Hadoop project contributor is welcome to attend, the current focus of the meetings is HDFS and MapReduce. The goal of the discussion is to surface and flesh out ideas rather than make decisions, which happens on the development lists. If you’ve got ideas to add check out the meeting notes and continue the discussion.

Sanjay Radia kicked off the meeting with a discussion of development priorities. Hadoop has become a platform and industry standard for data storage and analytics. What advances are most important to users? How do we continue to innovate without disrupting the installed base? Development must maintain and improve the quality that has allowed companies to adopt Hadoop in their production environments. Fortunately there is broad agreement among contributors on development priorities: availability, compatibility, security, scalability and performance.

The next topic was releases. Tom White gave an update on the upcoming 0.21 release. The release has branched and is now in feature freeze; the focus is now shifting to closing out issues that block the release. You can hear more about the 0.21 release at the next Bay Area Hadoop user group. Chris Douglas recently proposed Hadoop adopt release managers. Owen O’Malley led a discussion on the role and responsibilities of the release manager, and the Hadoop release process.

CDH2: Testing Release now with Pig, Hive, and HBase

At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.

We plan on pushing new packages into the testing repository every 3 to 6 weeks.  And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:

Advice on QA Testing Your MapReduce Jobs

As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs–either invoked by developers before they commit a change or by a continuous integration server such as hudson–an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.

MR jobs are particularly difficult to test thoroughly because they run in a distributed environment.  This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.

As is the case with most testing scenarios, there are certain practices one can follow that have a low barrier to entry; such practices might do a fairly sufficient job of testing. There are also practices one can follow that are more complicated but perhaps result in more thorough testing. Let’s walk through some good QA practices, starting with the easiest and ending with the most complicated.

Traditional Unit Tests – JUnit, PyUnit, Etc.