Cloudera Blog

January 2012 Bay Area HBase User Group meetup summary + HBaseCon announcement

More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California.  Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments.

One special item of note: Michael Stack announced HBaseCon 2012, taking place this spring in the Bay Area.  This inaugural conference will focus on the growth and education of the HBase community.  While details of the event are not yet published, the call for speakers is currently open.  Submit your abstract here.

Many of the talks focused on HBase operations.  Here’s a summary of those presentations:

Seismic Data Science: Reflection Seismology and Hadoop

When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists, such as astronomers, geneticists, and geophysicists, are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.

The Practice of Data Science

The term “data science” has been subject to criticism on the grounds that it doesn’t mean anything, e.g., “What science doesn’t involve data?” or “Isn’t data science a rebranding of statistics?” The source of this criticism could be that data science is not a solitary discipline, but rather a set of techniques used by many scientists to solve problems across a wide array of scientific fields. As DJ Patil wrote in his excellent overview of building data science teams, the key trait of all data scientists is the understanding “that the heavy lifting of [data] cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.”

I have found a few more characteristics that apply to the work of data scientists, regardless of their field of research:

  1. Inverse problems. Not every data scientist is a statistician, but all data scientists are interested in extracting information about complex systems from observed data, and so we can say that data science is related to the study of inverse problems. Inverse problems arise in almost every branch of science, including medical imaging, remote sensing, and astronomy. We can also think of DNA sequencing as an inverse problem, in which the genome is the underlying model that we wish to reconstruct from a collection of observed DNA fragments. Real-world inverse problems are often ill-posed or ill-conditioned, which means that scientists need substantive expertise in the field to apply reasonable regularization conditions and solve the problem.
  2. Data sets that have a rich set of relationships between observations. We might think of this as a kind of Metcalfe’s Law for data sets, where the value of a data set increases nonlinearly with each additional observation. For example, a single web page doesn’t have very much value, but 128 billion web pages can be used to build a search engine. A DNA fragment in isolation isn’t very useful, but millions of them can be combined to sequence a genome. A single adverse drug event could have any number of explanations, but millions of them can be processed to detect suspicious drug interactions. In each of these examples, the individual records have rich relationships that enhance the value of the data set as a whole.
  3. Open-source software tools with an emphasis on data visualization. One indicator that a research area is full of data scientists is an active community of open source developers. The R Project is a widely known and used toolset that cuts across a variety of disciplines, and has even been used as a basis for specialized projects like Bioconductor. Astronomers have been using tools like AIPS for processing data from radio telescopes and IRAF for data from optical telescopes for more than 30 years. Bowtie is an open source project for performing very fast DNA sequence alignment, and the Crossbow Project combines Bowtie with Apache Hadoop for distributed sequence alignment processing.
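
To make the first point above concrete, here is a minimal sketch of why an ill-conditioned inverse problem needs regularization. Everything in it is an illustrative assumption rather than something from the post: the 2x2 forward model, the noise level, the penalty weight `lam`, and the choice of Tikhonov (ridge) regularization as the technique.

```python
import math


def solve_2x2(m, rhs):
    """Solve the 2x2 linear system m @ x = rhs with Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def tikhonov_2x2(m, rhs, lam):
    """Solve the regularized normal equations (m^T m + lam * I) x = m^T rhs."""
    (a, b), (c, d) = m
    off_diagonal = a * b + c * d
    mtm = [[a * a + c * c + lam, off_diagonal],
           [off_diagonal, b * b + d * d + lam]]
    mtb = [a * rhs[0] + c * rhs[1], b * rhs[0] + d * rhs[1]]
    return solve_2x2(mtm, mtb)


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


# Hypothetical forward model: its two rows are nearly identical, so the
# matrix is ill-conditioned (determinant of about 1e-4).
forward = [[1.0, 1.0], [1.0, 1.0001]]
truth = [1.0, 1.0]

# Noise-free data would be [2.0, 2.0001]; the second observation is
# perturbed by 0.001 to stand in for measurement noise.
observed = [2.0, 2.0011]

naive = solve_2x2(forward, observed)
regularized = tikhonov_2x2(forward, observed, lam=0.01)

print("naive:      ", naive, "error:", distance(naive, truth))
print("regularized:", regularized, "error:", distance(regularized, truth))
```

In this toy setup the unregularized solve lands near [-9, 11], far from the true model [1, 1], while the regularized solve stays within about 0.01 of it. Choosing `lam` well is where the domain expertise mentioned above comes in: too small and the noise dominates, too large and the answer is biased toward zero.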

Apache HBase 0.92.0 has been released

Today the Apache HBase community has proudly released Apache HBase 0.92.0, a major new version of the scalable distributed data store inspired by Google’s BigTable.  Over 670 issues were addressed, so in this post I’ll highlight some of the major features and enhancements and describe what they mean for HBase users, admins, and developers.

User Features

While the most visible change to the project is the new project logo, the most important changes for users are the performance and robustness improvements to HBase’s core functionality. On the performance side, there are a few major highlights:

Hadoop World 2011 Videos and Slides Available

Last November, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies, took place in New York City. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO Mike Olson summarizes Hadoop World 2011 in these final remarks.

Those who attended Hadoop World know how difficult it can be to navigate two days of five parallel tracks of compelling content, particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for you to peruse, share, and reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.

I’d like to take this opportunity to again thank all who participated in making Hadoop World 2011 a success: sponsors, attendees, the Sheraton New York Hotel & Towers and the Hadoop World production team.

Hadoop World Resource Links

Apache Sqoop: Highlights of Sqoop 2

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate.

What is Sqoop?

Capacity Planning with Cloudera Manager

If you’re like a myriad of other systems administrators out there, you may be running a production Hadoop cluster, spec’ing one out, or just starting to investigate the possibility of bringing Hadoop into your workplace. As any of these folks will be able to tell you, one of the most important tasks you’ll encounter is capacity planning. With the release of Cloudera Manager 3.7, we’re bringing you a new set of tools to aid you in this process. In this post, we’ll take a look at how you can leverage Cloudera Manager to deal with some common scenarios that you might run into while planning out a Hadoop cluster.

Questions and Patterns

How is my disk usage growing over time?

One very interesting disk usage pattern can be seen in Josh’s recent blog post on his analysis of drug interactions. Josh started with a relatively small data set, containing about one million records. However, during one of the stages of his analytic process, the number of records was blown up from one million to three trillion. Many types of analyses can result in very large intermediate data sets, while the final output may be just a fraction of the intermediate data. The consequence is temporary spikes in disk usage, which must be understood in order to plan a Hadoop deployment appropriately.
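
A back-of-the-envelope calculation shows how large such a spike can be. The sketch below is illustrative only: the 16-byte record size is a hypothetical figure, not taken from Josh's analysis, and it assumes every intermediate record is materialized in HDFS at the default replication factor of 3. Map outputs spilled to local disk are not replicated, so a replication of 1 may be the better estimate for shuffle data.

```python
TERABYTE = 10 ** 12  # decimal terabyte, in bytes


def disk_bytes(records, bytes_per_record, replication=3):
    """Raw disk consumed by a data set stored with the given replication."""
    return records * bytes_per_record * replication


def to_terabytes(num_bytes):
    """Convert a byte count to decimal terabytes."""
    return num_bytes / TERABYTE


# Record counts come from the post; the record size is an assumption.
INPUT_RECORDS = 1_000_000
INTERMEDIATE_RECORDS = 3_000_000_000_000
BYTES_PER_RECORD = 16  # hypothetical

input_usage = disk_bytes(INPUT_RECORDS, BYTES_PER_RECORD)
peak_usage = disk_bytes(INTERMEDIATE_RECORDS, BYTES_PER_RECORD)
growth_factor = INTERMEDIATE_RECORDS // INPUT_RECORDS

print("input:       ", to_terabytes(input_usage), "TB")
print("intermediate:", to_terabytes(peak_usage), "TB")
print("growth:      ", growth_factor, "x")
```

Under these assumptions the input occupies 48 MB of raw disk while the intermediate stage would need 144 TB, a factor of three million. Plugging in your own record size and replication setting gives a first estimate of the headroom a cluster needs for jobs of this shape.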

Cloudera Manager – Thank You Customers!

Bala Venkatrao is the Director of Product Management at Cloudera.

As many of you know, we recently launched Cloudera Enterprise 3.7. Here’s the link to the press release. This release marked a transition from Cloudera Management Suite (CMS) to Cloudera Manager (CM), the industry’s first and most comprehensive management application for Apache Hadoop. Over the last month we have received very positive feedback from our customers. I want to thank again all the Clouderans who spent countless hours bringing this product to market. I also want to take this opportunity to thank our customers for helping us get here, as many of them helped us to prioritize the key features for this release. Several customers have also shared the challenges/use cases from their Hadoop deployments and the need for specific features (more later) in Cloudera Manager. Many customers were actively involved in usability testing sessions for Cloudera Manager, which were immensely helpful!

At Cloudera, we strive hard to listen to our customers and help build products to address their needs.  We hold regular meetings with customers, sharing early design prototypes and feature ideas and then quickly iterating on the feedback we receive. Cloudera Manager is the result of this amazing collaboration with our customers, and we look forward to continuing this partnership as we build on our vision to make it even easier for our customers to manage their Hadoop environments.

Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance

Cloudera users gain more choice, tighter Oracle integration. Cloudera partners gain increased validation of their platform choice.

Ed Albanese
Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

Summary: Oracle has selected Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager software as core technologies on the Oracle Big Data Appliance, a high performance “engineered system.” Oracle and Cloudera announced a multiyear agreement to provide CDH, Cloudera Manager, and support services in conjunction with Oracle Support for use on the Oracle Big Data Appliance.

Hadoop Selected for the InfoWorld 2012 Technology of the Year Award

Great news! The InfoWorld Test Center has chosen Apache Hadoop for a 2012 Technology of the Year Award. Judged by InfoWorld Test Center editors and reviewers, the annual awards identify the best and most innovative products on the IT landscape. Winners are drawn from all of the products tested during the past year, with the final selections made by InfoWorld’s Test Center staff. All products reviewed by the Test Center are eligible to win, and we at Cloudera are very excited that Hadoop was named among the finalists.

I joined Cloudera in 2011 and it’s been very exciting for me to join the Hadoop community and participate in what, by all accounts, was a landmark year. It’s been fantastic to see how Hadoop has empowered companies of every size and in every industry to do new and interesting things with their data. And this is just the beginning. 2012 promises to bring even more innovation and great use cases for this game-changing platform.

The impact of Hadoop is not just on the financial statements of corporate America. In addition to opening up new revenue streams for companies and helping them achieve ever greater levels of operational efficiency, I am particularly impressed with how Hadoop has enabled the use of data for the betterment of society and the human experience. There are so many ways Hadoop has already made a difference in our lives. Here are some of my favorite examples:

Hadoop in 2011

2011 was a breakthrough year for Apache Hadoop as many more mainstream organizations large and small turned to Hadoop to manage and process Big Data, while enterprise software and hardware vendors have also made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and Big Data interest is not restricted to Big Companies.

Apache Hadoop Releases

Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.

1.0.0 adds HDFS support for HBase, WebHDFS, and HDFS performance improvements.