Cloudera Blog · Careers Posts
Caching in HBase: SlabCache

This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.

Background

The amount of memory available on a commodity server has increased drastically, in step with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is good for databases such as HBase, which rely on in-memory caching to boost read performance.

However, despite the availability of high-memory servers, the garbage collection algorithms available in production-quality JDKs have not caught up. Attempting to use a large heap results in occasional stop-the-world pauses that are long enough to cause stalled requests and timeouts, noticeably disrupting latency-sensitive user applications.

Garbage Collection

How I found Hadoop

This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.

A year ago I rolled my first Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.

I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.

My Internship at Cloudera

David joined us as part of our intern program, and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we’re pleased to publish.

The project

My intern project was to build a log searching tool specialized for Apache Hadoop. My mini-app allows Hadoop cluster admins and operators to search their error logs across many machines, filter by time range or by text in the log message, and, for example, find the namenode machine. The results are then ordered by time and shown to the user.

This project was inspired by the extreme wizardry required to search logs with traditional tools such as grep and ssh (or parallel ssh), especially since these tools do not order the results by time. Ordering by time is very important, as it allows you to triage the sources of failures across your cluster and figure out where it all started.
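To make the idea of time-ordered search concrete, here is a minimal sketch in Python. It is not the Cloudera Manager implementation. The log line layout, the host names, and the `parse` and `search_logs` helpers are all assumptions made for illustration, and each host's log is assumed to be already sorted by time.

```python
import heapq
from datetime import datetime

# Assumed timestamp layout at the start of every log line (hypothetical).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse(host, line):
    """Turn one raw line into a (timestamp, host, message) tuple."""
    timestamp = datetime.strptime(line[:19], TIME_FORMAT)
    return timestamp, host, line[20:]


def search_logs(logs_by_host, start, end, text):
    """Return an iterator over matching entries from all hosts, ordered by time.

    logs_by_host maps a host name to its log lines, each list in time order.
    """
    def matches(host, lines):
        for line in lines:
            entry = parse(host, line)
            # Keep entries inside the time range that contain the search text.
            if start <= entry[0] <= end and text in entry[2]:
                yield entry

    streams = [matches(host, lines) for host, lines in logs_by_host.items()]
    # heapq.merge interleaves the already-sorted per-host streams by timestamp.
    return heapq.merge(*streams)


if __name__ == "__main__":
    logs = {
        "namenode01": ["2011-10-01 12:00:05 ERROR disk full"],
        "datanode07": ["2011-10-01 12:00:01 ERROR connection refused"],
    }
    window = (datetime(2011, 10, 1, 12, 0, 0), datetime(2011, 10, 1, 12, 0, 8))
    for timestamp, host, message in search_logs(logs, *window, "ERROR"):
        print(timestamp, host, message)
```

Merging sorted streams instead of concatenating and re-sorting keeps memory use proportional to the number of hosts, which matters when each machine holds a large log.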

How do I feel about my project in retrospect?

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive

My Summer Internship at Cloudera

This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.

When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.

My project was to develop a unified view of our customer data. The requirements were simple: pull in data from a variety of systems, group it by customer, and display it. The goal is that when someone at Cloudera needs to see all of the key information about our customers, it is available in one place. In addition, downloading and grouping data will make performing analysis much easier, allowing us to draw new insights about our business and our customers.

Cloudera in The Cube with Silicon Angle TV at Strata Conference 2011

The consensus from the Cloudera attendees of the O’Reilly Strata Conference last week was that the data-focused conference was nearly pitch perfect for the data scientists, practitioners and enthusiasts who attended the event. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees, and was well run in general.

One of the cool activities at the conference was live streaming video brought to us by the good folks at SiliconAngle. Using a mobile production system called The Cube, SiliconAngle hosts John Furrier (@furrier) and Dave Vellante interviewed industry luminaries and up-and-comers while bringing their own perspective. Even after streaming live for nearly two days, the hosts were still able to keep the energy high and the tone light.

In the interviews below, John and Dave talk with Amr Awadallah, CTO and Co-Founder of Cloudera (@awadallah), and John Kreisa, VP Marketing at Cloudera (@marked_man), followed by Sarah Sproehnle, director of education at Cloudera. The interviews cover many different aspects of Cloudera and Apache Hadoop.

Interview 1

Top 10 Blog Posts of 2010

We blogged about 104 different topics in 2010, and we recently decided to look back and see what folks were most interested in reading. The featured topics ranged from technical updates on Cloudera’s Distribution for Apache Hadoop (CDH3b3 being the most recent) to highlights of upcoming Hadoop-related events and activities to practical insights for implementing Hadoop. We also featured a number of guest blog posts.

Here are the top 10 blog posts from 2010:

  1. How to Get a Job at Cloudera
    Cloudera is hiring around the clock, and this blog highlights the best course of action to increase your chances of becoming a Clouderan.
  2. Why Europe’s Largest Ad Targeting Platform Uses Hadoop
    “As data volumes increased and performance suffered, we recognized a new approach was needed (Hadoop).” –Richard Hutton, Nugg.ad CTO
  3. What’s New in CDH3b2 Flume
    Flume, our data movement platform, was introduced to the world and into the open source environment.
  4. What’s New in CDH3b2 Hue
    Hue, a web UI for Hadoop, is a suite of web applications as well as a platform for building custom applications with a nice UI library.
  5. Natural Language Processing with Hadoop and Python
    Data volumes are increasing naturally from text (blogs) and speech (YouTube videos) posing new questions for Natural Language Processing. This involves making sense of lots of data in different forms and extracting useful insights.
  6. How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store
    Raytheon BBN Technologies built a cloud-based triple-store technology, known as SHARD, to address scalability issues in the processing and analysis of Semantic Web data.
  7. Cloudera’s Support Team Shares Some Basic Hardware Recommendations
    The Cloudera support team discusses workload evaluation and the critical role it plays in hardware selection.
  8. Integrating Hive and HBase
    Facebook explains integrating Hive and HBase to keep their warehouse up to date with the latest information published by users.
  9. Pushing the Limits of Distributed Processing
    Google built a 100,000 node Hadoop cluster running on Nexus One mobile phone hardware and powered by Android. The environmental cost of this solution is 1/100th the equivalent of running it within their data center. (April Fools)
  10. Using Flume to Collect Apache 2 Web Server Logs
    This post presents the common use case of using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.

Aside from How to Get a Job at Cloudera, Cloudera blog readers viewed posts related to CDH and its components, posts exemplifying possibilities with Hadoop in production, and posts highlighting integrations with Hadoop.

Cloudera Fun & Frightful Halloween Festivities

Here at Cloudera we embraced the lighthearted spirit of Halloween by hosting several activities, including an engineering hack-a-thon, a hack-a-pumpkin-a-thon, and a costume competition.

Cloudera Corporate & a chicken, aka Cloudera engineers at their finest.

What is in our Kitchen?

If there is one thing that chefs are proud of, it’s their kitchens. Whether cavernous top-of-the-line affairs or cramped New York apartments, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens.

How to Get a Job at Cloudera

We’re doing a lot of hiring at Cloudera — we have jobs open in operations, sales, engineering and elsewhere. Hiring well is hard work. We spend a lot of time on it, and have learned a lot about the kind of people we want to bring in. One of the best ways for us to do a good job of hiring is to help you do a good job of applying for a job here.

I’ll begin the post, though, by telling you what doesn’t work. Several times a day, we get an unsolicited email or phone message from a contingency recruiter like this one:

I specialize in the industry and wanted to contact you to let you know that I have a strong candidate for your [deleted] position, and wanted to know if you would like to review the resume that I have? My candidate is interested in interviewing as soon as possible.