Developer Center
Cloudera Blog · Guest Posts
How Treato Analyzes Health-related Social Media Big Data with Hadoop and HBase

This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.

Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.

Before the Hadoop era

When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:

How I found Hadoop

This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.

A year ago I rolled my first Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.

I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.

Apache Avro at RichRelevance

This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.

In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these were sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.

After analysis similar to Doug Cutting’s blog post, we chose Apache Avro. As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs. On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.

Avoiding Version Management Baggage

Nominations Are Open for the 2011 Government Big Data Solutions Award

This post was contributed by Bob Gourley, editor, CTOvision.com.

The missions and data of governments make the government sector one of particular importance for Big Data solutions. Federal, State and Local governments have special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-Industry teams are working to field Big Data solutions in all these fields.

The Government Big Data Solutions Award was established by the technology blog CTOvision.com to help facilitate the exchange of best practices, lessons learned and creative ideas for solutions to hard data challenges in the government sector. Nominations are being sought and your input would be most appreciated. Please nominate capabilities and solutions you know hold great potential for mission impact in the government sector.

Evolution of Hadoop Ecosystem: AOL Advertising Experience

Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of R&D distributed ecosystem comprising more than thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas.

A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.

Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data in order to compute proper impression value given the unique circumstances of ad serving events such as the characteristics of the impression, the ad, and the user as well as the content and context. As a general rule, more data results in more accurate predictions.

Migrating from Elastic MapReduce to a Cloudera’s Distribution including Apache Hadoop Cluster

This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion‘s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.


This is the first of a two part series about moving away from Amazon’s EMR service to an in-house Hadoop cluster.

When we first started using Hadoop, we went down the path of Amazon’s EMR service.  We had limited operational resources and wanted to get up and running quickly.  After a while, we starting hitting the limitations of EMR and had to migrate towards managing our own cluster.  In doing so we did not want to lose the features of EMR we found useful – mainly the ease of cluster setup.

Biodiversity Indexing: Migration from MySQL to Hadoop


This post was contributed by The Global Biodiversity Information Facility development team.

The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.

CDH 3 Demo VM installation on Mac OS X using VirtualBox

The first task is to ensure that your system is up-to-date.

This procedure has been tested on the following configuration:

Using Hadoop to Measure Influence

Background

Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers range from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.

When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.

Measuring influence is a bit like trying to measure an emotion like hate or jealousy. It’s really hard and takes a boatload of data.

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

This is a guest repost from the DataXu blog. Click here to view the original post.

I recently evaluated several serialization frameworks including Thrift, Protocol Buffersand Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

After reviewing many candidates, Apache Avro proved to be the best solution.