- by Assaf Yardeni
- May 03, 2012
- 1 comment
This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.
Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.
Before the Hadoop era
When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:
San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.
Case-Control Studies
A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.
Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.
When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.
The Practice of Data Science
The term “data science” has been subject to criticism on the grounds that it doesn’t mean anything, e.g., “What science doesn’t involve data?” or “Isn’t data science a rebranding of statistics?” The source of this criticism could be that data science is not a solitary discipline, but rather a set of techniques used by many scientists to solve problems across a wide array of scientific fields. As DJ Patil wrote in his excellent overview of building data science teams, the key trait of all data scientists is the understanding “that the heavy lifting of [data] cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.”
I have found a few more characteristics that apply to the work of data scientists, regardless of their field of research:
- Inverse problems. Not every data scientist is a statistician, but all data scientists are interested in extracting information about complex systems from observed data, and so we can say that data science is related to the study of inverse problems. Inverse problems arise in almost every branch of science, including medical imaging, remote sensing, and astronomy. We can also think of DNA sequencing as an inverse problem, in which the genome is the underlying model that we wish to reconstruct from a collection of observed DNA fragments. Real-world inverse problems are often ill-posed or ill-conditioned, which means that scientists need substantive expertise in the field in order to apply reasonable regularization conditions in order to solve the problem.
- Data sets that have a rich set of relationships between observations. We might think of this as a kind of Metcalfe’s Law for data sets, where the value of a data set increases nonlinearly with each additional observation. For example, a single web page doesn’t have very much value, but 128 billion web pages can be used to build a search engine. A DNA fragment in isolation isn’t very useful, but millions of them can be combined to sequence a genome. A single adverse drug event could have any number of explanations, but millions of them can be processed to detect suspicious drug interactions. In each of these examples, the individual records have rich relationships that enhance the value of the data set as a whole.
- Open-source software tools with an emphasis on data visualization. One indicator that a research area is full of data scientists is an active community of open source developers. The R Project is a widely known and used toolset that cuts across a variety of disciplines, and has even been used as a basis for specialized projects like Bioconductor. Astronomers have been using tools like AIPS for processing data from radio telescopes and IRAF for data from optical telescopes for more than 30 years. Bowtie is an open source project for performing very fast DNA sequence alignment, and the Crossbow Project combines Bowtie with Apache Hadoop for distributed sequence alignment processing.
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Apache Lucene and Apache Solr
Apache Lucene is a mature, high performance, full-featured Java API used for indexing and searching that has been around since the late nineties — it supports field-specific indexing and searching, sorting, highlighting, and wildcard searches, to name only a few. Everything in Lucene boils down to creating a document using artifacts such as email messages, HTML, PDF, XML, Word, Excel, etc, the contents of which will end up being parsed and added to Lucene documents as name/value pairs. There are a number of libraries available for extracting actual content, depending on what the artifact is. When extracting content from .msg email files, for instance, TIKA and POI are some useful libraries.
- by Alex Loddengaard
- December 06, 2011
- no comments
This guest blog post is from Alex Loddengaard, creator of FoneDoktor, an Android app that monitors phone usage and recommends performance and battery life improvements. FoneDoktor uses WibiData, a data platform built on Apache HBase from Cloudera’s Distribution including Apache Hadoop, to store and analyze Android usage data. In this post, Alex will discuss FoneDoktor’s implementation and discuss why WibiData was a good data solution. A version of this post originally appeared at the WibiData blog.
At last month’s Hadoop World, one of the sessions spotlighted FoneDoktor, an Android app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users. In this post, I’ll talk about how I used WibiData — a system built on Apache HBase from CDH — as FoneDoktor’s primary data storage, access, and analysis system.
WibiData is an integrated system for managing, analyzing and serving complex user data in support of investigative and operational analytic workloads. It leverages HBase to combine batch analysis and real time access within the same system, and integrates with existing BI, reporting and analysis tools. Having used Hadoop for over four years now, I was insanely impressed with the simplicity that WibiData brings to apps that need to store, access, and analyze massive amounts of user data. Read on for how I used it to build FoneDoktor.
What is FoneDoktor?
Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.
Background: Adverse Drug Events
An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.
Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.
Check out the Hadoop World 2011 conference agenda!
Find sessions of interest and begin planning your Hadoop World experience among the sixty breakout sessions spread across five simultaneous tracks at http://www.hadoopworld.com/agenda/.
Preview of Operations Track and Sessions
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Preview of Development Track Sessions
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
- by Sunil Sitaula
- September 28, 2011
- 7 comments
This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.
Big files are ideal for Hadoop. It can store them, provide fault tolerance, and process them in parallel when needed. As such, the first step in the journey to an email archival solution is to convert your email messages to large files. In this case we will convert them to flat files called sequence files, composed of binary key-value pairs. One way to accomplish this is to:
BusinessWeek recently published a fascinating article on Hadoop and Big Data, interviewing several Cloudera customers as well as our CEO Mike Olson. One of the things that has consistently exceeded our expectations is the diversity of industries that are adopting Hadoop to solve impressive business challenges and create real value for their organizations. Two distinct use cases that Hadoop is used to tackle have emerged across these industries. Though these have different names in each industry, the mechanics have clear parallels that cross domains.
Data Processing:
Data Processing is Hadoop’s original use case. By scaling out the amount of data that users could store and access in a single system then distributing the document and log processing used to index, and extract patterns from this data, Hadoop made a direct impact on the web and online advertising industries early on. Today, data processing means more than sessionization of click stream data, index construction or attribution for advertising. Hadoop is used to process data by commerce, media and telecommunications companies in order to measure engagement, and handle complex mediation. Retail and financial institutions use Hadoop to understand customer preferences, better target prices and reconcile trades. Most recently we’re seeing Hadoop used for time series and signal processing in the energy sector and genome mapping and alignment among life sciences organizations.
Advanced Analytics:
Today, Hadoop is not only used for data processing, but also advanced analytics. A slightly ambiguous term, we’ve found organizations speaking of advanced analytics as a way of referring to the types of analytics that are challenging or impossible using tools that are optimized for relational analysis. These new challenges include social network analysis and smarter targeting by web and advertising companies, content optimization by a wide variety of publishers and network analysis at media and telecommunications companies. Our retail customers look at the effectiveness of loyalty programs well beyond the classic market basket analysis as customer engagement crosses online and real world interactions. Financial institutions are able to take a deeper look at complex fraud and risk among their customers and throughout their internal systems. Both entity analysis and analysis based on next generation sequencing have recently been made possible using native Hadoop libraries.
- Overview
- Downloads
- Learn Hadoop
- Get Support
-
Blog
- Avro (11)
- Careers (10)
- CDH (29)
- Cloudera Manager (10)
- Cloudera's Service And Configuration Manager (6)
- Community (86)
- Connector (6)
- Data Collection (13)
- Distribution (34)
- Flume (6)
- General (237)
- Guest (35)
- Hadoop (146)
- HBase (40)
- HDFS (26)
- Hive (22)
- MapReduce (37)
- Oozie (4)
- Pig (15)
- Sqoop (9)
- Testing (5)
- Training (18)
- Use Case (11)
- Whirr (1)
- ZooKeeper (10)
- Archives by Month

Hadoop was created by 