<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; guest</title>
	<atom:link href="http://www.cloudera.com/blog/category/guest/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>How Treato Analyzes Health-related Social Media Big Data with Hadoop and HBase</title>
		<link>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/#comments</comments>
		<pubDate>Thu, 03 May 2012 13:00:51 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[Cloudera Case Study]]></category>
		<category><![CDATA[Hadoop Case Study]]></category>
		<category><![CDATA[Hadoop in Healthcare]]></category>
		<category><![CDATA[hadoop use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=14627</guid>
		<description><![CDATA[This is a guest post by Assaf Yardeni, Head of R&#38;D for Treato, an online social healthcare solution, headquartered in Israel. Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post by Assaf Yardeni, Head of R&amp;D for Treato, an online social healthcare solution, headquartered in Israel. </em></p>
<p>Three years ago I joined <a href="http://treato.com/" target="_blank">Treato</a>, a social healthcare analysis firm, to help <a href="http://www.treato.com/" target="_blank">treato.com</a> scale up to its present capability. Treato is a new source for healthcare information where health-related user-generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.</p>
<h2 style="font-size: 14pt; color: #243543;">Before the Hadoop era</h2>
<p>When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:</p>
<ul>
<li>Crawl the Web and fetch raw HTML sources,</li>
<li>Extract the user-generated content (i.e. users’ posts) out of the raw sources,</li>
<li>Extract concepts from the posts and index them,</li>
<li>Execute semantic analysis on the posts using natural language processing (NLP) algorithms,</li>
<li>And calculate statistics.</li>
</ul>
<p>The prototype proved the initial hypothesis: relevant medical insights can be found in social media if you know how to analyze it. We collected data from dozens of websites, with individual social media posts numbering in the tens of millions. We had a handful of text analysis algorithms and could only process a couple million posts per day, but the results were impressive. We found that we were able to identify side effects through social media long before the FDA or pharmaceutical companies issued initial warnings about them. For example, when we looked at the discussions about Singulair – an asthma medication – we found that almost half of the user-generated content discussed mental disorders. When we looked back through the historical data, we learned that this would have been identifiable in our data four years before the official warning.</p>
<p>In order to gain even more health-related insights, we knew we needed a solution that could crawl and process a larger quantity of data – larger by an order of magnitude. That was the point at which Web scale joined the game. In order to collect massive amounts of posts, we needed to add thousands of data sources. And, of course, all the data we collected would need to be analyzed.</p>
<p>Dealing with a few dozen websites was difficult and costly. But we were able to scale up our Microsoft code to handle collection from several hundred sites, and could process around 250 million posts. We were running a few old IBM boxes that did the collection work and had developed a job manager that administered crawling and fetching tasks. Different servers ran the indexing and the stats calculations, and we had developed a distributed job manager to direct task executions. Yet other servers were used for serving the data. We didn&#8217;t have any shared storage solution, and all of the boxes worked with local drives.</p>
<p>Besides the fact that administering the process was hell, it was expensive in terms of CPU, network and input/output (I/O); e.g., after each stage, the data needed to be moved to a different server for the next stage. In addition, our job manager didn’t deal with failures; every time a task failed we needed to handle it manually. Needless to say, supporting collection and analysis of thousands of websites would have been impossible using this approach.</p>
<h2 style="font-size: 14pt; color: #243543;">Looking at scale</h2>
<p>In the beginning of 2010, we started searching for solutions that could support the capabilities we wanted. The requirements included:</p>
<ol>
<li>Reliable and scalable storage.</li>
<li>Reliable and scalable processing infrastructure.</li>
<li>Search engine capabilities (for retrieving posts) with high availability (HA).</li>
<li>Scalable real-time store for retrieving stats, with HA.</li>
</ol>
<p>We wanted the ability to periodically reprocess the data in a timely manner, so new algorithms or other analysis improvements would take effect on all historical data.</p>
<p>We wanted to know how much it costs to deal with X number of posts, and to be able to scale according to this formula.</p>
<p>We wanted a technology and architecture that would scale with the business.</p>
<p>We searched for answers to questions such as: &#8220;How does Google do it?” and it didn&#8217;t take too long to find Google&#8217;s papers, documentation on Hadoop and MapReduce, and so on.</p>
<p>We started digging deeper in these areas. After a short investigation, it was clear that the Hadoop Distributed File System (HDFS) would support our storage demands, and MapReduce would be a good fit for the processing infrastructure.</p>
<h2 style="font-size: 14pt; color: #243543;">First Hadoop cluster in the lab</h2>
<p>While looking for Hadoop distributions, I encountered <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a> (CDH). However, I decided to start with a manual installation, since this usually helps me better understand how things work. We started a pilot, setting up a two-node cluster on Linux boxes. As mentioned, the first installation was done entirely by hand, using the binaries downloaded from Apache and carefully configuring the system. This process was ugly: I needed to download all sorts of binaries from different sources, deal with networking issues, exchange SSH keys between the nodes, format the file system and apply all sorts of OS tweaks.</p>
<p>We started testing the behavior of the new technology, first with some simple WordCount and pi calculations, and then we quickly wrote MapReduce (Java) code that did parts of our processing and tested it on real HTML sources. The little cluster just worked: I was able to submit jobs &amp; monitor them; I tested recovery from task failures, crash of a node, etc.</p>
<p>Next, I wanted to see how this Hadoop solution scaled. To do this, I installed an additional box and added it to our little Hadoop cluster. It was awesome: after adding the new slave to the cluster, everything was transparent. Suddenly we had more capacity on the file system and more horsepower for processing. The job submission was the same as before; the job submitter (Hadoop client) didn&#8217;t even know that the cluster had changed, it simply got the results quicker. We were able to crunch some numbers and got a dollar-per-post cost.</p>
<p>So, the evaluation was great, but still there was the awful installation and maintenance process. That’s when we started to test <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a>; I think it was version 2 of CDH back then. We re-installed our little cluster from scratch using this Hadoop distribution. The installation process was much easier, and the documentation helped. The setup took only a couple of hours. (CDH3 takes less than an hour.)</p>
<p>After we found a good package, we wanted to set up a bigger cluster for prototyping, and deeper tests and evaluations. Amazon seemed to be the perfect place for that. Using CDH we set up a 10 node (small instances) cluster on EC2. This was used for performance evaluation and the processing rate was about 10M-20M posts per day &#8212; approximately 6 times higher than the performance from our pre-Hadoop solution.</p>
<p>We decided to go with Hadoop. This was a dramatic decision, as we took a company with a Microsoft-oriented development team and ported all the code into Java while adopting a new and very complicated technology stack. This actually meant starting implementation from the beginning: opening a new integrated development environment (IDE) and coding from scratch.</p>
<p>In order to reduce risks and avoid critical mistakes, we searched for someone who has &#8220;been there, done that&#8221; so we could learn from them and validate our overall planned new architecture. Cloudera was our first choice; it made sense to go with a company that has multiple setups behind them, some of which are very large clusters. Cloudera sent Solutions Architect, Lars George, to our offices for two days, and we gave him our suggested design in advance. We felt lucky to have Lars, an HBase committer and author of <a href="http://shop.oreilly.com/product/0636920014348.do" target="_blank"><em>HBase: The Definitive Guide</em></a>,<em> </em>as our consultant since HBase was one of the core technologies we were using.</p>
<p>For the first implementation phase, we decided to go with HDFS, MapReduce &amp; HBase. Our in-house-developed crawlers use HBase as the store for the list of URLs to be fetched. This table needs to scale to billions of rows. The fetcher (the component in charge of fetching the raw HTML sources) gets the URL queues out of HBase, runs HTTP requests, and stores the raw HTML sources in large files on top of HDFS (a few gigabytes per file). Neither the crawler nor the fetcher uses a relational database or any other kind of store except HDFS &amp; HBase. These two components are network and I/O intensive, but CPU is not much of an issue.</p>
<p>Next comes the processing. Each line in the HDFS files contains an HTML source and metadata related to this source. For each directory of files in HDFS, the following processing jobs need to be executed:</p>
<ol>
<li>Turn the unstructured HTML into a list of post entities (content, timestamp, etc.)</li>
<li>Process each post as follows:
<ul>
<li>Index key terms – extract medical concepts out of the post content, using Treato&#8217;s extensive knowledge base</li>
<li>Execute text analysis algorithms</li>
</ul>
</li>
<li>Calculate all statistics and update the HBase stats tables.</li>
<li>Post all documents (users’ posts) into our search engine (Solr).</li>
</ol>
<p>During this process, many database queries and updates are needed. For example, each post retrieved may potentially already exist in our system, and of course we don&#8217;t want to add a duplicate post to our system, nor invest processing power on documents we already have. In order to accomplish this, we need to calculate a hash for each post, and then check it against a database containing all of the existing hashes. For this purpose HBase works perfectly in terms of both latency and load.</p>
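<p>A minimal sketch of that duplicate check, in Java. The hash function (SHA-256 here) is an assumption, since it is not named above, and PostDeduplicator, HashStore and shouldProcess are hypothetical names used only for illustration. In production the store would be backed by an HBase table; the interface below only stands in for it.</p>
<pre class="code">
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PostDeduplicator {

    /** Hypothetical stand-in for the table of known post hashes (HBase in the article). */
    public interface HashStore {
        /** Stores the hash and returns true only if it was not already present. */
        boolean addIfAbsent(String hash);
    }

    /** SHA-256 of the post text, as 64 lowercase hex characters (assumed hash choice). */
    public static String hashOf(String postContent) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(postContent.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the post is new and should go through the costly processing stages. */
    public static boolean shouldProcess(HashStore store, String postContent) {
        return store.addIfAbsent(hashOf(postContent));
    }
}</pre>
<p>Any set-like structure can back HashStore in a test: for an in-memory java.util.HashSet named seen, the method reference seen::add already has the right semantics, because add returns true only for elements that were not yet present.</p>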
<p>After the design phase, we started implementation. All R&amp;D teams worked on porting their code into Java, and our Ops team worked on planning the data center (we decided on co-location data center setup).</p>
<p>For the initial setup, we had 11 boxes that comprised our Hadoop cluster, two of which were NameNodes in an active/passive mode (one was in standby for manual failover in case the active NameNode failed). Nine nodes were slaves hosting DataNode, TaskTracker and RegionServer daemons. In addition to this we had three boxes running ZooKeeper services.</p>
<p>The new system was capable of analyzing 50M posts per day. This was a significant performance improvement. In addition, it was reasonable to maintain, reliable and worked quite smoothly. Of course, there were bumps in the road, but in the end we managed to overcome them all.</p>
<p>We have continued to improve and expand the solution, and today we can process 150 – 200 million user posts per day. In subsequent blog posts, I will share in greater detail our system design, use of HBase, and cluster architecture.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/05/treato-analyzes-health-related-big-data-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How I found Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 13:00:44 +0000</pubDate>
		<dc:creator>Omer</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[Finding Hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10019</guid>
		<description><![CDATA[This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program. A year ago I rolled my first Hadoop system into production. Since then, I&#8217;ve spoken to quite a few people who are eager to try Hadoop themselves [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.</em></p>
<p>A year ago I rolled my first Hadoop system into production. Since then, I&#8217;ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to <a href="http://www.meetup.com/hadoopsf/" target="_blank">Hadoop Meetups in San Francisco</a>, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.</p>
<p>I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.</p>
<p>As I contemplated how doomed I would be should traffic levels increase much more, I remember wondering to myself, &#8220;How does Google deal with all that data?&#8221; The answer came to me in the form of the Google File System (GFS) paper and later the MapReduce paper, both from Google. It dawned on me that because Google was forced to solve a much larger problem, they had come up with an elegant solution for a whole range of more modest data problems running on commodity hardware. But it wouldn&#8217;t be until 2010 that I would get to work with this technology firsthand.</p>
<p>As I wrote in an <a href="http://www.cloudera.com/blog/2011/04/adopting-apache-hadoop-in-the-federal-government/" target="_blank">earlier article</a>, I started re-architecting USASearch, the U.S. government&#8217;s search system, in 2009 based on a solution stack of free, open source software including Ruby on Rails, Solr, and MySQL. A wave of d&#233;j&#224; vu hit me as I started worrying about what to do with the growing mountain of data piling up in MySQL and our increasing need to analyze it in different ways. I had heard that a new company called Cloudera, founded by some big data people from Yahoo!, Google, and Facebook, was making Hadoop available for the masses in a reliable distribution, much in the same way that Red Hat did for Linux. Curiosity got the best of me and I bought the newly minted <em>Hadoop: The Definitive Guide</em> from O&#8217;Reilly. The most insightful part of the book to me was the very first sentence. It&#8217;s a quote from Grace Hopper: &#8220;In pioneer days, they used oxen for heavy pulling, and when one ox couldn&#8217;t budge a log, they didn&#8217;t try to grow a bigger ox.&#8221; I didn&#8217;t want to grow a bigger server; I wanted to harness a bunch of small servers together to work in unison. The more I learned the more curious I got, so I started reading more. And that&#8217;s when I hit my first roadblock.</p>
<p>I think people who have been working with Hadoop technologies for years and years sometimes forget just how rich and diverse the big data software ecosystem has become, and how daunting it can be to folks approaching it for the first time. When people at the Meetups say they are evaluating solutions to their data scaling problem, the answers they hear sound something like this: &#8220;Just use Hadoop Hive Pig Mahout Avro HBase Cassandra Oozie Sqoop Flume ZooKeeper Cascading NoSQL RCFile. Oh, almost forgot&#8230;cloud.&#8221;</p>
<p>The thought of wading through all of that just to learn about what I needed to learn about was a bit too overwhelming for me, so I put the whole matter aside for a few months. Over time, I started to dive into each of these projects to understand the primary use case, how active the developer community was and which organizations were using it in production. I converged on the idea of using Hive as a warehouse for our data. I opted for <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank" title="Cloudera's Distribution including Apache Hadoop">Cloudera&#8217;s distribution</a> since I wanted to reduce the risk of running into compatibility issues between all the various subsystems. Having tracked down anomalies in a highly multi-threaded and contentious distributed Java system before, I liked the idea of someone else taking on that problem for me.</p>
<p>At some point, I had read everything I could read and grew impatient to get my hands dirty, so I decided to just download <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads"><abbr title="Cloudera's Distribution including Apache Hadoop 3">CDH3</abbr></a> on my laptop and give it a try. The <a href="https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial">tutorial</a> instructions for the standalone version worked, and I successfully computed more digits of pi than I ever thought I&#8217;d need. After creating some sample data in Hive and running a few queries, I felt pretty confident that Hive would be the right tool for the job. I just needed to find somewhere to install and run HDFS (namenode, secondary namenode, and data nodes), MapReduce (jobtracker and tasktracker nodes), Hive, and Hue for a nice front end to it all.</p>
<p>I knew from my past experience how to stretch the limits of CPU, disk, I/O, and memory on commodity servers, and I identified a few potential servers at our primary datacenter with resources I figured I could leverage. Once again I followed the tutorial instructions, this time for the fully distributed version of CDH3, and once again I started to compute pi. And that&#8217;s when I hit my second roadblock. It took me a few days to figure out that I had a problem with DNS. Each machine needs to be able to resolve every other machine&#8217;s name and IP in the cluster. Whether you do that via /etc/hosts or a local DNS server is up to you, but it needs to happen or the whole thing gets wedged. Once I got that sorted out, everything just started falling into place and I had Hive working in production within a few days. A week later, I started pulling out the MySQL jobs and deleting big tables, and that&#8217;s been the trend ever since.</p>
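<p>One way to catch this kind of DNS problem before starting any daemons is a small resolution check run on every node. The following is only a sketch using the JDK&#8217;s InetAddress class: ClusterDnsCheck and resolvesBothWays are hypothetical names, and treating a reverse lookup that yields nothing but the IP address as a failure is an assumption about what a cluster needs.</p>
<pre class="code">
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ClusterDnsCheck {

    /** Returns true if the name resolves and the address maps back to some host name. */
    public static boolean resolvesBothWays(String hostName) {
        try {
            // Forward lookup: name to IP, via /etc/hosts or DNS.
            InetAddress address = InetAddress.getByName(hostName);
            // Reverse lookup: getCanonicalHostName() returns the textual IP
            // when no name can be found for the address.
            String reverse = address.getCanonicalHostName();
            return !reverse.equals(address.getHostAddress());
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass every cluster host name as an argument and run this on each node.
        for (String hostName : args) {
            System.out.println(hostName + (resolvesBothWays(hostName) ? " OK" : " FAILED"));
        }
    }
}</pre>
<p>For example, with hypothetical host names: java ClusterDnsCheck node1 node2 node3. Any line ending in FAILED points at a machine that the current node cannot resolve in both directions.</p>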
<p>Over time, I&#8217;ve gone on to learn about using custom Ruby mappers in Hive, moving data back and forth between MySQL and Hive with Sqoop, and getting the data into HDFS in real-time with Flume. All of these components from the <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank" title="Cloudera's Distribution including Apache Hadoop">Cloudera distribution</a> are working nicely in our production environment now, and I sleep well at night knowing I have such a solid, deliberate plan for growth. My initial investment in learning about the Hadoop ecosystem is really paying dividends, but when I think about all those people at the Meetups stuck in evaluation mode, I feel their pain. Does it have to be such a struggle?</p>
<p>The big challenge in my opinion is not that any one piece of the puzzle is too difficult. Any reasonably smart (or in my case stubborn) engineer can set themselves on the task of learning about a new technology once they know that it needs to be learned. The challenge with the Hadoop ecosystem is that it presents the newbie with the meta-problem of figuring out which of these tools are appropriate for their use case at all, and whether or not to even consider the problem today versus deferring it until later. In a way Facebook has it easy, because when you are adding 15TB of data per day, that decision is pretty much made for you.</p>
<p>For all the companies sitting in the twilight between the gigabyte and the petabyte who don&#8217;t have Hadoop expertise in-house, there is a collection of free information to help guide people to the right solution space (<a href="https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial">Hadoop Tutorial</a>, <a href="http://www.cloudera.com/resources/White+Paper/">White Papers</a>). These days, when I talk to people who are evaluating solutions to their big data problems, my advice to them is to break down their problems into a few discrete use cases and then work on ferreting out the technologies that are designed for that use case. Get a proof of concept to demonstrate that the technology can address your use case and convince yourself and others that you&#8217;re on the right track. Work toward putting something simple into production. Lather, rinse, and repeat. I am still in that cycle myself, as these days I&#8217;m exploring <a href="http://hbase.apache.org/" target="_blank">HBase</a> and <a href="http://opentsdb.net/" target="_blank">OpenTSDB</a> to give me low-latency access to time series data and <a href="http://mahout.apache.org/" target="_blank">Mahout</a> to do frequent item set mining, but that&#8217;s another article for another day.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Apache Avro at RichRelevance</title>
		<link>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/</link>
		<comments>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 13:00:40 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[Apache Avro]]></category>
		<category><![CDATA[Guest Post]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10068</guid>
		<description><![CDATA[This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey. In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.</em></p>
<p>In early 2010 at <a href="http://www.richrelevance.com/" target="_blank">RichRelevance</a>, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time.  We had been using Hadoop for about a year, and started with the basics &#8211; text formats and SequenceFiles.  Neither of these was sufficient.  Text formats are not compact enough, and can be painful to maintain over time.  A basic binary format may be more compact, but it has the same maintenance issues as text.  Furthermore, we needed rich data types including lists and nested records.</p>
<p>After analysis similar to <a href="http://www.cloudera.com/blog/2011/07/avro-data-interop/" target="_blank">Doug Cutting&#8217;s blog post</a>, we chose <a href="http://avro.apache.org/" target="_blank">Apache Avro</a>.  As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs.  On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.</p>
<h2>Avoiding Version Management Baggage</h2>
<p>Have you ever seen code for manual serialization version management like the below?</p>
<pre class="code">
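// Read the version tag first, then branch on it for every field added later.
int version = input.readInt();
this.name = input.readString();
this.age = input.readInt();
if (version >= 2) {
  // favoriteColor was added in version 2
  this.favoriteColor = input.readString();
} else {
  // older records predate the field, so fall back to a default
  this.favoriteColor = "";
}</pre>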
<p>Manual version management is painful.  If you evolve what data you store continuously, it does not take long to end up with dozens of versions. In order to read every version that has been written and stored, your code has to carry a lot of baggage. </p>
<p>With Avro, you can avoid writing code like the above.  The concept is simple.  Store the schema used to write your data along with your data, and use it to make the written data conform to the schema that the reader expects.  If a field is missing, use the default.  If a field has been removed or moved, handle it.</p>
<p>Over the last two years, we have doubled the complexity of our page view schema across about 15 schema versions.  There is not one line of code that deals with version management, and the current code can read any of the data written over that time.  What may be a surprise to some is that our old code can read newly written data as well.  The data is both forward and backward compatible, within the rules in the <a href="http://avro.apache.org/docs/current/spec.html#Schema+Resolution" target="_blank">Avro Specification</a>.</p>
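<p>As an illustration only, here is roughly what a reader schema for the record in the manual example could look like.  The record name Person is hypothetical; the field names mirror the code sample.  Data written before favoriteColor existed carries a writer schema without that field, so schema resolution supplies the declared default instead of a hand-written version check.</p>
<pre class="code">
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "favoriteColor", "type": "string", "default": ""}
  ]
}</pre>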
<h2>Leveraging Complex Data Types</h2>
<p>Avro supports complex data types such as arrays, maps, enumerations, and nested records.  The Avro data model makes it possible to serialize any non-recursive data structure, including trees and heterogeneous lists.  We use this property to describe our events using Avro Schemas that map to natural object representations on our front end servers.  For example, one of the elements in a page view is an array of product recommendation sets, each set containing a list of products displayed.  Another element in a page view is what we call a page context &#8211; each type of page on a merchant&#8217;s site has a unique context that differs from other page types.  A product page context is the product being displayed.  A search page context is the search terms in the search query.  There are about 30 page context types, and we represent the range of page context possibilities using an Avro union, so that all of these different event variations can be written in the same format and to the same log.</p>
<p>With a simpler data model, one might have had to log each context type separately, making it harder to get a full picture of what happened in a single request during analysis.</p>
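<p>A cut-down sketch of how such a union can be declared in an Avro schema.  All record and field names here are hypothetical, and only two of the roughly 30 context types are shown; the point is that the context field accepts any one of the listed record types, so every page view lands in the same log.</p>
<pre class="code">
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "context", "type": [
      {"type": "record", "name": "ProductPageContext",
       "fields": [{"name": "productId", "type": "string"}]},
      {"type": "record", "name": "SearchPageContext",
       "fields": [{"name": "searchTerms", "type": "string"}]}
    ]}
  ]
}</pre>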
<h2>A New Vision for an Event Log</h2>
<p>With the above properties of Avro, we were able to formulate a new vision for what an event log should be. The new model has the following properties:</p>
<ul>
<li>A single HTTP request creates a single, atomic log event defined by an Avro schema</li>
<li>The event contains all of the resolved request inputs</li>
<li>The event contains the result of any decisions made during the request</li>
</ul>
<p>Together, these imply that it is never necessary to join different sets of data to reconstruct what happened in an individual request during analysis.  This also significantly reduces the value of data contained in raw HTTP logs, since the Avro-based logs become the origin for all major processing.  Since raw HTTP logs are much larger than compressed binary structured data, this greatly reduces the size of data we must keep for long periods of time.</p>
<h2>More Avro at RichRelevance</h2>
<p>We have built Hive and Pig adapters to map our Avro data into these tools for ad-hoc queries and automated tasks.  Additionally, we leverage the same Avro schemas from our log files to store click streams in HBase.  We also use Avro to store data compactly in key-value stores. </p>
<p>The log file example is what I call a schema first use case of Avro, where we define a schema for log events that can be used across different systems over a long period of time.  An alternative usage style is what I call code first, where you start with code and bind that to a serialization with a less schema-centric view.  I feel that the code first usage style is more applicable for data that lives for short or medium time scales, such as with RPC or MapReduce intermediates.  We will be deepening our investment in Avro and using it with code first use cases in the future, in the process working with the community to improve the developer experience for those use cases.</p>
<p>Avro is a growing, evolving project that I see as more broad than a serialization framework.  At heart Avro is about applying a schema to data, in order to manipulate that data in well defined ways.  Serialization, validation, and transformation are only some of the operations you can apply to data that conforms to a known schema.  Over time the project will grow to have more and more functionality centered around operations you can apply to data that conforms to an Avro Schema.  I look forward to working with the Avro community as the project continues to evolve!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nominations Are Open for the 2011 Government Big Data Solutions Award</title>
		<link>http://www.cloudera.com/blog/2011/10/nominations-are-open-for-the-2011-government-big-data-solutions-award/</link>
		<comments>http://www.cloudera.com/blog/2011/10/nominations-are-open-for-the-2011-government-big-data-solutions-award/#comments</comments>
		<pubDate>Thu, 13 Oct 2011 13:00:53 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9255</guid>
		<description><![CDATA[This post was contributed by Bob Gourley, editor, CTOvision.com. The missions and&#160;data&#160;of governments make the government sector one of particular importance for&#160;Big&#160;Data&#160;solutions. Federal, State and Local governments have special abilities to focus research in areas like&#160;Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-Industry teams are working [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was contributed by Bob Gourley, editor, <a href="http://ctovision.com/" target="_blank">CTOvision.com</a>.</em></p>
<p>The missions and&#160;data&#160;of governments make the government sector one of particular importance for&#160;Big&#160;Data&#160;solutions. Federal, State and Local governments have special abilities to focus research in areas like&#160;Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-Industry teams are working to field&#160;Big&#160;Data&#160;solutions in all these fields.</p>
<p>The Government&#160;Big&#160;Data&#160;Solutions&#160;Award&#160;was established by the technology blog CTOvision.com to help&#160;facilitate&#160;the exchange of best practices, lessons learned and creative ideas for solutions to hard data&#160;challenges in the government sector. Nominations are being sought and your input would be most appreciated. Please nominate capabilities and solutions you know hold great potential for mission impact in the government sector.</p>
<p>Award&#160;winners will be written up on CTOvision.com, and a presentation of awards will also be made at&#160;<a href="http://ctovision.com/2011/09/do-you-use-data-register-now-for-hadoop-world-2011-to-help-create-the-future/" target="_blank">Hadoop World 2011</a>.</p>
<p>You may nominate industry solution providers with proven capabilities, government organizations who have built or implemented&#160;Big&#160;Data&#160;solutions or individuals who have played a direct role in establishing capability. The key criterion we are evaluating is the utility of the solution to serve government missions.</p>
<p>Our judges include:&#160;Doug Cutting (creator of Lucene and Hadoop), Alan Wade (former CIA and IC CIO), Ryan Lasalle (Accenture Cyber R&amp;D), Ed Granstedt (QinetiQ Strategic Solution Center) and Chris Dorobek (Founder, editor and publisher of <a href="http://www.dorobekinsider.com/" target="_blank">DorobekInsider.com</a>).</p>
<p>To nominate, please use our online survey form at:&#160;<a href="http://crucialpointllc.us1.list-manage.com/track/click?u=4cb4c08d876d7481bbc4bc70f&amp;id=25c85fa938&amp;e=234fdd5ea5" target="_blank">https://www.surveymonkey.com/s/CNHB5QZ</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/nominations-are-open-for-the-2011-government-big-data-solutions-award/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Evolution of Hadoop Ecosystem: AOL Advertising Experience</title>
		<link>http://www.cloudera.com/blog/2011/07/evolution-of-hadoop-ecosystem-aol-advertising-experience/</link>
		<comments>http://www.cloudera.com/blog/2011/07/evolution-of-hadoop-ecosystem-aol-advertising-experience/#comments</comments>
		<pubDate>Tue, 12 Jul 2011 13:00:45 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[aol cloudera]]></category>
		<category><![CDATA[aol data management]]></category>
		<category><![CDATA[aol hadoop]]></category>
		<category><![CDATA[AOL use case]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8248</guid>
		<description><![CDATA[Pero works on research and development in new technologies for online advertising at Aol Advertising R&#038;D in Palo Alto. Over the past 4 years he has been the Chief Architect of R&#038;D distributed ecosystem comprising more than thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at [...]]]></description>
			<content:encoded><![CDATA[<p><em>Pero works on research and development in new technologies for online advertising at Aol Advertising R&#038;D in Palo Alto. Over the past 4 years he has been the Chief Architect of the R&#038;D distributed ecosystem comprising more than a thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas. </em></p>
<p>A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.</p>
<p>Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data in order to compute proper impression value given the unique circumstances of ad serving events such as the characteristics of the impression, the ad, and the user as well as the content and context. As a general rule, more data results in more accurate predictions.</p>
<p>In addition to Optimization, Reporting and Analytics provide indispensable feedback to our internal Business and Sales teams, helping us acquire new, and expand current, commitments from external customers.</p>
<p>At AOL, we started large-scale data collection more than 4 years ago and went from using heavily sampled data sets to being able to process full serving logs. We have been using Apache Hadoop since version 0.14 as a part of an R&amp;D effort and recently moved to the Cloudera <abbr title="Cloudera's Distribution including Apache Hadoop 3">CDH3</abbr> distribution. Gradually, we introduced more systems and technologies to our ecosystem around Hadoop.</p>
<p>We chose Hadoop for several reasons:</p>
<div style="margin-left:20px">
<ul>
<li>Ability to store, organize and process large data sets</li>
<li>Great flexibility with data formats</li>
<li>MapReduce offers a flexible data processing paradigm and works well with changing data</li>
<li>Excellent cost-volume/price-performance point, which proved very important in early proof-of-concept stages</li>
<li>Fault tolerance built into the system via distributed computation and data redundancy</li>
</ul>
</div>
<div style="text-align:center">
<p><img style="border:1px solid gray" src="https://www.cloudera.com/wp-content/uploads/2011/07/Fig1-Aol.png" alt="Line Graph Demonstrating AOL's Cluster Size [nodes] and Aggregate Disk Space [TB]" /></p>
<p style="margin-bottom:12px"><strong>Figure 1. Growth of Hadoop cluster</strong></p>
<p><img style="border:1px solid gray" src="https://www.cloudera.com/wp-content/uploads/2011/07/fig2-Aol-sampling-rate.png" alt="AOL's Sampling Rate" /></p>
<p><strong>Figure 2. Growth in sampling rate</strong></p>
</div>
<p>We show the growth of our Hadoop clusters in Figure 1 and the increase in the sampling rate in Figure 2. Between the 3<sup>rd</sup> and 4<sup>th</sup> iterations we switched to disks that are 4 times larger and used 4-8 times more cores per node. The increase in the total number of CPUs was even more pronounced, as we found we needed more processing power for newly developed processing flows. During the initial stages, growing the sampling rate was the primary goal. As the number of processing pipelines increased, the output data volume increased. We&#8217;ve also added more external data flows. These two trends drove the increase in total storage space and processing power beyond full log samples between stages 4 and 5. Note that other factors, such as the business environment and team growth, also had a significant impact on the pace of cluster upgrades.</p>
<p>At the same time, we grew the ecosystem around Hadoop to encompass other infrastructure and computational components such as databases, caching and high-performance computing clusters. As our Hadoop clusters increased in size, these components grew correspondingly to store and process larger data sets.</p>
<p>The main reason for the qualitative change in shifting between the 3<sup>rd</sup> and 4<sup>th</sup> iteration was the move from R&amp;D to a production environment. With involvement of additional teams we faced several challenges that Cloudera helped us with:</p>
<div style="margin-left:20px">
<ul>
<li>Specifying and executing operational requirements</li>
<li>Cluster setup</li>
<li>Staff training</li>
<li>Introducing other indispensable parts of Hadoop ecosystem such as robust data flows (Flume), monitoring and instrumentation</li>
<li>Ensuring that long-term vision and execution are aligned with Hadoop roadmap</li>
</ul>
</div>
<p>The last point is especially important as we see Hadoop as an ever-evolving data processing platform. We see ourselves as a contributor and partner in this process &#8211; through the recently introduced Cloudera Customer Council we participate in discussions and working groups. For us, this is a great learning experience which simultaneously provides ample opportunities for us to contribute to an important technology that is changing the way we do business.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/evolution-of-hadoop-ecosystem-aol-advertising-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Migrating from Elastic MapReduce to a Cloudera&#8217;s Distribution including Apache Hadoop Cluster</title>
		<link>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/</link>
		<comments>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 13:00:33 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Cloudera's Distribution including Apache Hadoop]]></category>
		<category><![CDATA[Elastic MapReduce]]></category>
		<category><![CDATA[Hadoop Migration]]></category>
		<category><![CDATA[Migrating to CDH]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8057</guid>
		<description><![CDATA[This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion&#8216;s Hadoop infrastructure to support Adconion&#8217;s next-generation ad optimization and reporting systems. This is the first of a two part series about moving away from Amazon&#8217;s EMR service to an in-house Hadoop cluster. When we first started [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out <a href="http://www.adconion.com/" target="_about">Adconion</a>&#8216;s Hadoop infrastructure to support Adconion&#8217;s next-generation ad optimization and reporting systems.</em></p>
<hr />
<p><em>This is the first of a two part series about moving away from Amazon&#8217;s <abbr title="Elastic MapReduce">EMR</abbr> service to an in-house Hadoop cluster. </em></p>
<p>When we first started using Hadoop, we went down the path of Amazon&#8217;s <abbr title="Elastic MapReduce">EMR</abbr> service.&#160; We had limited operational resources and wanted to get up and running quickly.&#160; After a while, we started hitting the limitations of EMR and had to migrate towards managing our own cluster.&#160; In doing so we did not want to lose the features of EMR we found useful &#8211; mainly the ease of cluster setup.</p>
<p><em>This first part of the series discusses our motivation for choosing and then moving away from EMR, while the second part deals with how we maintained ease of cluster setup using Puppet. </em></p>
<p>Many of our systems use Amazon&#8217;s S3 as a backup repository for log data.&#160; Our data became too large to process by traditional techniques, so we started using Amazon&#8217;s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3.&#160; The major advantage of EMR for us was the lack of operational overhead.&#160; With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shut down at the conclusion of the run.</p>
<p>We had two systems interacting with EMR.&#160; The first consisted of shell scripts to start an EMR cluster, run a Pig script, and load the output data from S3 into our data warehousing system.&#160; The second was a Java application that launched Pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.</p>
<p>The magic of spinning up and configuring a Hadoop cluster in EC2 was spectacular, but there were a few areas where we saw room for improvement.&#160; In particular:</p>
<p><strong>Performance &amp; Tuning</strong>. We were hit by the small-files problem, lack of data locality (data stored in S3 but processed on nodes of the EMR cluster), decompression (bz2) performance issues, and virtualization penalties.&#160; To solve these problems, we decided that we needed a non-transient cluster (to satisfy data locality), and a process to aggregate our logfiles into a Hadoop-friendly size and data format (we ultimately chose Avro). After crunching the numbers, it was evident that storing large amounts of data on an EC2 cluster quickly becomes expensive, and one still suffers from virtualization penalties (particularly since Hadoop is so I/O intensive), so we decided to build out a cluster using <a href="http://www.cloudera.com/hadoop/" target="_about"><abbr title="Cloudera's Distribution including Apache Hadoop 3">CDH3</abbr></a>.</p>
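<p>To make the log aggregation idea concrete, here is a minimal sketch of one way to plan such batches: small files are grouped greedily until a target size is reached. The function name, the file names and the target size are hypothetical illustrations and not Adconion&#8217;s actual process, which also converted the data format.</p>
<pre>
```python
from typing import Iterable, List, Tuple


def plan_batches(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches of roughly target_bytes.

    Hypothetical helper: each returned batch lists the small files that would be
    concatenated into one larger, Hadoop-friendly file.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical log files with sizes in bytes.
    logs = [("a.log", 40), ("b.log", 40), ("c.log", 40), ("d.log", 200)]
    print(plan_batches(logs, target_bytes=100))
```
</pre>
<p>A file larger than the target simply ends up in a batch of its own; in practice the target would be chosen relative to the HDFS block size.</p>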
<p><strong>Monitoring. </strong>Typically for us, a Pig script running on EMR was one step in a workflow, so we needed to monitor the status of the job to determine when it finished and the next steps could continue.&#160; While Amazon exposes a rich API for monitoring a job, we really wanted a more generic mechanism for monitoring all steps in a workflow, not just those on an EMR cluster.&#160; After considering a number of solutions, we ultimately chose to use Azkaban as our workflow engine for managing dependencies, alerting, and monitoring (which we added atop Azkaban ourselves).</p>
<p><strong>API Access.</strong> Interacting with a cluster only over an API is both a blessing and a curse.&#160; The API takes care of otherwise complicated mechanics, such as starting, configuring, and stopping the cluster.&#160; With that said, the calls to the EMR service are rate-limited, so it doesn&#8217;t scale very well for monitoring a number of clusters.&#160; Also, we found that we could continuously keep a cluster busy, and thus the EMR limitation of 100 or so jobs on a cluster meant that we had to build wrappers to periodically shut down and start up clusters.</p>
<p><strong>Lack of latest features.</strong> We were using Hadoop 0.18 and Pig 0.3 on EMR, which were missing many features that we wanted to try (e.g. JVM reuse, CombineInputFormats, and improved Pig optimization plans).&#160; Eventually, Amazon upgraded to Hadoop 0.20 and Pig 0.6, but even at that point <a href="http://www.cloudera.com/hadoop/" target="_about">Cloudera&#8217;s Distribution including Apache Hadoop</a> had backported many useful features such as performance improvements, monitoring enhancements, and additional APIs.&#160; In addition, <abbr title="Cloudera's Distribution including Apache Hadoop">CDH</abbr> provides a full suite of solutions, including Pig, Hive, Flume, and Sqoop, that we&#8217;re either actively using or planning to use.</p>
<p>For us, the major drawback to moving away from EMR was new operational overhead.&#160; Starting a cluster with an API call is incredibly useful, and we soon discovered that <abbr title="Cloudera's Distribution including Apache Hadoop">CDH</abbr> provided scripts for doing so (now there&#8217;s something even better, Apache Whirr).&#160; Eventually, we decided to move out of the cloud, though, so we wanted to build an infrastructure for maintaining a cluster that worked regardless of the hardware configurations.&#160; The RPMs for CDH3 and the great documentation on installing and configuring <a href="http://www.cloudera.com/hadoop/" target="_about"><abbr title="Cloudera's distribution including Apache Hadoop">CDH</abbr></a> from Cloudera helped to make this project much less intimidating.&#160; Ultimately, we built Puppet modules for configuring our cluster, which we&#8217;ll talk much more about in part two of this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera%e2%80%99s-distribution-including-apache-hadoop-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Biodiversity Indexing: Migration from MySQL to Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/#comments</comments>
		<pubDate>Tue, 21 Jun 2011 13:00:01 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[Biodiversity indexing with Hadoop]]></category>
		<category><![CDATA[Migrating to CDH]]></category>
		<category><![CDATA[Using Oozie and Sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8041</guid>
		<description><![CDATA[This post was contributed by The Global Biodiversity Information Facility development team. The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index [...]]]></description>
			<content:encoded><![CDATA[
<p><em>This post was contributed by The Global Biodiversity Information Facility development team.</em></p>
<p>The <a target="_about" href="http://www.gbif.org/">Global Biodiversity Information Facility</a> is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the <a target="_about" href="http://data.gbif.org/">Data Portal</a>; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the <a target="_about" href="http://hadoop.apache.org/">Hadoop</a> stack.</p>
<p>The Data Portal was launched in 2007. It consists of crawling components and a web application, implemented in a typical Java solution consisting of <a target="_about" href="http://www.springsource.org/">Spring</a>, <a target="_about" href="http://www.hibernate.org/">Hibernate</a> and <a target="_about" href="http://static.springsource.org/spring/docs/2.0.x/reference/mvc.html">SpringMVC</a>, operating against a <a target="_about" href="http://www.mysql.com/">MySQL</a> database. In the early days the MySQL database had a very normalized structure, but as content and throughput grew, we adopted the typical pattern of <a target="_about" href="http://en.wikipedia.org/wiki/Denormalization">denormalisation</a> and <a target="_about" href="http://en.wikipedia.org/wiki/Scalability#Scale_vertically_.28scale_up.29">scaling up</a> with more powerful hardware. By the time we reached 100 million records, the occurrence content was modeled as a single fixed-width table. Allowing for complex searches containing combinations of species identifications, higher-level groupings, locality, bounding box and temporal filters required carefully selected indexes on the table. As content grew it became clear that real time indexing was no longer an option, and the Portal became a snapshot index, refreshed on a monthly basis, using complex batch procedures against the MySQL database. During this growth pattern we found we were moving more and more operations off the database to avoid locking, and instead partitioned data into delimited files, iterating over those and even performing joins using text files by synthesizing keys, sorting and managing multiple file cursors. Clearly we needed a better solution, so we began <a target="_about" href="http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html">researching Hadoop</a>.&#160; Today we are preparing to put our first Hadoop process into production.</p>
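<p>For readers unfamiliar with the file-based joins described above, the following is a minimal sketch of a sort-merge join over two delimited text files, advancing one cursor per file. It assumes tab-delimited rows, unique keys on each side, and files already sorted by key in the same order Python uses to compare strings; the function names are hypothetical, and this is not GBIF&#8217;s actual batch code.</p>
<pre>
```python
import io
from typing import Iterator, TextIO, Tuple


def read_rows(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs from a tab-delimited file, one row per line."""
    for line in handle:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def merge_join(left: TextIO, right: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Inner-join two key-sorted files by advancing one cursor per file."""
    left_rows, right_rows = read_rows(left), read_rows(right)
    l, r = next(left_rows, None), next(right_rows, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            # Keys match: emit the joined row and advance both cursors.
            yield l[0], l[1], r[1]
            l, r = next(left_rows, None), next(right_rows, None)
        elif l[0] > r[0]:
            # Right side is behind: advance the right cursor.
            r = next(right_rows, None)
        else:
            # Left side is behind: advance the left cursor.
            l = next(left_rows, None)


if __name__ == "__main__":
    # Hypothetical in-memory stand-ins for two sorted, delimited files.
    species = io.StringIO("a\tPuma concolor\nb\tLynx lynx\nd\tFelis catus\n")
    places = io.StringIO("b\tNorway\nc\tChile\nd\tDenmark\n")
    for row in merge_join(species, places):
        print(row)
```
</pre>
<p>Keeping such code correct across many files, sort orders and duplicate keys is the kind of burden that tools like Hive take over.</p>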
<p>Our first objective is to <a target="_about" href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html">address the monthly processing we perform</a>. This area of work does not increase functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:</p>
<ul>
<li>Reduce the latency between a record changing on the publisher side, and being reflected in the index</li>
<li>Reduce the amount of (wo)man-hours needed to coax through a successful processing run</li>
<li>Improve the quality assurance by inclusion of
<ul>
<li>Checking that terrestrial point locations fall within the stated country using&#160;<a target="_about" href="http://www.naturalearthdata.com/">shapefiles</a></li>
<li>Checking coastal waters using&#160;<a target="_about" href="http://www.vliz.be/vmdcdata/marbound/">Exclusive Economic Zones</a></li>
</ul>
</li>
<li>Rework all the date and time handling</li>
<li>Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record</li>
<li>Integrate checklists (taxonomic, nomenclatural and thematic) shared through the&#160;<a target="_about" href="http://www.gbif.org/informatics/name-services/">GBIF ECAT Programme</a> to improve the taxonomic services, and the <a target="_about" href="http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html">backbone (&#8220;nub&#8221;) taxonomy</a>.</li>
<li>Provide a robust framework for future development</li>
<li>Allow the infrastructure to grow predictably with content and demand growth</li>
</ul>
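<p>The country check listed above boils down to a point-in-polygon test. Below is a minimal sketch using the classic ray-casting technique on a single ring of (longitude, latitude) vertices. The function name and the sample square are hypothetical; a real check would first read country boundaries from the shapefiles and handle multi-part polygons and holes, none of which is shown here, and this is not presented as GBIF&#8217;s implementation.</p>
<pre>
```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(lon: float, lat: float, polygon: List[Point]) -> bool:
    """Ray casting: count how many edges a ray heading east from the point crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Only edges that straddle the point's latitude can be crossed by the ray.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if x_cross > lon:
                inside = not inside
        j = i
    return inside


if __name__ == "__main__":
    # Hypothetical, heavily simplified "country" outline: a 10 x 10 degree square.
    country = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    print(point_in_polygon(5.0, 5.0, country))
    print(point_in_polygon(15.0, 5.0, country))
```
</pre>
<p>An odd number of crossings means the point lies inside the outline. Records that fail the test would be flagged for review rather than silently corrected.</p>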
<p>Things have progressed significantly since the early Hadoop investigations which included hand crafting MapReduce jobs, and GBIF are now developing using the following technologies:</p>
<ul>
<li><a target="_about" href="http://hadoop.apache.org/">Apache Hadoop</a>: A distributed file system and cluster processing using the Map Reduce framework
<ul>
<li>GBIF are using&#160;<a target="_about" href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution including Apache Hadoop</a></li>
<li><a target="_about" href="http://www.cloudera.com/downloads/sqoop/">Sqoop</a>: A utility to synchronize between relational databases and Hadoop </li>
<li><a target="_about" href="http://wiki.apache.org/hadoop/Hive">Hive</a>: A data warehouse infrastructure built on top of Hadoop, and developed and open-sourced by <a target="_about" href="http://www.royans.net/arch/hive-facebook/">Facebook</a>. Hive gives SQL capabilities on Hadoop, which is particularly attractive to a development team fluent in SQL. [Full table scans on GBIF occurrence records reduce from hours to minutes]</li>
<li><a target="_about" href="http://yahoo.github.com/oozie/">Oozie</a>: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by <a target="_about" href="http://developer.yahoo.com/hadoop/">Yahoo!</a></li>
</ul>
</li>
</ul>
<p>The processing architecture is depicted:</p>
<p><img src="https://www.cloudera.com/wp-content/uploads/2011/06/oozie.png" alt="Migrating to CDH from MySQL" /></p>
<p>Following this processing work, we expect to modify our crawling to harvest directly into <a target="_about" href="http://hbase.apache.org/">HBase</a>. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of <a target="_about" href="http://hbaseblog.com/2010/11/30/hbase-coprocessors/">coprocessors to HBase</a> is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.</p>
<p>The combination of Hadoop, Oozie and Hive offers a framework that we anticipate will fit nicely with many of our data transformation tasks, and Sqoop and Hive have made the technologies far more accessible to our development team than was previously possible.</p>
<p>All GBIF source code is available under open source licensing, and this work is regularly blogged on the <a target="_about" href="http://gbif.blogspot.com/">GBIF Developer Blog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CDH 3 Demo VM installation on Mac OS X using VirtualBox</title>
		<link>http://www.cloudera.com/blog/2011/06/cloudera-distribution-including-apache-hadoop-3-demo-vm-installation-on-mac-os-x-using-virtualbox-cdh/</link>
		<comments>http://www.cloudera.com/blog/2011/06/cloudera-distribution-including-apache-hadoop-3-demo-vm-installation-on-mac-os-x-using-virtualbox-cdh/#comments</comments>
		<pubDate>Thu, 02 Jun 2011 13:00:53 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[distribution]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[cdh3 demo vm]]></category>
		<category><![CDATA[cdh3 vm installation]]></category>
		<category><![CDATA[hadoop installation on mac os x]]></category>
		<category><![CDATA[hadoop vm installation]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7981</guid>
		<description><![CDATA[The first task is to ensure that your system is up-to-date. This procedure has been tested on the following configuration: Fully up-to-date Snow Leopard 10.6.7 Update or install Oracle VM VirtualBox for Mac OS X to version 4.0.8 (Virtualbox 4.0.8-71778-OSX) Assumptions: The browser used is Safari. The Demo VM has been downloaded to the default [...]]]></description>
			<content:encoded><![CDATA[<h2>The first task is to ensure that your system is up-to-date.</h2>
<p><em>This procedure has been tested on the following configuration:</em></p>
<div style="margin-left:20px">
<ul>
<li>Fully up-to-date Snow Leopard 10.6.7</li>
<li>Update or install Oracle VM VirtualBox for Mac OS X to version 4.0.8 (Virtualbox 4.0.8-71778-OSX)</li>
</ul>
</div>
<h2>Assumptions:</h2>
<div style="margin-left:20px">
<ul>
<li>The browser used is Safari.</li>
<li>The Demo VM has been downloaded to the default download location for Safari (i.e. the &#8220;Downloads&#8221; folder within the user&#8217;s home directory).</li>
<li>The Demo VM will be run from the Downloads folder.</li>
</ul>
</div>
<hr />
<p><strong>Step 1:</strong> Download the Cloudera demo virtual machine from the &#8220;Downloads&#8221; area of the Cloudera web site.</p>
<p align="middle"><img src="https://www.cloudera.com/wp-content/uploads/2011/06/1-blog-image.png" alt="Downloading Cloudera's Hadoop Demo VM Screenshot" width="540px" /></p>
<p>Click on the &#8220;Download&#8221; link beside the &#8220;Virtual Machine&#8221; section.</p>
<p>The file should automatically start to download. It will be saved to your &#8220;Downloads&#8221; folder.</p>
<p><strong>Step 2:</strong> Once the file has downloaded, you need to decompress it. The virtual machine is packaged as a bz2 archive (that is, a bzip2-compressed archive) and can be decompressed from the Finder.</p>
<p>Navigate to the &#8220;Downloads&#8221; folder, and either double-click on the file or right-click on the file and then select &#8220;Open With/Archive Utility (Default)&#8221;. This will start the decompression/unarchiving process.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/02-unzipping-archive.png" alt="Unzipping Cloudera's Hadoop Demo VM Archive Screenshot" /></p>
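<p>If you prefer to script this step, the following is a minimal sketch using Python&#8217;s standard <code>tarfile</code> module. It assumes the download is a tar archive compressed with bzip2, which would explain why unpacking yields a folder; the function name and the paths in the usage example are hypothetical placeholders, so substitute the name of the file you actually downloaded.</p>
<pre>
```python
import tarfile
from pathlib import Path


def extract_demo_vm(archive: Path, destination: Path) -> list:
    """Unpack a bzip2-compressed tar archive and return the names of its members.

    Only use this on archives from a source you trust, since extraction
    writes whatever paths the archive contains.
    """
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(destination)
    return names


if __name__ == "__main__":
    # Hypothetical file name; replace it with the archive in your Downloads folder.
    downloads = Path.home() / "Downloads"
    print(extract_demo_vm(downloads / "demo-vm.tar.bz2", downloads))
```
</pre>
<p>The result should be the same folder that Archive Utility produces.</p>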
<p><strong>Step 3:</strong> When the archive utility has finished, there will be a folder named cloudera-demo-0.3.7, the contents of which are shown below:</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/03-folder-contents.png" alt="Folder Contents Screenshot" /></p>
<p><strong>Step 4:</strong> Start VirtualBox and create a new VM by clicking the &#8220;New&#8221; button. The following is the first dialog for the new virtual machine.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/4-New-VM-Welcome-Screen.png" alt="New VM (Virtual Machine) Welcome Screen" /></p>
<p><strong>Step 5:</strong> Once you click on the &#8220;Continue&#8221; button, you are presented with the following dialog.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/5-VM-Name-and-OS-Capture.png" alt="New Virtual Machine Name and Operating System Capture Screenshot" /></p>
<p><strong>Step 6:</strong> Give the new virtual machine a name; in this example we&#8217;ll be using &#8220;Cloudera-CDH3&#8221;.</p>
<p>For Operating System and Version select &#8220;Linux&#8221; and &#8220;Ubuntu&#8221;.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/6-Name-new-virtual-machine.png" alt="Name New Virtual Machine" /></p>
<p><strong>Step 7:</strong> Increase the base memory to 1024 MB (if possible) for better performance.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/7-Increase-the-base-memory.png" alt="Increase the Base Memory Screenshot" /></p>
<p><strong>Step 8:</strong> In this step we need to select the demo VM file we just downloaded and extracted. Click on the second radio button, &#8220;Use existing hard disk&#8221;, and then click on the file folder icon with the green ^ on the right-hand side of the drop-down.</p>
<p>Please note: the contents of the drop-down on your system will differ from those displayed here.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/8-Virtual-Hard-disk.png" alt="Cloudera's Demo Virtual Hard Disk Screenshot" /></p>
<p>Navigate to the folder where the demo VM is located; in this example it should be in the user&#8217;s &#8220;Downloads&#8221; folder and called &#8220;cloudera-demo-0.3.7&#8221;.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/8.1-Virtual-Hard-Disk-Selection.png" alt="Virtual Hard Disk Selection" /></p>
<p><strong>Step 9:</strong> Click the &#8220;Continue&#8221; button. The following dialog is displayed, showing a summary of the choices made so far.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/9-Summary-of-new-virtual-machine.png" alt="Summary of New Virtual Machine" /></p>
<p><strong>Step 10:</strong> Click the &#8220;Done&#8221; button to finish creating the virtual machine in VirtualBox 4. The following shows the newly created Cloudera-CDH3 VM in the VirtualBox manager screen.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/10-Finish-Creating-VM.png" alt="Finish Creating New Virtual Machine Screenshot" /></p>
<p><strong>Step 11:</strong> Start the virtual machine. When the Demo VM launches you should be presented with the following login.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/12-vm-login.png" alt="Virtual Machine Login" /></p>
<p><strong>Step 12:</strong> Finally, once you have successfully logged into the Demo VM, the image below is the initial view that you should see.</p>
<p align="middle"><img width="540px" src="https://www.cloudera.com/wp-content/uploads/2011/06/13-vm-running.png" alt="Virtual Machine Running" /></p>
<h3>About the author:</h3>
<p><em>John Zanchetta heads up an integration test team in the mobile telecommunications space and is experimenting with <abbr title="Cloudera's Distribution including Apache Hadoop">CDH</abbr> from a number of different perspectives. He has been hacking in various technology pools for the past 20+ years. John can be contacted at johnzan at gmail dot com.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/06/cloudera-distribution-including-apache-hadoop-3-demo-vm-installation-on-mac-os-x-using-virtualbox-cdh/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using Hadoop to Measure Influence</title>
		<link>http://www.cloudera.com/blog/2011/05/using-hadoop-to-measure-influence/</link>
		<comments>http://www.cloudera.com/blog/2011/05/using-hadoop-to-measure-influence/#comments</comments>
		<pubDate>Sun, 15 May 2011 13:00:14 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hadoop influence]]></category>
		<category><![CDATA[hadoop social media]]></category>
		<category><![CDATA[hadoop twitter]]></category>
		<category><![CDATA[hadoop use case]]></category>
		<category><![CDATA[klout]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7918</guid>
		<description><![CDATA[Background Klout&#8217;s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1000+. With each relationship comes a different source of data. This has [...]]]></description>
			<content:encoded><![CDATA[<h2>Background</h2>
<p><a href="http://www.klout.com" target="_about">Klout&#8217;s</a> goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.</p>
<p>When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who <em>acted</em> on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.</p>
<p>Measuring influence is a bit like trying to measure an emotion like hate or jealousy. It&#8217;s really hard and takes a boatload of data.</p>
<p>A huge part of what we do is develop machine learning models that make sense of this data. On top of that, there&#8217;s an endless amount of this data and we need a platform to ingest, prepare, and analyze it.</p>
<p>The two biggest platforms are Facebook and Twitter, but it hardly ends there when it comes to social media. There&#8217;s LinkedIn, Foursquare, Path, YouTube, Quora, and many others. This presents the challenge of creating models for each platform and building data analysis platforms that can handle unstructured data.</p>
<p>To handle this at Klout, we&#8217;ve turned to open source technologies such as Hadoop. Specifically, we turned to <a href="http://www.cloudera.com/hadoop/" target="_about">Cloudera&#8217;s CDH3 distribution</a> due to ease of installation and availability of enterprise support.</p>
<h2>Twitter Influence</h2>
<p>Twitter was the natural choice for our first network to analyze because of the open nature of its data and the simplicity of the actions you can take on Twitter, such as a mention or a retweet.</p>
<p>However, as our models matured, Twitter kept growing. As of this post, our Twitter cluster has the following stats:</p>
<div style="margin-left:20px">
<ul>
<li>75 million people scored daily</li>
<li>4 billion graph edges scored daily</li>
<li>48 million people are influenced by or influence an average of 27 people</li>
<li>We derive hundreds of thousands of different topics that 14 million users are influential on</li>
<li>On average, 5 topics per user are derived using NLP and semantic analysis</li>
<li>For topics, 3 months of mentions and retweets are analyzed, currently over 6 billion</li>
</ul>
</div>
<p><img src="https://www.cloudera.com/wp-content/uploads/2011/05/Klout-image-1.png" alt="Klout's Twitter Analytics" style="align:center" /></p>
<h2 style="text-align:center">Twitter Analytics Overview</h2>
<p>From the Twitter firehose, data is written to disk in buffered chunks. A MapReduce job sorts the firehose data into the different buckets needed for each of the workflows. These workflows serve different products, from bot and spam detection to scoring to topic extraction.</p>
<p>Many of our MapReduce jobs are written in Java, but we also rely on Pig Latin for some purposes, such as performing simple joins or computing population aggregates and statistics.</p>
<p>Oozie is used to coordinate the different workflow components. To serve data both internally and externally, we dump raw CSV files or load the data into HBase, which interfaces with load-balanced API servers.</p>
<p><img src="https://www.cloudera.com/wp-content/uploads/2011/05/Klout-image-2.png" alt="Klout's Twitter Scoring Workflow" style="align:center" /></p>
<h2 style="text-align:center">Twitter Scoring Workflow</h2>
<p>Our scoring approach is based on machine learning and statistics. The model currently has over 35 features. The scoring workflow consists of different Oozie jobs, many of which perform feature extraction. In the final jobs of this workflow, all the features are fed into the scoring model, which produces scores.</p>
<p>We&#8217;ve experimented with Mahout in the past and we will be using more of it in the future.</p>
<h2>Challenges</h2>
<p>Having a highly available API is one of our key goals. However, when we refresh 75 million scores plus metadata daily, it becomes challenging to flip a switch and make all the new data available. This led us to run multiple clusters. The load-balanced API servers are aware of each cluster&#8217;s status, so when one cluster is loading data they switch to the non-loading cluster. This setup also mitigates performance issues due to splits and minor and major compactions on the clusters, and it has allowed us to cope with instabilities caused by cloud instances in unpredictable states.</p>
<p>That said, we are in the process of building out our own servers and racks at a nearby facility. We&#8217;ve also had issues where the edit logs for our NameNodes were corrupted due to server instabilities. This is where Cloudera has come to our rescue. We initially had to manually apply patches and build hadoop-core jars ourselves to resolve such problems, but with <a href="http://www.cloudera.com/hadoop/" target="_about">Cloudera&#8217;s Distribution including Apache Hadoop</a> and the help of their expert Solution Architects, this is no longer an issue. We are now able to focus our resources on our products.</p>
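<p>The cluster switching described in this section can be sketched roughly as follows. This is a minimal illustration in Python, not Klout&#8217;s actual code: the <code>Cluster</code> class, the <code>loading</code> flag, the cluster names, and <code>pick_serving_cluster</code> are all hypothetical names chosen for the example.</p>

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical view of one serving cluster as seen by an API server."""
    name: str
    loading: bool  # True while the daily refresh is being written to it


def pick_serving_cluster(clusters):
    """Return the first cluster that is not currently loading data."""
    for cluster in clusters:
        if not cluster.loading:
            return cluster
    raise RuntimeError("every cluster is loading; nothing can serve reads")


# Assumed state: cluster A is receiving the daily refresh, B is idle.
clusters = [Cluster("hbase-a", loading=True), Cluster("hbase-b", loading=False)]
serving = pick_serving_cluster(clusters)
print(serving.name)
```

<p>In a setup like this, reads keep flowing to the idle cluster until the refreshed one finishes loading, at which point the roles can swap.</p>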
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/05/using-hadoop-to-measure-influence/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB</title>
		<link>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/</link>
		<comments>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/#comments</comments>
		<pubDate>Fri, 13 May 2011 18:26:13 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[guest]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7937</guid>
		<description><![CDATA[This is a guest repost from the DataXu blog. Click here to view the original post. I recently evaluated several serialization frameworks including Thrift, Protocol Buffers and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working [...]]]></description>
			<content:encoded><![CDATA[<p><i>This is a guest repost from the <a href="http://www.dataxu.com/" target="_about">DataXu</a> blog. Click <a href="http://www.dataxu.com/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/" target="_about">here</a> to view the original post.</i></p>
<p>I recently evaluated several serialization frameworks, including <a href="http://thrift.apache.org/" target="_about">Thrift</a>, <a href="http://code.google.com/p/protobuf/" target="_about">Protocol Buffers</a>, and <a href="http://avro.apache.org/" target="_about">Avro</a>, for a solution to address our needs as a demand-side platform, and also for a protocol framework to use for the <a href="http://openrtb.info/" target="_about">OpenRTB</a> marketplace. The working draft of OpenRTB 2.0 uses simple <a href="http://www.json.org/" target="_about">JSON</a> encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.</p>
<p>After reviewing many candidates, <a href="http://avro.apache.org/docs/current/" target="_about">Apache Avro</a> proved to be the best solution.</p>
<p><a href="http://avro.apache.org/" target="_about"><img src="https://www.cloudera.com/wp-content/uploads/2011/05/Avro-Image.png" style="float:right;margin-left:8px" alt="Apache Avro" /></a></p>
<p>To demonstrate what differentiates Avro from the other frameworks (the link to my source code is at the end of this post), I put together a quick test of key features. The following are the key advantages of Avro 1.5:</p>
<p>* <strong>Schema evolution</strong> &#8211; Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields.</p>
<p>* <strong>Untagged data</strong> &#8211; Providing a schema with binary data allows each datum to be written without per-value overhead. The result is more compact data encoding, and faster data processing.</p>
<p>* <strong>Dynamic typing</strong> &#8211; This refers to serialization and deserialization without code generation. It complements the code generation, which is available in Avro for statically typed languages as an optional optimization.</p>
<h2>Schema Evolution</h2>
<p>This is the most exciting feature! It allows for building more loosely coupled and more robust systems. Below, I made significant changes to the schema, and things still work fine. This flexibility is a very interesting feature for rapidly evolving protocols like OpenRTB.</p>
<p>The following example demonstrates how this works. First, I created a new (example) schema. (Avro schemas are defined in JSON):</p>
<p><pre class="code">{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee","null"]}
    ]
}</pre>
</p>
<p>Next, I serialized a few records into a binary file using that schema. After that, I evolved my schema to the following:</p>
<p><pre class="code">{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "yrs", "type": "int", "aliases": ["age"]},
        {"name": "gender", "type": "string", "default":"unknown"},
        {"name": "emails", "type": {"type": "array", "items": "string"}}
    ]
}</pre>
</p>
<p>This is a snapshot of the changes I made to the schema:</p>
<p>1) Renamed the field &#8216;age&#8217; to &#8216;yrs&#8217;. Thanks to the alias feature, I can retrieve the value of &#8216;age&#8217; by using the field name &#8216;yrs&#8217;.</p>
<p>2) Added a new &#8216;gender&#8217; field, and defined a default value for it. The default is used to fill in the value during deserialization, because records written with the original schema don&#8217;t contain this field.</p>
<p>3) Removed the &#8216;boss&#8217; field.</p>
<p>Finally, I deserialized the binary data file with this new schema, and printed it out. Success!</p>
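<p>The three changes above can be mimicked in a few lines of plain Python. This is only a simplified sketch of the resolution idea (match by name or alias, fall back to a default, drop fields the reader doesn&#8217;t know). It does not use the Avro library, it ignores type checking and nested records, and <code>resolve_record</code> and the sample record are hypothetical.</p>

```python
# Simplified illustration of Avro-style record resolution; not the Avro library.
READER_SCHEMA = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "yrs", "type": "int", "aliases": ["age"]},
        {"name": "gender", "type": "string", "default": "unknown"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
    ],
}


def resolve_record(written, reader_schema):
    """Project a record written with the old schema onto the reader schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        # A reader field matches a written field by its name or any alias.
        candidates = [field["name"]] + field.get("aliases", [])
        source = next((c for c in candidates if c in written), None)
        if source is not None:
            resolved[field["name"]] = written[source]
        elif "default" in field:
            # Field is new in the reader schema: use its declared default.
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for field " + field["name"])
    # Written fields with no reader counterpart (here 'boss') are dropped.
    return resolved


# Hypothetical record as it would have been written with the original schema.
old_record = {"name": "Ann", "age": 41, "emails": ["ann@example.com"], "boss": None}
new_record = resolve_record(old_record, READER_SCHEMA)
print(new_record)
```

<p>Here the value written as &#8216;age&#8217; comes back under &#8216;yrs&#8217;, &#8216;gender&#8217; is filled with its default, and &#8216;boss&#8217; disappears.</p>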
<h2>Untagged Data</h2>
<p>There are two ways to encode data when serializing with Avro: binary or JSON. In the binary file, the schema is included at the beginning of the file. I verified that the binary data was serialized untagged, which resulted in a smaller footprint. Another interesting point is that the schema can be defined, and then the data can be encoded/decoded in JSON, allowing you to define a schema for rich JSON data structures. Anyone needing to implement validation for a JSON protocol (like we did for OpenRTB) will appreciate this feature. And switching between binary and JSON encoding is simply a one-line code change. Switching a JSON protocol to a binary format in order to achieve better performance is pretty straightforward with Avro.</p>
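<p>To make &#8220;untagged&#8221; concrete, here is a small hand-rolled sketch in Python of how Avro&#8217;s binary encoding lays out a record: integers are zig-zag encoded as variable-length bytes, strings are a length followed by UTF-8 bytes, and record fields are simply concatenated in schema order with no field names or tags. It is an illustration, not the Avro library. It assumes a reduced two-field record (name, age) rather than the full Employee schema, and it assumes values fit in 64 bits.</p>

```python
import json


def encode_long(n):
    """Avro int/long encoding: zig-zag, then 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string encoding: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_employee(name, age):
    """Fields are written back to back in schema order, with no names or tags."""
    return encode_string(name) + encode_long(age)


binary = encode_employee("Ann", 41)
json_text = json.dumps({"name": "Ann", "age": 41})
print(len(binary), "bytes untagged vs", len(json_text), "bytes as JSON")
```

<p>The reader can only make sense of these bytes because it has the schema, which is exactly the trade-off the untagged format makes.</p>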
<h2>Dynamic Typing</h2>
<p>The key abstraction is GenericData.Record. This is essentially a set of name-value pairs where name is the field name, and value is one of the Avro supported value types. I found the dynamic typing to be very easy to use. When a generic record is instantiated, you have to provide a JSON-encoded schema definition. To access the fields, just use put/get methods like you would with any map. This approach is referred to as &#8220;generic&#8221; in Avro, in contrast to the &#8220;static&#8221; code generation approach also supported by Avro. The extra flexibility of the generic data handling has performance implications. But, this excellent benchmark &#8211; <a href="https://github.com/eishay/jvm-serializers/wiki/" target="_about">https://github.com/eishay/jvm-serializers/wiki/</a> &#8211; shows the penalty is minor, and the benefit is a simplified code base.</p>
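<p>As a rough analogy for the generic approach, the following plain-Python sketch shows a map-like record whose allowed fields come from a JSON schema at run time rather than from generated classes. It is not the Avro API: the <code>GenericRecord</code> class here is a hypothetical stand-in, it handles only the <code>string</code> and <code>int</code> primitives, and the two-field schema is a reduced version of the Employee example.</p>

```python
import json

SCHEMA_JSON = """
{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}
"""

# Python types accepted for the two primitive types used in this sketch.
PRIMITIVES = {"string": str, "int": int}


class GenericRecord:
    """Hypothetical stand-in for a schema-driven, map-like record."""

    def __init__(self, schema_json):
        self.schema = json.loads(schema_json)
        self._types = {f["name"]: f["type"] for f in self.schema["fields"]}
        self._values = {}

    def put(self, name, value):
        # Only fields declared in the schema may be set.
        if name not in self._types:
            raise KeyError("field not in schema: " + name)
        if not isinstance(value, PRIMITIVES[self._types[name]]):
            raise TypeError("wrong type for field " + name)
        self._values[name] = value

    def get(self, name):
        return self._values[name]


employee = GenericRecord(SCHEMA_JSON)
employee.put("name", "Ann")
employee.put("age", 41)
print(employee.get("name"), employee.get("age"))
```

<p>No Employee class is generated anywhere; changing the schema text is enough to change which fields the record accepts.</p>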
<p>In conclusion, Avro is a unique serialization framework that works, although it took a bit of experimentation to get the code working. If you are interested in my Java code for an example of how Avro can be used, you can find it here: <a href="https://github.com/rfoldes/Avro-Test" target="_about">https://github.com/rfoldes/Avro-Test</a>.</p>
<p>Robert Foldes</p>
<p>Senior Architect, <a href="http://www.dataxu.com/" target="_about">DataXu</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

