<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; Avro</title>
	<atom:link href="http://www.cloudera.com/blog/category/avro/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Apache Avro at RichRelevance</title>
		<link>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/</link>
		<comments>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 13:00:40 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[Apache Avro]]></category>
		<category><![CDATA[Guest Post]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10068</guid>
		<description><![CDATA[This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey. In early 2010 at RichRelevance, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.</em></p>
<p>In early 2010 at <a href="http://www.richrelevance.com/" target="_blank">RichRelevance</a>, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time.  We had been using Hadoop for about a year, and started with the basics &#8211; text formats and SequenceFiles.  Neither of these was sufficient.  Text formats are not compact enough, and can be painful to maintain over time.  A basic binary format may be more compact, but it has the same maintenance issues as text.  Furthermore, we needed rich data types including lists and nested records.</p>
<p>After analysis similar to <a href="http://www.cloudera.com/blog/2011/07/avro-data-interop/" target="_blank">Doug Cutting&#8217;s blog post</a>, we chose <a href="http://avro.apache.org/" target="_blank">Apache Avro</a>.  As a result we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs.  On Cyber Monday 2011, we logged 343 million page view events, and nearly 100 million other events into Avro data files.</p>
<h2>Avoiding Version Management Baggage</h2>
<p>Have you ever seen code for manual serialization version management like the below?</p>
<pre class="code">
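// Read the version tag first; every later read depends on it.
int version = input.readInt();
this.name = input.readString();
this.age = input.readInt();
// favoriteColor was added in version 2; older data falls back to a default.
if (version >= 2) {
  this.favoriteColor = input.readString();
} else {
  this.favoriteColor = "";
}</pre>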
<p>Manual version management is painful.  If you evolve what data you store continuously, it does not take long to end up with dozens of versions. In order to read every version that has been written and stored, your code has to carry a lot of baggage. </p>
<p>With Avro, you can avoid writing code like the above.  The concept is simple.  Store the schema used to write your data along with your data, and use it to make the written data conform to the schema that the reader expects.  If a field is missing, use the default.  If a field has been removed or moved, handle it.</p>
<p>Over the last two years, we have doubled the complexity of our page view schema across about 15 schema versions.  There is not one line of code that deals with version management, and the current code can read any of the data written over that time.  What may be a surprise to some is that our old code can read newly written data as well.  The data is both forward and backward compatible, within the rules in the <a href="http://avro.apache.org/docs/current/spec.html#Schema+Resolution" target="_blank">Avro Specification</a>.</p>
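<p>The idea can be sketched in plain Java. This is a toy model and not Avro's API: the Field class and the resolve method are hypothetical names, records are modeled as maps keyed by field name, and every reader field is assumed to declare a default.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of reader/writer schema resolution; this is not Avro's API. */
final class SchemaResolutionSketch {

  /** A reader-side field: its name and the default used when the writer lacks it. */
  static final class Field {
    final String name;
    final Object defaultValue;

    Field(String name, Object defaultValue) {
      this.name = name;
      this.defaultValue = defaultValue;
    }
  }

  /**
   * Projects a record, keyed by the writer's field names, onto the reader's fields.
   * Fields the writer did not know about take the reader's default; fields the
   * reader no longer declares are dropped.
   */
  static Map<String, Object> resolve(Map<String, Object> written, List<Field> readerFields) {
    Map<String, Object> resolved = new LinkedHashMap<>();
    for (Field field : readerFields) {
      if (written.containsKey(field.name)) {
        resolved.put(field.name, written.get(field.name));
      } else {
        resolved.put(field.name, field.defaultValue);
      }
    }
    return resolved;
  }

  public static void main(String[] args) {
    // Reader schema at "version 2": favoriteColor was added with a default of "".
    List<Field> reader = new ArrayList<>();
    reader.add(new Field("name", ""));
    reader.add(new Field("age", 0));
    reader.add(new Field("favoriteColor", ""));

    // A record written with the older schema, which had no favoriteColor field.
    Map<String, Object> oldRecord = new LinkedHashMap<>();
    oldRecord.put("name", "Ada");
    oldRecord.put("age", 36);

    System.out.println(resolve(oldRecord, reader));
  }
}
```

<p>The version check from the manual example disappears: the reader states what it expects, and the difference between the two schemas decides which values are copied, defaulted, or dropped.</p>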
<h2>Leveraging Complex Data Types</h2>
<p>Avro supports complex data types such as arrays, maps, enumerations, and nested records.  The Avro data model makes it possible to serialize any non-recursive data structure, including trees and heterogeneous lists.  We use this property to describe our events using Avro Schemas that map to natural object representations on our front end servers.  For example, one of the elements in a page view is an array of product recommendation sets, each set containing a list of products displayed.  Another element in a page view is what we call a page context &#8211; each type of page on a merchant&#8217;s site has a unique context that differs from other page types.  A product page context is the product being displayed.  A search page context is the search terms in the search query.  There are about 30 page context types, and we represent the range of page context possibilities using an Avro union, so that all of these different event variations can be written in the same format and to the same log.</p>
<p>With a simpler data model, one might have had to log each context type separately, making it harder to get a full picture of what happened in a single request during analysis.</p>
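<p>On the object side, a union of page contexts maps naturally onto one interface with several implementations. The sketch below is illustrative only: the class names and fields are assumptions, not our actual schema, and only two of the roughly 30 context types are shown.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of one event type whose context field is a union of several shapes. */
final class PageContextUnionSketch {

  /** Stand-in for the union: every page context type implements this. */
  interface PageContext {
    String describe();
  }

  /** Context of a product page: the product being displayed. */
  static final class ProductContext implements PageContext {
    final String productId;
    ProductContext(String productId) { this.productId = productId; }
    @Override public String describe() { return "product:" + productId; }
  }

  /** Context of a search page: the terms in the search query. */
  static final class SearchContext implements PageContext {
    final List<String> terms;
    SearchContext(List<String> terms) { this.terms = terms; }
    @Override public String describe() { return "search:" + String.join(" ", terms); }
  }

  /** One page view event; the same record type carries any context variant. */
  static final class PageView {
    final String url;
    final PageContext context;
    PageView(String url, PageContext context) { this.url = url; this.context = context; }
  }

  public static void main(String[] args) {
    // Both variants go into the same log because they share one event type.
    List<PageView> log = new ArrayList<>();
    log.add(new PageView("/p/42", new ProductContext("42")));
    log.add(new PageView("/search", new SearchContext(Arrays.asList("red", "shoes"))));
    for (PageView view : log) {
      System.out.println(view.url + " -> " + view.context.describe());
    }
  }
}
```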
<h2>A New Vision for an Event Log</h2>
<p>With the above properties of Avro, we were able to formulate a new vision for what an event log should be. The new model has the following properties:</p>
<ul>
<li>A single HTTP request creates a single, atomic log event defined by an Avro schema</li>
<li>The event contains all of the resolved request inputs</li>
<li>The event contains the result of any decisions made during the request</li>
</ul>
<p>Together, these imply that it is never necessary to join different sets of data together to reconstruct what happened in an individual request during analysis.  This also significantly reduces the value of data contained in raw HTTP logs, since the Avro-based logs become the origin for all major processing.  Because raw HTTP logs are much larger than compressed binary structured data, this substantially reduces the amount of data we must keep for long periods of time.</p>
<h2>More Avro at RichRelevance</h2>
<p>We have built Hive and Pig adapters to map our Avro data into these tools for ad-hoc queries and automated tasks.  Additionally, we leverage the same Avro schemas from our log files to store click streams in HBase.  We also use Avro to store data compactly in key-value stores. </p>
<p>The log file example is what I call a schema-first use case of Avro, where we define a schema for log events that can be used across different systems over a long period of time.  An alternative usage style is what I call code-first, where you start with code and bind that to a serialization with a less schema-centric view.  I feel that the code-first usage style is more applicable for data that lives for short or medium time scales, such as with RPC or MapReduce intermediates.  We will be deepening our investment in Avro and using it with code-first use cases in the future, in the process working with the community to improve the developer experience for those use cases.</p>
<p>Avro is a growing, evolving project that I see as broader than a serialization framework.  At heart Avro is about applying a schema to data, in order to manipulate that data in well-defined ways.  Serialization, validation, and transformation are only some of the operations you can apply to data that conforms to a known schema.  Over time the project will grow to have more and more functionality centered around operations you can apply to data that conforms to an Avro Schema.  I look forward to working with the Avro community as the project continues to evolve!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Flume &#8211; Architecture of Flume NG</title>
		<link>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/</link>
		<comments>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/#comments</comments>
		<pubDate>Fri, 09 Dec 2011 19:22:27 +0000</pubDate>
		<dc:creator>Arvind Prabhakar</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[flume-ng]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9884</guid>
		<description><![CDATA[This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can [...]]]></description>
			<content:encoded><![CDATA[<p><em>This blog was originally posted on the Apache Blog: <a href="https://blogs.apache.org/flume/entry/flume_ng_architecture" target="_blank">https://blogs.apache.org/flume/entry/flume_ng_architecture</a></em></p>
<p>Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at <a href="http://incubator.apache.org/flume">http://incubator.apache.org/flume</a>. <em>Flume NG</em> is the work on a new major revision of Flume and is the subject of this post.</p>
<p>Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume was adopted more widely, it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue <a href="https://issues.apache.org/jira/browse/FLUME-728">FLUME-728</a>. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing, Flume NG has gone through two internal milestones &#8211; <em>NG Alpha 1</em> and <em>NG Alpha 2</em> &#8211; and a formal incubator release of Flume NG is in the works.</p>
<p>At a high level, Flume NG uses single-hop message delivery guarantees to provide end-to-end reliability for the system. To accomplish this, certain new concepts have been incorporated into its design, while certain other existing concepts have been either redefined, reused or dropped completely.</p>
<p>In this blog post, I will describe the fundamental concepts incorporated in Flume NG and talk about its high-level architecture. This is the first in a series of blog posts by the Flume team that will go into further detail on its design and implementation.</p>
<h2>Core Concepts</h2>
<p>The purpose of Flume is to provide a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The architecture of Flume NG is based on a few concepts that together help achieve this objective. Some of these concepts existed in the earlier implementation, but have changed drastically. Here is a summary of the concepts that Flume NG introduces, redefines, or reuses from the earlier implementation:</p>
<ul>
<li><strong>Event:</strong> A byte payload with optional string headers that represents the unit of data that Flume can transport from its point of origin to its final destination.</li>
<li><strong>Flow:</strong> Movement of events from the point of origin to their final destination is considered a data flow, or simply flow. This is not a rigorous definition and is used only at a high level for description purposes. </li>
<li><strong>Client:</strong> An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. For example, Flume Log4j Appender is a client.</li>
<li><strong>Agent: </strong>An independent process that hosts Flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination. </li>
<li><strong>Source:</strong> An interface implementation that can consume events delivered to it via a specific mechanism. For example, an Avro source is a source implementation that can be used to receive Avro events from clients or other agents in the flow. When a source receives an event, it hands it over to one or more channels.</li>
<li><strong>Channel:</strong> A transient store for events, where events are delivered to the channel via sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport. An example of a channel is the JDBC channel, which uses a file-system backed embedded database to persist the events until they are removed by a sink. Channels play an important role in ensuring durability of the flows.</li>
<li><strong>Sink: </strong>An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event&#8217;s final destination. Sinks that transmit the event to its final destination are also known as terminal sinks. The Flume HDFS sink is an example of a terminal sink, whereas the Flume Avro sink is an example of a regular sink that can transmit messages to other agents that are running an Avro source.</li>
</ul>
<p>These concepts help in simplifying the architecture, implementation, configuration and deployment of Flume.</p>
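<p>To make the relationships between these concepts concrete, here is a minimal in-memory sketch in Java. The types below are hypothetical stand-ins named after the concepts and are not the Flume NG interfaces; the channel is a plain queue with no persistence, and the sink's destination is just a list.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical types mirroring the concepts above; not the Flume NG API. */
final class FlumeConceptsSketch {

  /** Event: a byte payload with optional string headers. */
  static final class Event {
    final Map<String, String> headers;
    final byte[] body;
    Event(Map<String, String> headers, byte[] body) {
      this.headers = headers;
      this.body = body;
    }
  }

  /** Channel: holds events until a sink removes them (in memory here). */
  static final class Channel {
    private final Queue<Event> events = new ArrayDeque<>();
    void put(Event event) { events.add(event); }
    Event take() { return events.poll(); }  // null when the channel is empty
    int size() { return events.size(); }
  }

  /** Source: receives events and hands each one to one or more channels. */
  static final class Source {
    private final List<Channel> channels;
    Source(List<Channel> channels) { this.channels = channels; }
    void receive(Event event) {
      for (Channel channel : channels) {
        channel.put(event);
      }
    }
  }

  /** Sink: drains a channel and delivers events to a destination. */
  static final class Sink {
    private final Channel channel;
    final List<String> delivered = new ArrayList<>();  // stands in for the destination
    Sink(Channel channel) { this.channel = channel; }
    void drain() {
      Event event;
      while ((event = channel.take()) != null) {
        delivered.add(new String(event.body, StandardCharsets.UTF_8));
      }
    }
  }

  public static void main(String[] args) {
    Channel channel = new Channel();
    Source source = new Source(Collections.singletonList(channel));
    Sink sink = new Sink(channel);

    // A client would deliver this event to the agent's source.
    source.receive(new Event(Collections.<String, String>emptyMap(),
        "GET /index.html".getBytes(StandardCharsets.UTF_8)));
    sink.drain();
    System.out.println(sink.delivered);
  }
}
```

<p>Passing more than one channel to the source in this sketch delivers every event to each of them, which is the "one or more channels" behavior described for sources.</p>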
<h2>Flow Pipeline</h2>
<p>A flow in Flume NG starts from the client. The client transmits the event to its next-hop destination. This destination is an agent. More precisely, the destination is a source operating within the agent. The source receiving this event will then deliver it to one or more channels. The channels that receive the event are drained by one or more sinks operating within the same agent. If the sink is a regular sink, it will forward the event to its next-hop destination, which will be another agent. If instead it is a terminal sink, it will forward the event to its final destination. Channels allow the decoupling of sources from sinks using the familiar producer-consumer model of data exchange. This allows sources and sinks to have different performance and runtime characteristics and yet be able to effectively use the physical resources available to the system.</p>
<p>Figure 1 below shows how the various components interact with each other within a flow pipeline.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels." /></p>
<p style="text-align: center"><strong><em>Figure 1:</em></strong><em> Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels.</em></p>
<p>By configuring a source to deliver the event to more than one channel, flows can fan-out to more than one destination. This is illustrated in Figure 1 where the source within the operating Agent writes the event out to two channels &#8211; Channel 1 and Channel 2.</p>
<p>Conversely, flows can be converged by having multiple sources operating within the same agent write to the same channel. An example of the physical layout of a converging flow is shown in Figure 2 below.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/268bf8db-43c7-497b-a0ef-63c482371eef" alt="A simple converging flow on Flume NG." width="500" height="343" /></p>
<p style="text-align: center"><em><strong>Figure 2:</strong> A simple converging flow on Flume NG.</em></p>
<h2>Reliability and Failure Handling</h2>
<p>Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. In order for the sending agent to commit its transaction, it must receive a success indication from the receiving agent. The receiving agent only returns a success indication if its own transaction commits properly first. This ensures guaranteed delivery semantics between the hops that the flow makes. Figure 3 below shows a sequence diagram that illustrates the relative scope and duration of the transactions operating within the two interacting agents.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/a15d9347-da9e-4824-b45f-6c00f0720590" alt="Transactional exchange of events between agents." width="500" height="329" /></p>
<p style="text-align: center"><em><strong>Figure 3:</strong> Transactional exchange of events between agents.</em></p>
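<p>The ordering of the two commits can be sketched as follows. This is a simplified, single-threaded illustration and not Flume NG's transaction API: the Sender and Receiver classes are hypothetical, and a full receiving channel stands in for any reason the receiver's transaction might fail.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the per-hop hand-off described above; not Flume NG's transaction API. */
final class HopTransactionSketch {

  /** Receiving agent: reports success only after its own transaction commits. */
  static final class Receiver {
    private final Deque<String> channel = new ArrayDeque<>();
    private final int capacity;

    Receiver(int capacity) { this.capacity = capacity; }

    boolean accept(String event) {
      if (channel.size() >= capacity) {
        return false;            // receiver's transaction rolls back: nothing stored
      }
      channel.addLast(event);    // receiver's transaction commits
      return true;               // success indication sent only after the commit
    }

    int stored() { return channel.size(); }
  }

  /** Sending agent: removes an event from its channel only on confirmed delivery. */
  static final class Sender {
    private final Deque<String> channel = new ArrayDeque<>();

    void put(String event) { channel.addLast(event); }

    /** Tries to forward the oldest event; returns true if the hop committed. */
    boolean forwardOne(Receiver next) {
      String event = channel.peekFirst();   // sender's transaction begins
      if (event == null) {
        return false;                       // nothing to send
      }
      if (next.accept(event)) {
        channel.removeFirst();              // commit: the event has left this agent
        return true;
      }
      return false;                         // rollback: the event stays for a retry
    }

    int pending() { return channel.size(); }
  }

  public static void main(String[] args) {
    Sender sender = new Sender();
    Receiver receiver = new Receiver(1);    // room for a single event
    sender.put("event-1");
    sender.put("event-2");

    System.out.println(sender.forwardOne(receiver));  // receiver has room
    System.out.println(sender.forwardOne(receiver));  // receiver is full
    System.out.println(sender.pending() + " pending, " + receiver.stored() + " stored");
  }
}
```

<p>Because the sender removes an event only after the receiver confirms its own commit, a failed hop leaves the event in the sender's channel, so it is never lost between the two agents.</p>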
<p>This mechanism also forms the basis for failure handling in Flume NG. When a flow that passes through many different agents encounters a communication failure on any leg of the flow, the affected events start getting buffered at the last unaffected agent in the flow. If the failure is not resolved in time, this may lead to the failure of the last unaffected agent, which then would force the agent before it to start buffering the events. Eventually, if the failure occurs when the client transmits the event to its first-hop destination, the failure will be reported back to the client, which can then allow the application generating the events to take appropriate action.</p>
<p>On the other hand, if the failure is resolved before the first-hop agent fails, the buffered events in various agents downstream will start draining towards their destination. Eventually the flow will be restored to its original characteristic throughput levels. Figure 4 below illustrates a scenario where a flow comprising two intermediary agents between the client and the central store goes through a transient failure. The failure occurs between agent 2 and the central store, resulting in the events getting buffered at agent 2 itself. Once the failing link has been restored to normal, the buffered events drain out to the central store and the flow is restored to its original throughput characteristics.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/ac9d1c83-1089-4730-9546-fe8de509b34c" alt="Failure handling in flows. " width="500" height="352" /></p>
<p style="text-align: center"><em><strong>Figure 4: </strong>Failure handling in flows. In (a) the flow is normal and events can travel from the client to the central store. In (b) a communication failure occurs between Agent 2 and the event store resulting in events being buffered on Agent 2. In (c) the cause of failure was addressed and the flow was restored and any events buffered in Agent 2 were drained to the store.</em></p>
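<p>The three stages in Figure 4 can be walked through with a small sketch. The Agent and Store classes here are hypothetical, the link failure is simulated with a flag, and the agent's channel is unbounded, so the case where the buffering agent itself fails is not modeled.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch of buffering during a link failure and draining after recovery. */
final class FailureBufferingSketch {

  /** Central store reached over a link that can be switched off to simulate failure. */
  static final class Store {
    boolean linkUp = true;
    final List<String> events = new ArrayList<>();

    boolean write(String event) {
      if (!linkUp) {
        return false;          // delivery fails; the caller must keep the event
      }
      events.add(event);
      return true;
    }
  }

  /** Last agent before the store; its channel buffers events that cannot be delivered. */
  static final class Agent {
    private final Deque<String> channel = new ArrayDeque<>();
    private final Store store;

    Agent(Store store) { this.store = store; }

    void receive(String event) {
      channel.addLast(event);
      drain();
    }

    /** Delivers buffered events in order, stopping at the first failed delivery. */
    void drain() {
      while (!channel.isEmpty() && store.write(channel.peekFirst())) {
        channel.removeFirst();
      }
    }

    int buffered() { return channel.size(); }
  }

  public static void main(String[] args) {
    Store store = new Store();
    Agent agent2 = new Agent(store);

    agent2.receive("a");                 // (a) normal flow
    store.linkUp = false;
    agent2.receive("b");                 // (b) link down: events buffer on agent 2
    agent2.receive("c");
    System.out.println("buffered=" + agent2.buffered() + " stored=" + store.events);

    store.linkUp = true;                 // (c) link restored
    agent2.drain();
    System.out.println("buffered=" + agent2.buffered() + " stored=" + store.events);
  }
}
```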
<h2>Wrapping up</h2>
<p>In this post I described the various concepts that are a part of Flume NG and its high-level architecture. This is the first of a series of posts from the Flume team that will highlight the design and implementation of this system. In the meantime, if you need any more information, please feel free to drop an email on the project&#8217;s user or developer lists, or alternatively file the appropriate JIRA issues. Your contribution in any form is welcome on the project.</p>
<h2>Links:</h2>
<p>Project Website: <a target="_blank" href="http://incubator.apache.org/flume/">http://incubator.apache.org/flume/</a></p>
<p>Flume NG Getting Started Guide: <a target="_blank" href="https://cwiki.apache.org/confluence/display/FLUME/Getting+Started">https://cwiki.apache.org/confluence/display/FLUME/Getting+Started</a></p>
<p>Mailing Lists: <a target="_blank" href="http://incubator.apache.org/flume/mail-lists.html">http://incubator.apache.org/flume/mail-lists.html</a></p>
<p>Issue Tracking: <a target="_blank" href="https://issues.apache.org/jira/browse/FLUME">https://issues.apache.org/jira/browse/FLUME</a></p>
<p>IRC Channel: #flume on irc.freenode.net</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011: A Glimpse into Development</title>
		<link>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/</link>
		<comments>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 13:00:42 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[careers]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[Cloudera's Service and Configuration Manager]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop event]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9240</guid>
		<description><![CDATA[The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hadoopworld.com/"><img style="float: left; padding-right: 20px;" title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" /></a></p>
<p>The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.</p>
<h2 style="font-size: 14pt; color: #344152;"><a href="http://www.hadoopworld.com/tracks/development-developers/" target="_blank">Preview of Development Track Sessions</a></h2>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Building Web Analytics Processing on Hadoop at CBS Interactive</span></a><br />
 <em>Michael Sun, CBS Interactive</em></p>
<p><strong>Abstract:</strong> CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack&#8212;the Extraction, Transformation and Loading framework we built based on Python and streaming, which is under review for open-source release&#8212;Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance and scalability, and a significant reduction in the processing time needed to reach SLA (over six hours so far).</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Gateway: Cluster Virtualization Framework</span></a><br />
<em>Konstantin Shvachko, eBay</em></p>
<p><strong>Abstract:</strong> Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks &#8212; as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users&#8217; workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">SHERPASURFING &#8211; Open Source Cyber Security Solution</span></a><br />
<em>Wayne Wheeles, Novii Design</em></p>
<p><strong>Abstract:</strong> Every day billions of packets, some benign and some malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Cloudera&#8217;s Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission at scale and at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools</span></a><br />
<em>Arvind Prabhakar, Cloudera<br />
Guy Harrison, Quest Software</em></p>
<p><strong>Abstract:</strong> As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We&#8217;ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we&#8217;ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Next Generation Apache Hadoop MapReduce</span></a><br />
<em>Mahadev Konar, Hortonworks</em></p>
<p><strong>Abstract:</strong> The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of MapReduce and will delve into the details of the architecture that make it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1. </p>
<p><a href="http://www.hadoopworld.com/"><img title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/12/registernow.gif" alt="Register for Hadoop World" /></a></p>
<p>There are several <a href="http://www.hadoopworld.com/training/">training classes</a> and <a href="http://www.hadoopworld.com/training/">certification sessions</a> provided surrounding the Hadoop World conference. Don&#8217;t forget to register and become <a href="http://www.hadoopworld.com/training/">Cloudera Certified in Apache Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing Crunch: Easy MapReduce Pipelines for Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/10/introducing-crunch/</link>
		<comments>http://www.cloudera.com/blog/2011/10/introducing-crunch/#comments</comments>
		<pubDate>Mon, 10 Oct 2011 17:05:44 +0000</pubDate>
		<dc:creator>Josh Wills</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9165</guid>
		<description><![CDATA[As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, [...]]]></description>
			<content:encoded><![CDATA[<p>As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like <a href="http://pig.apache.org/" target="_blank">Pig</a> and <a href="http://hive.apache.org/" target="_blank">Hive</a> for their convenient and powerful support for creating pipelines over structured and semi-structured records.</p>
<p>As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.</p>
<p>Today, we&#8217;re pleased to introduce <a href="http://github.com/cloudera/crunch" target="_blank">Crunch</a>, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch&#8217;s design is modeled after <a href="http://dl.acm.org/citation.cfm?id=1806638" target="_blank">Google&#8217;s FlumeJava</a>, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.</p>
<h2>Example</h2>
<p>Let&#8217;s take a look at the classic WordCount MapReduce, written using Crunch:</p>
<pre>
import com.cloudera.crunch.DoFn;
import com.cloudera.crunch.Emitter;
import com.cloudera.crunch.PCollection;
import com.cloudera.crunch.PTable;
import com.cloudera.crunch.Pipeline;
import com.cloudera.crunch.impl.mr.MRPipeline;
import com.cloudera.crunch.lib.Aggregate;
import com.cloudera.crunch.type.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    // Reference a given text file as a collection of Strings.
    PCollection&lt;String&gt; lines = pipeline.readTextFile(args[0]);

    // Define a function that splits each line in a PCollection of Strings into a
    // PCollection made up of the individual words in the file.
    PCollection&lt;String&gt; words = lines.parallelDo(new DoFn&lt;String, String&gt;() {
      public void process(String line, Emitter&lt;String&gt; emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()); // Indicates the serialization format

    // The Aggregate.count method applies a series of Crunch primitives and returns
    // a map of the unique words in the input PCollection to their counts.
    // Best of all, the count() function doesn't need to know anything about
    // the kind of data stored in the input PCollection.
    PTable&lt;String, Long&gt; counts = Aggregate.count(words);

    // Instruct the pipeline to write the resulting counts to a text file.
    pipeline.writeTextFile(counts, args[1]);
    // Execute the pipeline as a MapReduce.
    pipeline.done();
  }
}</pre>
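<p>For readers who want to see the data flow without a cluster, here is a rough in-memory analogue in plain Python. It is only a sketch of what the pipeline above computes, not Crunch itself and not how Crunch executes anything: <code>parallel_do</code>, <code>split_words</code>, and <code>count</code> are hypothetical stand-ins for <code>parallelDo</code>, the anonymous <code>DoFn</code>, and <code>Aggregate.count</code>, and nothing here is compiled into MapReduce jobs.</p>
<pre>
```python
import re
from collections import Counter


def parallel_do(collection, fn):
    """Apply fn to every element; fn returns zero or more outputs per input."""
    for element in collection:
        yield from fn(element)


def split_words(line):
    """Split a line on runs of whitespace, dropping empty tokens."""
    return [word for word in re.split(r"\s+", line) if word]


def count(collection):
    """Map each distinct element to the number of times it occurs."""
    return dict(Counter(collection))


def word_count(lines):
    """Chain the two steps, mirroring the shape of the Crunch example."""
    words = parallel_do(lines, split_words)
    return count(words)


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog"]))
```
</pre>
<p>One small difference is deliberate: the sketch drops empty tokens, whereas Java&#8217;s <code>split("\\s+")</code> yields an empty first token for a line that begins with whitespace.</p>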
<h2>Advantages</h2>
<ol>
<li><strong>It&#8217;s just Java.</strong> Crunch shares a core philosophical belief with Google&#8217;s FlumeJava: <i>novelty is the enemy of adoption</i>. For developers, learning a Java library requires much less up-front investment than learning a new programming language. Crunch provides full access to the power of Java for writing functions, managing pipeline execution, and dynamically constructing new pipelines, obviating the need to switch back and forth between a data flow language and a real programming language.</li>
<li><strong>Natural type system.</strong> Crunch supports reading and writing data that is stored using Hadoop&#8217;s Writable format or <a href="http://avro.apache.org/" target="_blank">Apache Avro</a> records. You do not need to write code that maps data stored in these formats into Crunch&#8217;s type system; they are supported natively. You can even mix and match Writable and Avro types within a single MapReduce: changing the <code>Writables.strings()</code> call to <code>Avros.strings()</code> in the WordCount example will run the MapReduce using Avro serialization instead of Writables.</li>
<li><strong>A modular library released under the Apache License.</strong> Experts in machine learning, text mining, and ETL can craft libraries using Crunch&#8217;s data model, and other developers can use those libraries to build custom pipelines that operate on their data. For example, Crunch can be used to create the glue code that converts raw data into the structured input that a machine learning algorithm expects, and Crunch will compile the glue code and the machine learning algorithm into a single MapReduce.</li>
</ol>
<h2>Future Work</h2>
<p>We are releasing Crunch as a development project, not a product. We&#8217;re eager for developers to play with it and tell us what they like and what they dislike. You can get started with Crunch by downloading it from Cloudera&#8217;s GitHub repository <a href="https://github.com/cloudera/crunch" target="_blank">here</a>.</p>
<p>We have tested the library on a number of our use cases, but there will be bugs and rough edges that we will work out in the coming months. We gladly welcome contributions from the Hadoop ecosystem to help us improve Crunch as we prepare it for submission to the Apache Incubator, especially around:</p>
<ul>
<li>More efficient MapReduce compilation, including cost-based optimization,</li>
<li>Support for HBase and HCatalog as data sources/targets,</li>
<li>Tools and examples that build Crunch pipelines in other JVM languages, such as Scala, JRuby, Clojure, and Jython.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/introducing-crunch/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Apache Sqoop &#8211; Overview</title>
		<link>http://www.cloudera.com/blog/2011/10/apache-sqoop-overview/</link>
		<comments>http://www.cloudera.com/blog/2011/10/apache-sqoop-overview/#comments</comments>
		<pubDate>Thu, 06 Oct 2011 18:49:27 +0000</pubDate>
		<dc:creator>Arvind Prabhakar</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[Apache Sqoop]]></category>
		<category><![CDATA[Hadoop connector]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9138</guid>
		<description><![CDATA[This post provides a high-level overview of Apache Sqoop (incubating). It discusses the general problem addressed by Sqoop and provides simple examples on how to use it. This post is written by Arvind Prabhakar, who is a Sqoop committer.]]></description>
			<content:encoded><![CDATA[<p><em>This blog was originally posted on the Apache Blog: <a href="https://blogs.apache.org/sqoop/entry/apache_sqoop_overview" target="_blank">https://blogs.apache.org/sqoop/entry/apache_sqoop_overview</a></em></p>
<p>Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details such as data consistency, the consumption of production system resources, and the preparation of data for the downstream pipeline. Transferring data using scripts is inefficient and time-consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.</p>
<p>This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at <a title="Apache Sqoop" href="http://incubator.apache.org/sqoop" target="_blank">http://incubator.apache.org/sqoop</a>.</p>
<p>Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.</p>
<p>What happens under the covers when you run Sqoop is straightforward. The dataset being transferred is sliced into partitions, and a map-only job is launched in which each mapper is responsible for transferring one slice of the dataset. Each record is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.</p>
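<p>To make the slicing step concrete, here is a minimal Python sketch. This post does not describe how Sqoop actually picks slice boundaries, so the approach shown, dividing the range of a numeric key column evenly among the mappers, is an assumption for illustration; <em>split_ranges</em>, <em>slice_query</em>, and the <em>id</em> column are hypothetical names, not part of Sqoop.</p>
<pre class="code" style="padding-bottom:10px">
```python
def split_ranges(min_key, max_key, num_mappers):
    """Divide the inclusive key range [min_key, max_key] into contiguous slices.

    Returns a list of (start, end) pairs, both inclusive, at most one per
    mapper. Earlier slices absorb the remainder, so sizes differ by at most
    one. Assumes num_mappers is a positive integer.
    """
    total = max(max_key - min_key + 1, 0)
    slices = min(num_mappers, total)
    if slices == 0:
        return []
    base, extra = divmod(total, slices)
    ranges = []
    start = min_key
    for _ in range(slices):
        size = base
        if extra != 0:
            # Hand out the leftover keys one at a time to the first slices.
            size += 1
            extra -= 1
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def slice_query(table, column, start, end):
    """Build the query one mapper would run for its slice (illustrative only)."""
    return "SELECT * FROM {} WHERE {} BETWEEN {} AND {}".format(
        table, column, start, end
    )


if __name__ == "__main__":
    for start, end in split_ranges(1, 10, 4):
        print(slice_query("ORDERS", "id", start, end))
```
</pre>
<p>With keys 1 through 10 and four mappers, this yields the slices 1-3, 4-6, 7-8, and 9-10, each of which can be transferred independently of the others.</p>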
<p>In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail or advanced functionality.</p>
<h2>Importing Data</h2>
<p>The following command is used to import all data from a table called ORDERS from a MySQL database:</p>
<pre class="code" style="padding-bottom:10px">$ <strong>sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password ****</strong></pre>
<p>In this command the various options specified are as follows:</p>
<ul>
<li><em>import:</em> This is the sub-command that instructs Sqoop to initiate an import.</li>
<li><em>--connect &lt;connect string&gt;, --username &lt;user name&gt;, --password &lt;password&gt;:</em> These are the connection parameters used to connect to the database. They are no different from the parameters you would use for a JDBC connection.</li>
<li><em>--table &lt;table name&gt;:</em> This parameter specifies the table to import.</li>
</ul>
<p>The import is done in two steps, as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the actual data transfer, using the metadata captured in the previous step.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/sqoop/mediaresource/d76fa176-1331-4af3-95cf-ae6a0068c306" alt="Figure 1: Sqoop Import Overview" /></p>
<p style="text-align: center"><strong>Figure 1: Sqoop Import Overview</strong></p>
<p>The imported data is saved in a directory on HDFS based on the table being imported. As is the case with most aspects of Sqoop operation, the user can specify any alternative directory where the files should be populated.</p>
<p>By default, these files contain comma-delimited fields, with newlines separating the records. You can easily override the format in which data is copied over by explicitly specifying the field separator and record terminator characters.</p>
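<p>As a rough illustration of what that layout means, the following Python sketch renders a few rows first with the default comma and newline and then with a tab as the field separator. It is not Sqoop code; <em>format_records</em> and the sample rows are made up for this example.</p>
<pre class="code" style="padding-bottom:10px">
```python
def format_records(rows, field_sep=",", record_sep="\n"):
    """Render rows as delimited text, one terminated record per row."""
    return "".join(
        field_sep.join(str(value) for value in row) + record_sep
        for row in rows
    )


if __name__ == "__main__":
    rows = [(1, "widget", 9.5), (2, "nuts, bolts", 3)]
    # Default layout: the comma inside "nuts, bolts" looks like a separator.
    print(format_records(rows), end="")
    # A tab separator keeps that field unambiguous.
    print(format_records(rows, field_sep="\t"), end="")
```
</pre>
<p>The second row shows why an override can matter: when a value itself contains a comma, the default layout no longer tells a reader where one field ends and the next begins.</p>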
<p>Sqoop also supports different data formats for importing data. For example, you can easily import data in Avro data format by simply specifying the option <em>--as-avrodatafile</em> with the import command.</p>
<p>There are many other options that Sqoop provides which can be used to further tune the import operation to suit your specific requirements.</p>
<h2>Importing Data into Hive</h2>
<p>In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mapping between the data and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition, as the case may be. All of this is done by simply specifying the option <em>--hive-import</em> with the import command.</p>
<pre class="code" style="padding-bottom:10px"><span style="font-family: 'courier new', courier, monospace">$ sqoop import --connect jdbc:mysql://localhost/acmedb \
      --table ORDERS --username test --password **** <strong>--hive-import</strong></span></pre>
<p>When you run a Hive import, Sqoop converts the data from the native datatypes within the external datastore into the corresponding types within Hive. Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported contains newlines or other Hive delimiter characters, Sqoop allows you to remove such characters so that the data is populated correctly for consumption in Hive.</p>
<p>Once the import is complete, you can see and operate on the table just like any other table in Hive.</p>
<h2>Importing Data into HBase</h2>
<p>You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this can be done by specifying the additional options that relate to the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.</p>
<pre class="code" style="padding-bottom:10px"><span style="font-family: 'courier new', courier, monospace">$ sqoop import --connect jdbc:mysql://localhost/acmedb \
        --table ORDERS --username test --password **** \
        <strong>--hbase-create-table --hbase-table ORDERS --column-family mysql</strong></span></pre>
<p>In this command the various options specified are as follows:</p>
<ul>
<li><em>--hbase-create-table:</em> This option instructs Sqoop to create the HBase table.</li>
<li><em>--hbase-table:</em> This option specifies the table name to use.</li>
<li><em>--column-family:</em> This option specifies the column family name to use.</li>
</ul>
<p>The remaining options are the same as for a regular import operation.</p>
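<p>The conversion to string form and then to UTF-8 bytes can be pictured with a short Python sketch. This is an illustration of the idea only: <em>to_hbase_cells</em> and the sample row are hypothetical, the choice of row key is not covered here, and Python&#8217;s string form of a value (for example a float) may not match what Sqoop produces.</p>
<pre class="code" style="padding-bottom:10px">
```python
def to_hbase_cells(row, column_family):
    """Turn one relational row (a dict) into HBase-style cell bytes.

    Returns a dict mapping b"family:column" to the UTF-8 bytes of the
    value's string form.
    """
    cells = {}
    for column, value in row.items():
        qualifier = "{}:{}".format(column_family, column).encode("utf-8")
        cells[qualifier] = str(value).encode("utf-8")
    return cells


if __name__ == "__main__":
    order = {"id": 42, "total": 19.99, "customer": "José"}
    for qualifier, data in to_hbase_cells(order, "mysql").items():
        print(qualifier, data)
```
</pre>
<p>One consequence worth keeping in mind: numbers stored this way compare as text rather than as numbers, so the bytes for 10 sort before the bytes for 9.</p>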
<h2>Exporting Data</h2>
<p>In some cases data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing our example from above &#8211; if data generated by the pipeline on Hadoop corresponded to the ORDERS table in a database somewhere, you could populate it using the following command:</p>
<pre class="code" style="padding-bottom:10px">$ sqoop <strong>export</strong> --connect jdbc:mysql://localhost/acmedb \
        --table ORDERS --username test --password **** \
        <strong>--export-dir /user/arvind/ORDERS</strong></pre>
<p>In this command the various options specified are as follows:</p>
<ul>
<li><em>export:</em> This is the sub-command that instructs Sqoop to initiate an export.</li>
<li><em>--connect &lt;connect string&gt;, --username &lt;user name&gt;, --password &lt;password&gt;:</em> These are the connection parameters used to connect to the database. They are no different from the parameters you would use for a JDBC connection.</li>
<li><em>--table &lt;table name&gt;:</em> This parameter specifies the table which will be populated.</li>
<li><em>--export-dir &lt;directory path&gt;:</em> This is the directory from which data will be exported.</li>
</ul>
<p>Export is done in two steps as depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.</p>
<p><img class="aligncenter" src="https://blogs.apache.org/sqoop/mediaresource/12624986-9e30-430e-a0c7-e12176548f6d" alt="Figure 2: Sqoop Export Overview" /></p>
<p style="text-align: center"><strong>Figure 2: Sqoop Export Overview</strong></p>
<p>Some connectors support staging tables, which help isolate production tables from possible corruption if a job fails for any reason. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.</p>
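<p>The staging pattern can be sketched with Python&#8217;s built-in <em>sqlite3</em> module standing in for the external database. This is an assumption-laden illustration, not how any Sqoop connector is implemented: the table names <em>orders</em> and <em>orders_staging</em>, the batch size, and the helper functions are all hypothetical.</p>
<pre class="code" style="padding-bottom:10px">
```python
import sqlite3


def make_db():
    """Create an in-memory database with a target table and a staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute("CREATE TABLE orders_staging (id INTEGER, item TEXT)")
    return conn


def export_with_staging(conn, rows, batch_size=2):
    """Load rows into the staging table in batches, then merge in one step."""
    conn.execute("DELETE FROM orders_staging")
    conn.commit()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Each batch commits separately, like a map task that uses many
        # small transactions instead of one huge one.
        conn.executemany("INSERT INTO orders_staging VALUES (?, ?)", batch)
        conn.commit()
    # The target table is touched only after every batch has landed, and
    # the move happens inside a single transaction.
    with conn:
        conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
        conn.execute("DELETE FROM orders_staging")


if __name__ == "__main__":
    db = make_db()
    export_with_staging(db, [(1, "widget"), (2, "gadget"), (3, "gizmo")])
    print(db.execute("SELECT * FROM orders ORDER BY id").fetchall())
```
</pre>
<p>If a batch fails partway through, the exception is raised before the merge runs, so the target table is left exactly as it was and only the staging table holds partial data.</p>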
<h2>Sqoop Connectors</h2>
<p>Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop&#8217;s extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.</p>
<p>By default Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.</p>
<p>Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.</p>
<h2>Wrapping Up</h2>
<p>In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features, such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.</p>
<p>More information regarding Sqoop can be found at:</p>
<p><span style="font-size: small">Project Website:&#160;<a target="_blank" title="Apache Sqoop" href="http://incubator.apache.org/sqoop">http://incubator.apache.org/sqoop</a><br />
Wiki:&#160;<a target="_blank" title="Sqoop Wiki" href="https://cwiki.apache.org/confluence/display/SQOOP">https://cwiki.apache.org/confluence/display/SQOOP</a><br />
Project Status: &#160;<a target="_blank" title="Sqoop Project Status" href="http://incubator.apache.org/projects/sqoop.html">http://incubator.apache.org/projects/sqoop.html</a><br />
Mailing Lists:&#160;<a target="_blank" title="Sqoop Mailing Lists" href="https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists">https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/apache-sqoop-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>RecordBreaker: Automatic structure for your text-formatted data</title>
		<link>http://www.cloudera.com/blog/2011/07/recordbreaker-automatic-structure-for-your-text-formatted-data/</link>
		<comments>http://www.cloudera.com/blog/2011/07/recordbreaker-automatic-structure-for-your-text-formatted-data/#comments</comments>
		<pubDate>Wed, 13 Jul 2011 13:00:28 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[data parsing]]></category>
		<category><![CDATA[learn avro]]></category>
		<category><![CDATA[learnavro]]></category>
		<category><![CDATA[text-embedded data]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7960</guid>
		<description><![CDATA[This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike&#8217;s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan.  Mike&#8217;s research interests focus on databases, in particular managing Web data.  Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting.  This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.</em></p>
<p>RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc.) into structured data, without any need to write parsers or extractors.  In particular, RecordBreaker targets Avro as its output format.  The project&#8217;s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.</p>
<p>Hadoop&#8217;s HDFS is often used to store large amounts of text-formatted data: log files, sensor readings, transaction histories, etc.  Much of this data is &#8220;near-structured&#8221;: the data has a format that&#8217;s obvious to a human observer, but is not made explicit in the file itself.</p>
<p>Imagine you have a simple file listing, stored in <tt>listing.txt</tt>:</p>
<pre style="margin-bottom:8px" class="code">
5 mjc staff 170 Mar 14 2011 14:14 bin
5 mjc staff 170 Mar 12 2011 05:13 build
1 mjc staff 11080 Mar 14 2011 14:14 build.xml
</pre>
<p>This &#8220;near-structured&#8221; data has metadata that is obvious to anyone familiar with file listings: a file owner, a file size, a last-modified date, and so on.  People read it easily even though certain strings, such as the date and time, cannot be parsed with simple whitespace breaks.  To process such data with MapReduce, Pig, or a similar tool, a user must explicitly and laboriously reconstruct the metadata that is plain to anyone who just <em>eyeballs the data</em>.</p>
<p>Performing this reconstruction usually entails writing a parser or extractor, often one based on relatively brittle regular expressions.  For some very common data, writing a good parser for it is probably worthwhile.  However, there are also album track listings, temperature readings, flight schedules, and many other kinds of data; the number of good parsers we need to write gets large, quickly.  Writing all of these straightforward extractors, again and again, is a time-consuming and error-prone pain for everyone.  We believe it is a major obstacle to faster and easier data analytics.</p>
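<p>To give a feel for the kind of hand-written extractor RecordBreaker is meant to replace, here is a small Python sketch for the file listing above.  The regular expression and the field names (including the guess that the first column is a link count) are my own assumptions for illustration; none of this is RecordBreaker code.</p>
<pre style="margin-bottom:8px" class="code">
```python
import re

LINE = "1 mjc staff 11080 Mar 14 2011 14:14 build.xml"

# One hand-tuned pattern per file format: this is the brittle part.
LISTING = re.compile(
    r"(\d+) (\S+) (\S+) (\d+) (\w{3} \d{1,2} \d{4}) (\d{2}:\d{2}) (\S+)$"
)


def parse_listing(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LISTING.match(line)
    if match is None:
        return None
    links, owner, group, size, date, time, name = match.groups()
    return {
        "links": int(links), "owner": owner, "group": group,
        "size": int(size), "date": date, "time": time, "name": name,
    }


if __name__ == "__main__":
    # Splitting on whitespace breaks the date into three separate tokens.
    print(LINE.split())
    print(parse_listing(LINE))
    # A file name containing a space defeats the pattern entirely.
    print(parse_listing("1 mjc staff 10 Mar 14 2011 14:14 my file.txt"))
```
</pre>
<p>The pattern works for the three sample lines, but it silently rejects a file name that contains a space, which is exactly the sort of fragility that makes writing these extractors by hand tedious.</p>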
<p>The RecordBreaker project aims to <em>automatically generate structure</em> for text-embedded data.  It consists of two main components.</p>
<hr />
<h2>LearnStructure</h2>
<p><strong>LearnStructure</strong> takes a text file as input and derives a parser that breaks lines of the file into typed fields.  For example, the above file listing is broken into fields that include the file owner <tt>mjc</tt>, the group owner <tt>staff</tt>, etc.  It emits all the schemas and code necessary to turn the raw text file into a file full of structured data.  For example, we discover a JSON schema for the above file listing that looks like this:</p>
<pre style="margin-bottom:8px" class="code">
{
  "type" : "record",
  "name" : "record_1",
  "namespace" : "",
  "doc" : "RECORD",
  "fields" : [ {
    "name" : "base_0",
    "type" : "int",
    "doc" : "Example data: '5', '5', '1'"
  }, {
    "name" : "base_2",
    "type" : "string",
    "doc" : "Example data: 'mjc', 'mjc', 'mjc'"
  }, {
    "name" : "base_4",
    "type" : "string",
    "doc" : "Example data: 'staff', 'staff', 'staff'"
  }, {
    "name" : "base_6",
    "type" : "int",
    "doc" : "Example data: '170', '170', '11080'"
  }, {
    "name" : "base_8",
    "type" : {
      "type" : "record",
      "name" : "base_8",
      "doc" : "",
      "fields" : [ {
        "name" : "month",
        "type" : "int",
        "doc" : ""
      }, {
        "name" : "day",
        "type" : "int",
        "doc" : ""
      }, {
        "name" : "year",
        "type" : "int",
        "doc" : ""
      } ]
    },
    "doc" : "Example data: '(14, 3, 2011)', '(12, 3, 2011)', '(14, 3, 2011)'"
  }, {
    "name" : "base_10",
    "type" : {
      "type" : "record",
      "name" : "base_10",
      "doc" : "",
      "fields" : [ {
        "name" : "hrs",
        "type" : "int",
        "doc" : ""
      }, {
        "name" : "mins",
        "type" : "int",
        "doc" : ""
      }, {
        "name" : "secs",
        "type" : "int",
        "doc" : ""
      } ]
    },
    "doc" : "Example data: '(14, 14, 0)', '(5, 13, 0)', '(14, 14, 0)'"
  }, {
    "name" : "base_12",
    "type" : "string",
    "doc" : "Example data: 'bin', 'build', 'build.xml'"
  } ]
}
</pre>
<p>Of course, the field names here are nonsense.  All of the values, except for subfields of the date and timestamp records, have nondescriptive synthetically-generated names.  The LearnStructure step attempts to recover the type of each field, but has no way to know its name or role.  Obtaining names for these fields is the job of the <strong>SchemaDictionary step</strong>.  For now, we just live with these bad synthetic names.</p>
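<p>A toy version of the type-recovery idea can be written in a few lines of Python.  This is a deliberately naive sketch and not the LearnStructure algorithm: it splits on whitespace only, distinguishes just <tt>int</tt> from <tt>string</tt>, and numbers its synthetic <tt>base_N</tt> names by column index.  The function names are hypothetical.</p>
<pre style="margin-bottom:8px" class="code">
```python
import json

LINES = [
    "5 mjc staff 170 Mar 14 2011 14:14 bin",
    "5 mjc staff 170 Mar 12 2011 05:13 build",
    "1 mjc staff 11080 Mar 14 2011 14:14 build.xml",
]


def infer_type(values):
    """Return "int" if every sample parses as an integer, else "string"."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "int"


def infer_schema(lines):
    """Build an Avro-style record schema from whitespace-separated samples.

    Columns are aligned by position; extra tokens on longer lines are ignored.
    """
    columns = list(zip(*(line.split() for line in lines)))
    fields = []
    for index, samples in enumerate(columns):
        quoted = ", ".join("'{}'".format(sample) for sample in samples)
        fields.append({
            "name": "base_{}".format(index),
            "type": infer_type(samples),
            "doc": "Example data: " + quoted,
        })
    return {"type": "record", "name": "record_1", "fields": fields}


if __name__ == "__main__":
    print(json.dumps(infer_schema(LINES), indent=2))
```
</pre>
<p>On the sample listing this naive version produces nine flat fields, because it has no notion that the month, day, and year belong together.  Grouping them into nested date and time records, as in the schema above, is the harder part of the problem.</p>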
<hr />
<h2>SchemaDictionary</h2>
<p><strong>SchemaDictionary</strong> takes data that&#8217;s been parsed by LearnStructure and applies topic-specific labels.  For example, <tt>mjc</tt> should ideally be labelled as <em>owner</em> or perhaps <em>user</em>.  The parsed <tt>staff</tt> data should be labelled as <em>group</em>.</p>
<p>The <strong>SchemaDictionary</strong> tool matches the newly-parsed data against a known database of structures.  It finds the closest match, then assigns human-understandable names based on the best-matching previously-observed dataset.  For example, with the above data and a small set of known datasets, <strong>SchemaDictionary</strong> can find that <tt>base_10</tt> should actually be <tt>timemodified</tt>, that <tt>base_8</tt> should be <tt>datemodified</tt>, and so on.  Depending on the input data and the known database of structures, this labelling may be more or less accurate.</p>
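<p>The matching idea can be illustrated with a very small Python sketch.  The scoring rule used here, counting the positions at which field types agree, is an assumption chosen for simplicity and is not a description of how SchemaDictionary actually measures closeness.  The dictionary contents are invented, apart from the names <tt>datemodified</tt> and <tt>timemodified</tt> mentioned above.</p>
<pre style="margin-bottom:8px" class="code">
```python
KNOWN = {
    "temperature_log": [("station", "string"), ("celsius", "int")],
    "file_listing": [
        ("links", "int"), ("owner", "string"), ("group", "string"),
        ("size", "int"), ("datemodified", "date"),
        ("timemodified", "time"), ("filename", "string"),
    ],
}


def similarity(unknown_types, known_fields):
    """Count positions where the field types agree."""
    return sum(
        1 for mine, (_, theirs) in zip(unknown_types, known_fields)
        if mine == theirs
    )


def label_fields(unknown_types, dictionary):
    """Pick the closest known schema and borrow its field names.

    dictionary maps a dataset name to a list of (field_name, type) pairs
    and must not be empty. Returns (dataset_name, list of field names).
    """
    best = max(
        dictionary,
        key=lambda name: similarity(unknown_types, dictionary[name]),
    )
    names = [field for field, _ in dictionary[best]]
    return best, names[:len(unknown_types)]


if __name__ == "__main__":
    parsed = ["int", "string", "string", "int", "date", "time", "string"]
    print(label_fields(parsed, KNOWN))
```
</pre>
<p>As with the real tool, the quality of the labels depends entirely on what the dictionary already contains: with no file listing among the known datasets, the best match would simply be the least bad one.</p>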
<p>As mentioned, the target structured data format is <a target="_about" href="http://avro.apache.org/">Avro</a>.  Avro allows efficient cross-platform data serialization, similar to <a target="_about" href="http://incubator.apache.org/thrift/">Thrift</a> or <a target="_about" href="http://code.google.com/p/protobuf/">Protocol Buffers</a>.  Data stored in Avro has many advantages (read <a target="_about" href="http://www.cloudera.com/blog/2011/07/avro-data-interop/"> Doug Cutting&#8217;s recent overview of Avro</a> for more) and many tools either support Avro or will soon: <a target="_about" href="http://hadoop.apache.org/mapreduce/">Hadoop MapReduce</a>, <a target="_about" href="http://pig.apache.org/">Apache Pig</a>, and others.</p>
<hr />
<h2>Related Work</h2>
<p>Our work on the LearnStructure component draws inspiration from the <a target="_about" href="http://www.padsproj.org/index.html">PADS research project</a> (<a target="_about" href="http://www.padsproj.org/index.html">http://www.padsproj.org/index.html</a>), in particular the paper <a target="_about" href="http://www.padsproj.org/papers/popl08.pdf">From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data</a>, by Fisher, Walker, Zhu, and White, published in POPL, 2008.  That paper itself draws on many papers in the area of information extraction and related fields.  The authors have released code for their system, written in ML.  ML is a great language, but is not well-suited to our needs: it is not supported by Avro, and is unlikely to appeal to many of the developers currently involved with the Hadoop ecosystem.</p>
<p>SchemaDictionary is more generally inspired by database schema mapping systems.  (A famous example is described in <a target="_about" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1842&#038;rep=rep1&#038;type=pdf">The Clio Project: Managing Heterogeneity, by Miller, Hernandez, Haas, Yan, Ho, Fagin, and Popa, published in SIGMOD Record 30(1), March 2001, pp.78-83</a>.)  Schema mapping systems are usually designed to help database administrators merge existing databases; for example, when company A purchases company B and must then merge the employee lists.  These tools are often expensive and expect a lot of administrator attention.  In contrast, our SchemaDictionary is for busy data analysts who simply want to check out a novel dataset as quickly as possible.  It is fast and simple, but can only handle relatively simple structures (rendering it inappropriate for databases, but on target for the kind of data that is popular in text-based formats).</p>
<h2>Project</h2>
<p>RecordBreaker works, but is not complete.  It is just the start of what we hope will be many interesting applications and research projects. Please take a look at the <a target="_about" href="https://github.com/cloudera/RecordBreaker">code</a> and <a target="_about" href="http://cloudera.github.com/RecordBreaker/">documentation</a> (the <a target="_about" href="https://github.com/cloudera/RecordBreaker">repo</a> is at <a target="_about" href="https://github.com/cloudera/RecordBreaker">https://github.com/cloudera/RecordBreaker</a>, and the <a target="_about" href="http://cloudera.github.com/RecordBreaker/">tutorial</a> is at <a target="_about" href="http://cloudera.github.com/RecordBreaker/">http://cloudera.github.com/RecordBreaker/</a>). Maybe you can pitch in and help.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/recordbreaker-automatic-structure-for-your-text-formatted-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Data Interoperability with Apache Avro</title>
		<link>http://www.cloudera.com/blog/2011/07/avro-data-interop/</link>
		<comments>http://www.cloudera.com/blog/2011/07/avro-data-interop/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 19:13:37 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8075</guid>
		<description><![CDATA[The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by [...]]]></description>
			<content:encoded><![CDATA[<p>The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components.  Data collected by Flume might be analyzed by Pig and Hive scripts.  Data imported with Sqoop might be processed by a MapReduce program.  To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.</p>
<h1>Data Interoperability</h1>
<p>One might address this data interoperability in several ways, including the following:</p>
<ul>
<li>Each system might be extended to read all the formats generated by the other systems.  In the limit, this approach is not practical, since one cannot easily anticipate all of the formats new systems might generate.</li>
<li>A library of data conversion programs could be assembled. This would unfortunately add a processing step, to convert the data between formats, slowing processing pipelines.  Note however that many data conversion libraries operate by converting data into and out of a <em>lingua franca</em> format, using a single format as a pivot point. &#160;This hints at a third possibility.</li>
<li>Enable each system to read and write a common format. &#160;Some systems might use other formats internally for performance, but whenever data is meant to be accessible to other systems a common format is used.</li>
</ul>
<p>In practice, all of these strategies will be used to some extent.  However, the last strategy, a common format, seems to offer the most efficient path both in terms of engineering effort and processing time.  This article will focus on the use of Avro&#8217;s data file format as such a common format.</p>
<h1>Avro</h1>
<p>Apache&#160;<a href="http://avro.apache.org/">Avro</a> is a data serialization format.  Avro shares many features with Google&#8217;s Protocol Buffers and Apache Thrift, including:</p>
<ul>
<li>Rich data types.</li>
<li>Fast, compact serialization.</li>
<li>Support for many programming languages.</li>
<li>Datatype evolution, also known as&#160;<em>versioning.</em></li>
</ul>
<p>Avro additionally provides some other features that are especially useful when storing data, namely:</p>
<ul>
<li>Avro defines a standard file format.  Avro data files are self-describing, containing the full schema for the data in the file.  Thus users can exchange Avro data files without also having to separately communicate metadata. &#160;Once an Avro data file is written, one will always be able to read it, with full datatype information, without relying on any external software or metadata repository. &#160;Avro data files also support compression, using the deflate (the algorithm behind gzip) or <a href="http://code.google.com/p/snappy/">Snappy</a> codecs. </li>
<li>Avro&#8217;s serialization is more compact.  Avro avoids storing a field identifier with each field value.  For some datasets this savings can be significant. </li>
<li>Avro implementations permit one to dynamically define new datatypes and to easily process previously unseen datatypes, without generation and loading of code.  This provides natural support for script and query languages. </li>
<li>Avro datatypes can define their sort-order, facilitating use of Avro data in MapReduce or ordered key/value stores. </li>
</ul>
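<p>To make the &#8220;self-describing&#8221; point concrete, here is a minimal Python sketch that reads the header of an Avro data file and recovers the embedded schema. It follows the container layout described in the Avro specification (a four-byte magic, a metadata map holding <code>avro.schema</code> and <code>avro.codec</code>, then a 16-byte sync marker). The helper names and the hand-built sample header are assumptions made for this illustration; they are not part of any Avro library, and a real application would use an Avro implementation instead.</p>

```python
import io
import json


def read_long(buf):
    """Decode one zig-zag, variable-length long from a binary stream."""
    shift, acc = 0, 0
    while True:
        byte = buf.read(1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte of this long
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo the zig-zag mapping


def read_bytes(buf):
    """Decode a length-prefixed byte string (also used for map keys)."""
    return buf.read(read_long(buf))


def read_header(buf):
    """Return (metadata dict, sync marker) from the start of an Avro data file."""
    if buf.read(4) != b"Obj\x01":
        raise ValueError("not an Avro data file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:  # a zero-count block ends the map
            break
        if count < 0:  # negative count: the block's byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    return meta, buf.read(16)


# Hand-built sample header (hypothetical file whose records are plain strings).
sample = (
    b"Obj\x01"                 # magic
    + b"\x04"                  # map block with 2 entries
    + b"\x16avro.schema"       # key, 11 bytes
    + b"\x10" + b'"string"'    # value, 8 bytes: the schema as JSON
    + b"\x14avro.codec"        # key, 10 bytes
    + b"\x08null"              # value, 4 bytes: no compression
    + b"\x00"                  # end of map
    + bytes(16)                # sync marker (random in a real file)
)

meta, sync = read_header(io.BytesIO(sample))
schema = json.loads(meta["avro.schema"])
print(schema, meta["avro.codec"], len(sync))
```

<p>Because the schema travels in the header, a reader needs nothing beyond the file itself to learn the datatypes of the records that follow.</p>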
<h1>Avro as a Common Format</h1>
<p>Most of the major ecosystem components already support, or will soon support, reading and writing Avro data files:</p>
<ul>
<li>MapReduce: I added support for Java MapReduce programs, <a href="http://s.apache.org/o6">included</a> in Avro 1.4 and greater.</li>
<li><a href="http://hadoop.apache.org/common/docs/current/streaming.html">Streaming</a>: Tom White from Cloudera has added support for Hadoop Streaming programs to Avro (<a href="https://issues.apache.org/jira/browse/AVRO-808">AVRO-808</a> &amp;&#160;<a href="https://issues.apache.org/jira/browse/AVRO-830">AVRO-830</a>).</li>
<li><a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/">Flume</a> 0.9.2 and above support collecting data in Avro&#8217;s format (<a href="https://issues.apache.org/jira/browse/FLUME-133">FLUME-133</a>), contributed by Jon Hsieh of Cloudera. &#160;Note also that Flume has recently been accepted into the Apache Incubator and will soon be known as Apache Flume.</li>
<li><a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> 1.3 can import data as Avro data files in HDFS from a relational database (<a href="https://issues.cloudera.org/browse/SQOOP-207">SQOOP-207</a>), contributed by Tom White of Cloudera. &#160;Sqoop has also recently been accepted into the Apache Incubator.</li>
<li><a href="http://pig.apache.org/">Pig</a> release 0.9 will be able to read and write Avro data files (<a href="https://issues.apache.org/jira/browse/PIG-1748">PIG-1748</a>), thanks to Lin Guo and Jakob Homan at LinkedIn. </li>
<li><a href="http://hive.apache.org/">Hive</a> support for reading and writing Avro data files has been <a href="https://github.com/jghoman/haivvreo#readme">posted</a> by Jakob Homan of LinkedIn, and should hopefully be included in Hive 0.9 (<a href="https://issues.apache.org/jira/browse/HIVE-895">HIVE-895</a>). </li>
<li><a href="http://incubator.apache.org/hcatalog/">HCatalog</a> input and output drivers have been contributed by Tom White of Cloudera (<a href="https://issues.apache.org/jira/browse/HCATALOG-49">HCATALOG-49</a>).</li>
<li>Thiruvalluvan M. G.&#160;from Yahoo! is working on a column-major format for Avro, which would accelerate Hive and Pig queries (<a href="https://issues.apache.org/jira/browse/AVRO-806">AVRO-806</a>).</li>
</ul>
<p>For folks who are currently using Protocol Buffers or Thrift to store data, some tools for conversion are planned:</p>
<ul>
<li>Raghu Angadi from Twitter is working on tools that will let folks read and write their Thrift-defined data structures as Avro format data (<a href="https://issues.apache.org/jira/browse/AVRO-804">AVRO-804</a>).</li>
<li>We also hope to soon add tools to convert between Protocol Buffers and Avro (<a href="https://issues.apache.org/jira/browse/AVRO-805">AVRO-805</a>).</li>
</ul>
<p>At Cloudera we&#8217;re committed to helping Avro become a common format for the Hadoop ecosystem. &#160;It&#8217;s great to see so many other companies and individuals also investing in Avro.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/avro-data-interop/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB</title>
		<link>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/</link>
		<comments>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/#comments</comments>
		<pubDate>Fri, 13 May 2011 18:26:13 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[guest]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=7937</guid>
		<description><![CDATA[This is a guest repost from the DataXu blog. Click here to view the original post. I recently evaluated several serialization frameworks including Thrift, Protocol Buffers and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working [...]]]></description>
			<content:encoded><![CDATA[<p><i>This is a guest repost from the <a href="http://www.dataxu.com/" target="_about">DataXu</a> blog. Click <a href="http://www.dataxu.com/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/" target="_about">here</a> to view the original post.</i></p>
<p>I recently evaluated several serialization frameworks, including <a href="http://thrift.apache.org/" target="_about">Thrift</a>, <a href="http://code.google.com/p/protobuf/" target="_about">Protocol Buffers</a>, and <a href="http://avro.apache.org/" target="_about">Avro</a>, both for a solution to address our needs as a demand side platform and for a protocol framework to use for the <a href="http://openrtb.info/" target="_about">OpenRTB</a> marketplace. The working draft of OpenRTB 2.0 uses simple <a href="http://www.json.org/" target="_about">JSON</a> encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.</p>
<p>After reviewing many candidates, <a href="http://avro.apache.org/docs/current/" target="_about">Apache Avro</a> proved to be the best solution.</p>
<p><a href="http://avro.apache.org/" target="_about"><img src="https://www.cloudera.com/wp-content/uploads/2011/05/Avro-Image.png" style="float:right;margin-left:8px" alt="Apache Avro" /></a></p>
<p>To demonstrate what differentiates Avro from the other frameworks (the link to my source code is at the end of this post), I put together a quick test of key features. The following are the key advantages of Avro 1.5:</p>
<p>* <strong>Schema evolution</strong> &#8211; Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields.</p>
<p>* <strong>Untagged data</strong> &#8211; Providing a schema with binary data allows each datum to be written without per-field overhead. The result is more compact data encoding, and faster data processing.</p>
<p>* <strong>Dynamic typing</strong> &#8211; This refers to serialization and deserialization without code generation. It complements the code generation, which is available in Avro for statically typed languages as an optional optimization.</p>
<h2>Schema Evolution</h2>
<p>This is the most exciting feature! It allows for building more decoupled and more robust systems. Below, I made significant changes to the schema, and things still work fine. This flexibility is a very interesting feature for rapidly evolving protocols like OpenRTB.</p>
<p>The following example demonstrates how this works. First, I created a new (example) schema. (Avro schemas are defined in JSON):</p>
<p><pre class="code">{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee","null"]}
    ]
}</pre>
</p>
<p>Next, I serialized a few records into a binary file using that schema. After that, I evolved my schema to the following:</p>
<p><pre class="code">{
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "yrs", "type": "int", "aliases": ["age"]},
        {"name": "gender", "type": "string", "default":"unknown"},
        {"name": "emails", "type": {"type": "array", "items": "string"}}
    ]
}</pre>
</p>
<p>This is a snapshot of the changes I made to the schema:</p>
<p>1) Renamed the field &#8216;age&#8217; to &#8216;yrs&#8217;. Thanks to the alias feature, I can retrieve the value of &#8216;age&#8217; by using the field name &#8216;yrs&#8217;.</p>
<p>2) Added a new &#8216;gender&#8217; field, and defined a default value for it. Because this field isn&#8217;t present in records written with the original schema, the default is used to fill it in during deserialization.</p>
<p>3) Removed the &#8216;boss&#8217; field.</p>
<p>Finally, I deserialized the binary data file with this new schema, and printed it out. Success!</p>
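<p>The three changes above can be sketched in a few lines of Python. This is only an illustration of the resolution rules (match by name or alias, fall back to a default, skip fields the reader does not ask for) applied to an already-decoded record held in a dict. The function <code>resolve_record</code> and the sample record are hypothetical, this is not how the Avro libraries are implemented, and it is not the Java code linked at the end of this post.</p>

```python
import json

# The two schemas from the article: the one used to write, and the evolved one.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "Employee", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "boss", "type": ["Employee", "null"]}
]}
""")

READER_SCHEMA = json.loads("""
{"type": "record", "name": "Employee", "fields": [
    {"name": "name", "type": "string"},
    {"name": "yrs", "type": "int", "aliases": ["age"]},
    {"name": "gender", "type": "string", "default": "unknown"},
    {"name": "emails", "type": {"type": "array", "items": "string"}}
]}
""")


def resolve_record(datum, writer_schema, reader_schema):
    """Project a decoded writer-side record (a dict) onto the reader schema."""
    writer_names = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        # A reader field matches a writer field by its name or one of its aliases.
        candidates = [field["name"]] + field.get("aliases", [])
        source = next((n for n in candidates if n in writer_names), None)
        if source is not None:
            resolved[field["name"]] = datum[source]
        elif "default" in field:
            # Field unknown to the writer: fall back to the reader's default.
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for field " + field["name"])
    # Writer fields the reader does not mention (here: boss) are simply skipped.
    return resolved


# Hypothetical record, as it would look after decoding with the writer schema.
old_record = {"name": "Ann", "age": 41, "emails": ["ann@example.com"], "boss": None}
new_record = resolve_record(old_record, WRITER_SCHEMA, READER_SCHEMA)
print(new_record)
```

<p>In this sketch the value written as &#8216;age&#8217; comes back under &#8216;yrs&#8217;, &#8216;gender&#8217; is filled with its default, and &#8216;boss&#8217; is dropped. A reader field with neither a matching writer field nor a default is treated as an error.</p>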
<h2>Untagged Data</h2>
<p>There are two ways to encode data when serializing with Avro: binary or JSON. In the binary file, the schema is included at the beginning of the file. I verified that the binary data was serialized untagged, which resulted in a smaller footprint. Another interesting point is that the schema can be defined, and then the data can be encoded/decoded in JSON, allowing you to define a schema for rich JSON data structures. Anyone needing to implement validation for a JSON protocol (like we did for OpenRTB) will appreciate this feature. And switching between binary and JSON encoding is simply a one-line code change, so moving a JSON protocol to a binary format in order to achieve better performance is pretty straightforward with Avro.</p>
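<p>To show what &#8220;untagged&#8221; means at the byte level, here is a minimal Python sketch of Avro&#8217;s binary encoding for a cut-down, two-field version of the Employee record (a string &#8216;name&#8217; followed by an int &#8216;age&#8217;). The encoding rules (zig-zag variable-length integers, length-prefixed UTF-8 strings, fields written in schema order) come from the Avro specification; the function names and the two-field record are assumptions made for this example.</p>

```python
def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_long(n):
    """Encode an int or long as a zig-zag, variable-length byte sequence."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_employee(record):
    """Write field values in schema order; no field names or tags are emitted."""
    # Assumed schema for this sketch: fields "name" (string), then "age" (int).
    return encode_string(record["name"]) + encode_long(record["age"])


encoded = encode_employee({"name": "Al", "age": 30})
print(encoded.hex())  # 04416c3c
```

<p>The record takes four bytes: a length byte, the two characters of the name, and one byte for the age. Nothing in the output identifies which field is which, which is why a reader needs the schema to decode it.</p>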
<h2>Dynamic Typing</h2>
<p>The key abstraction is GenericData.Record. This is essentially a set of name-value pairs where name is the field name, and value is one of the Avro supported value types. I found the dynamic typing to be very easy to use. When a generic record is instantiated, you have to provide its schema, parsed from a JSON-encoded schema definition. To access the fields, just use put/get methods like you would with any map. This approach is referred to as &#8220;generic&#8221; in Avro, in contrast to the &#8220;specific&#8221; (static, code-generated) approach also supported by Avro. The extra flexibility of the generic data handling has performance implications. But, this excellent benchmark &#8211; <a href="https://github.com/eishay/jvm-serializers/wiki/" target="_about">https://github.com/eishay/jvm-serializers/wiki/</a> &#8211; shows the penalty is minor, and the benefit is a simplified code base.</p>
<p>In conclusion, Avro is a unique serialization framework that works, although it took a bit of experimentation to get the code working. If you are interested in my Java code for an example of how Avro can be used, you can find it here: <a href="https://github.com/rfoldes/Avro-Test" target="_about">https://github.com/rfoldes/Avro-Test</a>.</p>
<p>Robert Foldes</p>
<p>Senior Architect, <a href="http://www.dataxu.com/" target="_about">DataXu</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/05/three-reasons-why-apache-avro-data-serialization-is-a-good-choice-for-openrtb/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tracing with Avro</title>
		<link>http://www.cloudera.com/blog/2010/09/tracing-with-avro/</link>
		<comments>http://www.cloudera.com/blog/2010/09/tracing-with-avro/#comments</comments>
		<pubDate>Fri, 03 Sep 2010 14:00:30 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4639</guid>
		<description><![CDATA[Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer. In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro&#8217;s RPC functionality. It [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.</strong></em></p>
<p>In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro&#8217;s RPC functionality.</p>
<p>It is common knowledge that tracing in distributed systems can be difficult. In user-facing web services, a front-end function may recursively trigger several function calls to mid and back-tier services. In offline processing, data-center storage layers may distribute data across several hosts, querying one or many of them when a client requests a file. In either case, the inter-dependency of components makes it difficult to pinpoint the source of a slowdown or hang-up when they inevitably occur.</p>
<div>
<p>AvroTrace is designed as a first responder for diagnosing problems in distributed systems that use Avro for RPC transport. It has two components, a real-time monitoring dashboard and an offline trace analyzer. Both run as low-overhead Avro plugins which store and propagate tracing meta-data among RPC clients and servers. The monitoring dashboard is accessible via a web interface on any Avro server, delivering a &#8220;snapshot&#8221; of the most recent RPC activity. The offline analysis tool offers a basic interface for collecting, aggregating, and analyzing this data to identify problem spots. It is largely based on <a href="http://research.google.com/pubs/pub36356.html">Google&#8217;s Dapper</a> tracing infrastructure, which is itself inspired by <a href="http://www.x-trace.net/wiki/doku.php">X-Trace</a> and other academic tracing research.</p>
<p>Below is an example trace analysis of a recursive RPC call pattern. In the example application, one remote call, getFile(), triggers two other RPCs, getFileContents() and getFileMeta(). Avro&#8217;s tracing has detected this particular pattern and offers a dashboard view summarizing average timing and payload data. It is also showing detailed graphs for one of the specific nodes in this pattern, getFileContents(), presenting a visual history of timing (top) and payload (bottom) analytics.</p>
<p>Turnkey tracing is just one of many reasons to use Avro. &#160;I recently became a committer on the Avro project and I look forward to supporting and improving trace functionality in the coming months!</p>
<p style="text-align: center"><a href="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png"><img class="aligncenter size-full wp-image-4657" src="http://www.cloudera.com/wp-content/uploads/2010/09/Untitled.png" alt="" width="700" /></a><em> </em></p>
<h5 style="text-align: center"><em>*Click on any of the graphs or stats for a larger version</em></h5>
<h2><em>Learn more about Avro and other Hadoop projects at </em><em><a href="http://www.cloudera.com/company/press-center/hadoop-world-nyc/"><span style="color: #359ac9">Hadoop World!</span></a></em></h2>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/tracing-with-avro/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Avro 1.3.0</title>
		<link>http://www.cloudera.com/blog/2010/03/avro-1-3-0/</link>
		<comments>http://www.cloudera.com/blog/2010/03/avro-1-3-0/#comments</comments>
		<pubDate>Mon, 01 Mar 2010 22:26:56 +0000</pubDate>
		<dc:creator>Matt Massie</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=2609</guid>
		<description><![CDATA[Avro was added to the Hadoop family last April and last year there were three Apache Avro releases: 1.0.0 in July, 1.1.0 in September and 1.2.0 in October. &#160;After the 1.2.0 release, Doug Cutting introduced Avro: a New Format for Data Interchange on this blog and the Avro team went right to work building the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org/avro/">Avro</a> was added to the <a href="http://hadoop.apache.org/">Hadoop</a> family last April and last year there were three Apache Avro releases: <strong>1.0.0</strong> in July, <strong>1.1.0</strong> in September and <strong>1.2.0</strong> in October. &#160;After the 1.2.0 release, Doug Cutting introduced <a href="http://www.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/">Avro: a New Format for Data Interchange</a> on this blog and the Avro team went right to work building the next release of Avro.</p>
<p>It&#8217;s a new year and there&#8217;s a new Avro: <strong>1.3.0</strong>.</p>
<p>Starting with Avro 1.3.0, the Avro team is releasing packages specially tailored to consumers of each language. &#160;For example, Python users can download an egg, Java users can manage jars using Maven and C/C++ users can grab an autotools package ready to&#160;<code>./configure; make</code>. &#160;Speaking of languages, we&#8217;re thrilled to announce that there&#8217;s a Ruby implementation for Avro now!</p>
<p>The Avro specification has been updated to include support for Avro RPC over HTTP. &#160;Currently, only Java and Python support this new RPC specification but you can expect other languages to follow. &#160;The Avro team also designed a test framework to ensure&#160;interoperability between any mix of Avro RPC clients and servers.</p>
<p>In Avro 1.3.0, there&#8217;s a new Avro data file format that is simpler, better suited for compression and provides support for streaming Avro data. &#160;You&#8217;ll find support for this new file format in Ruby, Python, Java and C, giving you an array of languages to choose from for reading and writing Avro data.</p>
<p>There have been more features added to Avro than can fit in a single blog post but here are some of the highlights.</p>
<hr />
<h2 style="text-align: center;">Java</h2>
<ul>
<li>Substantial improvements to Reflection API.  Now uses java.lang.String for Avro strings, either Java collections or arrays for Avro arrays, etc.</li>
<li>New GenAvro tool provides a high-level syntax for schemas and protocols.</li>
<li>Command-line tools jar for debugging.</li>
<li>An RPC statistics system.</li>
<li>Support for compression in data files</li>
<li>Better Maven support including a mvn-install ant task to publish jar to local Maven repository, plus source and javadoc artifacts.</li>
<li>Substantial performance improvements.</li>
<li>Many bug fixes.</li>
</ul>
<h2 style="text-align: center;">Python</h2>
<ul>
<li>Rewritten to be slightly more Pythonic, simpler, and with greater test coverage</li>
<li>RPC over HTTP support</li>
<li>RPC and data file interoperability</li>
<li>New command-line utility for sending and receiving RPCs</li>
<li>Python eggs created</li>
</ul>
<h2 style="text-align: center;">C++</h2>
<p>The C++ implementation now uses autotools for its build, has a new API for checking schema resolution and provides a new tutorial to make it easier for you to get up and running with Avro in C++.</p>
<h2 style="text-align: center;">Ruby</h2>
<p>Ruby hackers will be happy to hear that Ruby has been added to Avro 1.3.0 complete with support for the new data file format.</p>
<h2 style="text-align: center;">C</h2>
<p>The C implementation has been completely rewritten from top to bottom and</p>
<ul>
<li>supports reading and writing the new Avro data file format</li>
<li>adds a contact database example to make it easier for you to learn the Avro C API</li>
<li>provides schema validation, promotion and projection</li>
<li>allows schema validation to be optional</li>
<li>removes all dependencies on external libraries (e.g. APR, APR-util)</li>
<li>embeds <a href="http://www.digip.org/jansson/">jansson</a> for JSON parsing</li>
</ul>
<hr />
<p>To download Avro 1.3.0, visit the <a href="http://hadoop.apache.org/avro/releases.html">Avro releases page</a>. &#160;Once you&#8217;ve downloaded Avro, you might want to take a look at the <a href="http://hadoop.apache.org/avro/docs/1.3.0/">Avro documentation page</a> as well.</p>
<p>You can contact the Avro team by visiting the <code>#avro</code> irc channel on irc.freenode.net or through <a href="http://hadoop.apache.org/avro/mailing_lists.html">one of the Avro mailing lists</a>. &#160;The Avro team is always open to suggestions about future features and would love to hear about your experiences using Avro 1.3.0.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/03/avro-1-3-0/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

