<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; hive</title>
	<atom:link href="http://www.cloudera.com/blog/tag/hive/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Data Interoperability with Apache Avro</title>
		<link>http://www.cloudera.com/blog/2011/07/avro-data-interop/</link>
		<comments>http://www.cloudera.com/blog/2011/07/avro-data-interop/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 19:13:37 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8075</guid>
		<description><![CDATA[The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by [...]]]></description>
			<content:encoded><![CDATA[<p>The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components.  Data collected by Flume might be analyzed by Pig and Hive scripts.  Data imported with Sqoop might be processed by a MapReduce program.  To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.</p>
<h1>Data Interoperability</h1>
<p>One might address this data interoperability in a variety of ways, including the following:</p>
<ul>
<li>Each system might be extended to read all the formats generated by the other systems.  In the limit, this approach is not practical, since one cannot easily anticipate all of the formats new systems might generate.</li>
<li>A library of data conversion programs could be assembled. This would unfortunately add a processing step, to convert the data between formats, slowing processing pipelines.  Note however that many data conversion libraries operate by converting data into and out of a <em>lingua franca</em> format, using a single format as a pivot point. &#160;This hints at a third possibility.</li>
<li>Enable each system to read and write a common format. &#160;Some systems might use other formats internally for performance, but whenever data is meant to be accessible to other systems a common format is used.</li>
</ul>
<p>In practice all of these strategies will be used to some extent.  However, the last strategy, a common format, seems to offer the most efficient path both in terms of engineering effort and processing time.  This article will focus on the use of Avro&#8217;s data file format as such a common format.</p>
<h1>Avro</h1>
<p>Apache&#160;<a href="http://avro.apache.org/">Avro</a> is a data serialization format.  Avro shares many features with Google&#8217;s Protocol Buffers and Apache Thrift, including:</p>
<ul>
<li>Rich data types.</li>
<li>Fast, compact serialization.</li>
<li>Support for many programming languages.</li>
<li>Datatype evolution, also known as&#160;<em>versioning.</em></li>
</ul>
<p>Avro additionally provides some other features that are especially useful when storing data, namely:</p>
<ul>
<li>Avro defines a standard file format.  Avro data files are self-describing, containing the full schema for the data in the file.  Thus users can exchange Avro data files without also having to separately communicate metadata. &#160;Once an Avro data file is written, one will always be able to read it, with full datatype information, without relying on any external software or metadata repository. &#160;Avro data files also support compression, using Gzip or <a href="http://code.google.com/p/snappy/">Snappy</a> codecs. </li>
<li>Avro&#8217;s serialization is more compact.  Avro avoids storing a field identifier with each field value.  For some datasets this savings can be significant. </li>
<li>Avro implementations permit one to dynamically define new datatypes and to easily process previously unseen datatypes, without generation and loading of code.  This provides natural support for script and query languages. </li>
<li>Avro datatypes can define their sort-order, facilitating use of Avro data in MapReduce or ordered key/value stores. </li>
</ul>
<h1>Avro as a Common Format</h1>
<p>Most of the major ecosystem components already support, or will soon support, reading and writing Avro data files:</p>
<ul>
<li>MapReduce: I added support for Java MapReduce programs, <a href="http://s.apache.org/o6">included</a> in Avro 1.4 and greater.</li>
<li><a href="http://hadoop.apache.org/common/docs/current/streaming.html">Streaming</a>: Tom White from Cloudera has added support for Hadoop Streaming programs to Avro (<a href="https://issues.apache.org/jira/browse/AVRO-808">AVRO-808</a> &amp;&#160;<a href="https://issues.apache.org/jira/browse/AVRO-830">AVRO-830</a>).</li>
<li><a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/">Flume</a> 0.9.2 and above support collecting data in Avro&#8217;s format (<a href="https://issues.apache.org/jira/browse/FLUME-133">FLUME-133</a>), contributed by Jon Hsieh of Cloudera. &#160;Note also that Flume has recently been accepted into the Apache Incubator and will soon be known as Apache Flume.</li>
<li><a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> 1.3 can import data as Avro data files in HDFS from a relational database (<a href="https://issues.cloudera.org/browse/SQOOP-207">SQOOP-207</a>), contributed by Tom White of Cloudera. &#160;Sqoop has also recently been accepted into the Apache Incubator.</li>
<li><a href="http://pig.apache.org/">Pig</a> release 0.9 will be able to read and write Avro data files (<a href="https://issues.apache.org/jira/browse/PIG-1748">PIG-1748</a>), thanks to Lin Guo and Jakob Homan at LinkedIn. </li>
<li><a href="http://hive.apache.org/">Hive</a> support for reading and writing Avro data files has been <a href="https://github.com/jghoman/haivvreo#readme">posted</a> by Jakob Homan of LinkedIn, and should hopefully be included in Hive 0.9 (<a href="https://issues.apache.org/jira/browse/HIVE-895">HIVE-895</a>). </li>
<li><a href="http://incubator.apache.org/hcatalog/">HCatalog</a> input and output drivers have been contributed by Tom White of Cloudera (<a href="https://issues.apache.org/jira/browse/HCATALOG-49">HCATALOG-49</a>).</li>
<li>Thiruvalluvan M. G.&#160;from Yahoo! is working on a column-major format for Avro, which would accelerate Hive and Pig queries (<a href="https://issues.apache.org/jira/browse/AVRO-806">AVRO-806</a>).</li>
</ul>
<p>For folks who are currently using Protocol Buffers or Thrift to store data, some tools for conversion are planned:</p>
<ul>
<li>Raghu Angadi from Twitter is working on tools that will let folks read and write their Thrift-defined data structures as Avro format data (<a href="https://issues.apache.org/jira/browse/AVRO-804">AVRO-804</a>).</li>
<li>We also hope to soon add tools to convert between Protocol Buffers and Avro (<a href="https://issues.apache.org/jira/browse/AVRO-805">AVRO-805</a>).</li>
</ul>
<p>At Cloudera we&#8217;re committed to helping Avro become a common format for the Hadoop ecosystem. &#160;It&#8217;s great to see so many other companies and individuals also investing in Avro.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/avro-data-interop/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hadoop Lab at JavaOne</title>
		<link>http://www.cloudera.com/blog/2010/10/hadoop-lab-at-javaone/</link>
		<comments>http://www.cloudera.com/blog/2010/10/hadoop-lab-at-javaone/#comments</comments>
		<pubDate>Tue, 26 Oct 2010 18:47:46 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[javaone]]></category>
		<category><![CDATA[oracle openworld]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5179</guid>
		<description><![CDATA[Guest post by Daniel Templeton, Product Manager at Oracle. Aside from JavaOne &#8217;10 having a new home as part of the greater&#160;Oracle OpenWorld conference, it was business as usual this year. Lots&#160;of great sessions, lots of interesting labs, and lots and lots of&#160;excited developers. (I think there may have even been more attendees&#160;than the conference [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;"><em><strong>Guest post by Daniel Templeton, Product Manager at Oracle.</strong></em></p>
<p style="text-align: justify;">Aside from JavaOne &#8217;10 having a new home as part of the greater&#160;Oracle OpenWorld conference, it was business as usual this year.  Lots&#160;of great sessions, lots of interesting labs, and lots and lots of&#160;excited developers.  (I think there may have even been more attendees&#160;than the conference planners expected.)  This year <a href="http://hadoop.apache.org">Hadoop</a> joined the ranks of the&#160;JavaOne hands-on labs with a lab co-produced by Oracle and&#160;Cloudera.</p>
<p style="text-align: justify;"><strong>JavaOne Hands-on Lab S314413: Extracting Real Value from Your&#160;Data With Apache Hadoop</strong> was offered as a two-hour interactive lab&#160;designed to introduce attendees to the Hadoop environment, including&#160;writing a MapReduce program, writing a custom input reader, running,&#160;monitoring, and managing Hadoop jobs, and working with the <a href="http://hadoop.apache.org/hive/">Hive data warehousing&#160;platform</a>.  The lab was designed for participants with at least&#160;some Java programming experience but not necessarily any prior&#160;exposure to Hadoop.</p>
<p style="text-align: justify;">In case you missed the lab at JavaOne, Oracle and Cloudera are both&#160;making the lab materials available online.  Oracle will post the&#160;materials as part of the greater JavaOne presentations posting.&#160;Cloudera has already <a href="http://training.cloudera.com/cloudera/S314413_hadoop.zip">posted&#160;the lab materials online</a> in the <a href="http://training.cloudera.com//">training section</a> of the&#160;website.</p>
<p style="text-align: justify;">When you download the zip file, you will find a lab workbook&#160;as a PDF in the root directory.  At the back of the workbook, you will&#160;find an appendix that describes how to set up your own lab&#160;environment.  I highly recommend that you grab the <a href="http://www.cloudera.com/downloads/">Cloudera Distribution for&#160;Hadoop (v2)</a> to use as an environment for the lab.  Cloudera even&#160;makes a prebuilt Linux/Hadoop environment available as <a href="http://cloudera-vm.s3.amazonaws.com/cloudera-training-0.3.3.tar.bz2">a&#160;virtual machine</a>.  The lab was written for Solaris 11 Express and&#160;NetBeans, but you should still be able to do the lab on another OS&#160;with another IDE.</p>
<p style="text-align: justify;">At JavaOne, the lab was very successful.  Turnout was good and the&#160;comments were great! I&#8217;ve already incorporated lots of great feedback&#160;from that session into the set of lab materials that Cloudera is now&#160;hosting, but I&#8217;m always happy to hear any additional comments and/or&#160;feedback.  Happy coding!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/10/hadoop-lab-at-javaone/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop World: NYC &#8211; Training</title>
		<link>http://www.cloudera.com/blog/2010/08/hadoopworld-training/</link>
		<comments>http://www.cloudera.com/blog/2010/08/hadoopworld-training/#comments</comments>
		<pubDate>Thu, 19 Aug 2010 15:00:23 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[hadoopworld]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4354</guid>
		<description><![CDATA[Hadoop Training surrounding Hadoop World: NYC.]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify">Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.</p>
<p style="text-align: justify">We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects, we have the training you are looking for.</p>
<p style="text-align: center"><img class="size-full wp-image-4403    aligncenter" src="http://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" width="169" height="130" /></p>
<p style="text-align: justify">Included with our top-notch Hadoop training you will have full access to Hadoop World free of charge.</p>
<p style="text-align: justify">Available Training Sessions include:<span id="more-4354"></span></p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 11:</span></h2>
<h3 style="text-align: justify"><em>Introduction to Hadoop</em>:&#160;<a href="http://www.eventbrite.com/event/762326138">http://www.eventbrite.com/event/762326138</a></h3>
<p style="text-align: justify">This one-day course provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop. This session is designed for developers, analysts or system administrators that are new to Hadoop. This course provides the pre-requisite knowledge for the later classes: Developer Training, Administrator Training or Analyzing Data with Hive and Pig.</p>
<h3 style="text-align: justify"><em>Hadoop Essentials For Managers</em>:&#160;<a href="http://www.eventbrite.com/event/762237874">http://www.eventbrite.com/event/762237874</a></h3>
<p style="text-align: justify">This one-day course will give decision-makers the information they need to know about Apache Hadoop, answering questions such as:</p>
<ul style="text-align: justify">
<li>When is Hadoop appropriate?</li>
<li>What are people using Hadoop for?</li>
<li>How does Hadoop fit into our existing environment?</li>
<li>What do I need to know about choosing Hadoop?</li>
</ul>
<h3 style="text-align: justify"><em>Cloudera HUE SDK Training</em>:&#160;<a href="http://www.eventbrite.com/event/764021208">http://www.eventbrite.com/event/764021208</a></h3>
<p style="text-align: justify">Cloudera Hue provides developers with back end APIs to simplify interacting with Hadoop and front end APIs to deliver rich, web based, graphical user experiences. For this training, developers should have experience building web apps using modern MVC frameworks and Ajax. Experience with Python and Django is a strong plus. In this session we spend half the day covering the following topics, and the other half of the day interactively building applications with the Cloudera Hue team.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 13 &amp; 14:</span></h2>
<h3 style="text-align: justify"><em>Developer Training &amp; Certification</em>:&#160;<a href="http://www.eventbrite.com/event/762320120">http://www.eventbrite.com/event/762320120</a></h3>
<p style="text-align: justify">In this two-day hands-on session, developers learn the MapReduce framework and how to write programs against its API. In addition to learning how to write individual MapReduce jobs, we discuss design techniques for larger workflows. This course also covers advanced skills for debugging MapReduce programs and optimizing their performance. At the end of the course, attendees have the option to take a certification exam documenting their understanding of the concepts taught during the training session.</p>
<h3 style="text-align: justify"><em>Administrator Training &amp; Certification:</em> <a href="http://www.eventbrite.com/event/762677188">http://www.eventbrite.com/event/762677188</a></h3>
<p style="text-align: justify">This two-day hands-on session covers the system administration aspects of Hadoop, from installation and configuration to load balancing and tuning, including diagnosing and solving problems in your deployment. At the end of the course, attendees have the option of taking a certification exam documenting their understanding of the concepts taught at the training session.</p>
<h3 style="text-align: justify"><em>Analyzing Data with Hive and Pig:</em> <a href="http://www.eventbrite.com/event/762318114">http://www.eventbrite.com/event/762318114</a></h3>
<p style="text-align: justify">Cloudera&#8217;s two-day hands-on course on Hive and Pig is designed for people who have a basic understanding of how Hadoop works and want to utilize these languages for analysis of their data. Hive makes Hadoop accessible to users who already know SQL; Pig is similar to popular scripting languages. This course teaches you how to process data by using filters, joins, user-defined functions and more.</p>
<h2 style="text-align: justify"><span style="text-decoration: underline">Oct 15:</span></h2>
<h3 style="text-align: justify"><em>HBase Training</em>:&#160;<a href="http://www.eventbrite.com/event/762317111">http://www.eventbrite.com/event/762317111</a></h3>
<p style="text-align: justify">This one-day hands-on course gives you the necessary knowledge for using HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. This class covers the HBase architecture, data model, and Java API as well as advanced topics and best practices. This course is for developers who already have a basic understanding of Hadoop (Java experience is recommended).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/08/hadoopworld-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig</title>
		<link>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/</link>
		<comments>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 13:52:22 +0000</pubDate>
		<dc:creator>John Kreisa</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[administration]]></category>
		<category><![CDATA[developer]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4177</guid>
		<description><![CDATA[Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your [...]]]></description>
			<content:encoded><![CDATA[<p>Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact <a href="mailto:sales@cloudera.com">sales@cloudera.com</a> for more information regarding on-site training, or visit <a href="http://www.cloudera.com/hadoop-training">www.cloudera.com/hadoop-training</a> to view our public course schedule.</p>
<p>Cloudera&#8217;s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.</p>
<p>Our Hive and Pig course is designed for developers who are skilled with SQL or scripting languages, but who are not Java experts. Hive and Pig are two approaches which allow non-Java programmers to access and manipulate massive amounts of data while abstracting away the complexities of MapReduce. Hive offers an SQL-like interface, while Pig&#8217;s scripting language, named PigLatin, is very easy for developers to learn. This course covers both technologies, and includes multiple hands-on exercises to reinforce key concepts.</p>
<p>Cloudera&#8217;s Hadoop for System Administrators course has recently been expanded from one day to two, and covers the important issues for System Administrators charged with looking after Hadoop clusters. Topics include planning and deploying the cluster, managing MapReduce jobs, scheduling jobs using the Fair Scheduler, cluster monitoring and troubleshooting, populating HDFS from existing relational database management systems with Sqoop, and using Flume to import logs and other files into HDFS.</p>
<p>Our most popular course, Hadoop for Developers, is a three-day offering which covers everything from an introduction to HDFS and MapReduce right through to advanced MapReduce APIs and algorithms. Students learn to build MapReduce jobs through a combination of instructor-led training and hands-on exercises; the course includes an exam offering students the chance to earn Cloudera Certified Hadoop Developer credentials.</p>
<p>A complete list of events including upcoming training is available here: <a href="http://www.cloudera.com/company/events/">http://www.cloudera.com/company/events/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/07/announcing-two-new-training-classes-from-cloudera-introduction-to-hbase-and-analyzing-data-with-hive-and-pig/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thrift, Scribe, Hive, and Cassandra: Open Source Data Management Software</title>
		<link>http://www.cloudera.com/blog/2008/10/thrift-scribe-hive-and-cassandra-open-source-data-management-software/</link>
		<comments>http://www.cloudera.com/blog/2008/10/thrift-scribe-hive-and-cassandra-open-source-data-management-software/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 19:11:07 +0000</pubDate>
		<dc:creator>Jeff Hammerbacher</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[cassandra]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[scribe]]></category>
		<category><![CDATA[thrift]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=11</guid>
		<description><![CDATA[Apache Hadoop exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the Apache 2.0 license: Thrift: A cross-language RPC framework that powers many of Facebook&#8217;s services, including search, ads, and chat. Among other [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Apache Hadoop" href="http://www.hadoop.com">Apache Hadoop</a> exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the <a title="Apache 2.0 License" href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache 2.0 license</a>:</p>
<ul>
<li><a title="Apache Thrift" href="http://incubator.apache.org/thrift">Thrift</a>: A cross-language RPC framework that powers many of Facebook&#8217;s services, including search, ads, and chat. Among other things, Thrift defines a compact binary serialization format that is often used to persist data structures for later analysis.</li>
<li><a title="Apache Scribe" href="http://sourceforge.net/projects/scribeserver/">Scribe</a>: A Thrift service for distributed logfile collection. Scribe was designed to run as a daemon process on every node in your data center and to forward log files from any process running on that machine back to a central pool of aggregators. Because of its ubiquity, a major design point was to make Scribe consume as little CPU as possible.</li>
<li><a title="Apache Hive" href="http://wiki.apache.org/hadoop/Hive">Hive</a>: Once the data has been serialized using Thrift and collected using Scribe, it can be loaded into a Hadoop cluster for analysis. Running Hive above your Hadoop cluster will allow you to query the data using a SQL-like syntax; Hive will also manage the partitioning of logs inside the Hadoop Distributed File System.</li>
<li><a title="Cassandra" href="http://code.google.com/p/the-cassandra-project/">Cassandra</a>: If you&#8217;ve got millions of users requesting and updating data, Cassandra can help you scale with your community. Cassandra was designed to power inbox search at Facebook and is now storing an index of around 35 TB. Design points included incremental scalability and low system administration overhead; Cassandra could be useful in many places where a horizontally partitioned (&#8220;sharded&#8221;) MySQL instance is currently deployed.</li>
</ul>
<p>I was recently invited by Robert Grossman of <a title="Open Data" href="http://www.opendatagroup.com">Open Data</a> to speak about these projects at the inaugural <a title="Cloud Computing and Its Applications" href="http://www.cca08.org">Cloud Computing and Its Applications</a> conference in Chicago. You can check out the slides from my talk below:<br />
<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="355" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" /><embed type="application/x-shockwave-flash" width="425" height="355" src="http://static.slideshare.net/swf/ssplayer2.swf?doc=20081022cca-1224867567253598-9&amp;stripped_title=20081022cca-presentation" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p>All of these projects have small but growing user communities. I hope you&#8217;ll find them useful for your data management projects, and I look forward to seeing a few new users on the mailing lists soon.</p>
<p>&#8211; Jeff Hammerbacher, VP Product and Chief Scientist, Cloudera</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2008/10/thrift-scribe-hive-and-cassandra-open-source-data-management-software/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

