<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; careers</title>
	<atom:link href="http://www.cloudera.com/blog/category/careers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Caching in HBase: SlabCache</title>
		<link>http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/</link>
		<comments>http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/#comments</comments>
		<pubDate>Fri, 06 Jan 2012 13:00:52 +0000</pubDate>
		<dc:creator>Li Pi</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Cloudera careers]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9607</guid>
		<description><![CDATA[This was my summer internship project at Cloudera, and I&#8217;m very thankful for the level of support and mentorship I&#8217;ve received from the HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p><em>This was my summer internship project at Cloudera, and I&#8217;m very thankful for the level of support and mentorship I&#8217;ve received from the HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn&#8217;t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.</em></p>
<h2>Background</h2>
<p>The amount of memory available on a commodity server has increased drastically in step with Moore’s law. Today, it&#8217;s quite feasible to have up to 96 gigabytes of RAM on a mid-range commodity server. This extra memory is good for databases such as HBase, which rely on in-memory caching to boost read performance.</p>
<p>However, despite the availability of high-memory servers, the garbage collection algorithms available on production-quality JDKs have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency-sensitive user applications.</p>
<h2>Garbage Collection</h2>
<p>The following is meant to be a quick summary of an immensely complex topic. If you would like a more detailed explanation of garbage collection, <a href="http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/" target="_blank">check out this post</a>.</p>
<p>HBase, along with the rest of the Apache Hadoop ecosystem, is built in Java. This gives us access to an incredibly well-optimized virtual machine and an excellent mostly-concurrent garbage collector in the form of Concurrent-Mark-Sweep (CMS). However, large heaps remain a weakness, as CMS collects garbage without moving it around, potentially causing the free space to be spread throughout the heap instead of in a large contiguous chunk. Given enough time, fragmentation will require a full, stop-the-world garbage collection with a copying collector capable of relocating objects. This results in a potentially long stop-the-world pause, and acts as a practical limit to the size of our heap.</p>
<p>Garbage collectors that do not require massive stop-the-world compactions do exist, but they are not presently suitable for use with HBase. The Garbage-First (G1) collector, included in recent versions of the JVM, is one promising example, but early testing still indicates that it exhibits some <a href="http://www.quora.com/What-is-the-status-of-the-implementation-of-the-G1-garbage-collector-for-the-JVM/answer/Todd-Lipcon" target="_blank">flaws</a>. JVMs from other (non-Oracle) vendors which offer low-pause concurrent garbage collectors also exist, but they are not in widespread use by the HBase and Hadoop communities.</p>
<h2>The Status Quo</h2>
<p>Currently, in order to utilize all available memory, we allocate a smaller heap and let the OS utilize the rest of the memory. In this case, the memory isn’t wasted &#8211; it’s used by the filesystem cache. While this does give a noticeable performance improvement, it has its drawbacks. Data in the FS cache is still treated as a file, requiring us to checksum and verify it. We also have no guarantee about what the filesystem cache will do, and have only limited control over its eviction policy. While the Linux FS cache is nominally an LRU cache, other processes or jobs running on the system may flush our cache, adversely impacting performance. The FS cache is better than letting the memory go to waste, but it&#8217;s neither the most efficient nor the most consistent solution.</p>
<h2>Enter SlabCache</h2>
<p>Another option would be to manually manage the cache within Java via Slab Allocation &#8211; opting to avoid garbage collection altogether. This is the approach I implemented in <a href="https://issues.apache.org/jira/browse/HBASE-4027" target="_blank">HBASE-4027</a>.</p>
<p>SlabCache operates by allocating a large quantity of contiguous memory, and then performing <a href="http://en.wikipedia.org/wiki/Slab_allocation" target="_blank">Slab Allocation</a> within that block of memory. Buffers sized to match the likely sizes of cached objects are allocated in advance; on caching, each object is placed in the smallest available buffer that can contain it. Effectively, any fragmentation issues are internalized by the cache, trading off some space in order to avoid any external fragmentation issues. As blocks generally converge around a single size, this method can still be quite space efficient.</p>
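<p>The allocation scheme described above can be illustrated with a small Python sketch. This is a toy model, not the HBase implementation: the class name <code>SlabCacheSketch</code>, the slab sizes, and the behaviour of simply refusing a block when no buffer is free are all assumptions made for illustration.</p>

```python
class SlabCacheSketch:
    """Toy slab allocator: one preallocated bytearray per size class."""

    def __init__(self, slab_sizes, buffers_per_slab):
        self.slab_sizes = sorted(slab_sizes)
        # All memory is reserved up front; caching never allocates new buffers.
        self.arenas = {
            size: bytearray(size * buffers_per_slab) for size in self.slab_sizes
        }
        self.free_slots = {
            size: list(range(buffers_per_slab)) for size in self.slab_sizes
        }
        self.index = {}  # key -> (slab size, slot number, payload length)

    def put(self, key, payload):
        """Store payload in the smallest free buffer that fits it.

        Returns False when no buffer can hold it (a simplification: a real
        cache would evict something instead).
        """
        if key in self.index:
            self.evict(key)
        for size in self.slab_sizes:
            if size >= len(payload) and self.free_slots[size]:
                slot = self.free_slots[size].pop()
                start = slot * size
                self.arenas[size][start:start + len(payload)] = payload
                self.index[key] = (size, slot, len(payload))
                return True
        return False

    def get(self, key):
        """Return a copy of the cached bytes, or None on a miss."""
        entry = self.index.get(key)
        if entry is None:
            return None
        size, slot, length = entry
        start = slot * size
        return bytes(self.arenas[size][start:start + length])

    def evict(self, key):
        """Release the buffer held by key so it can be reused."""
        size, slot, _ = self.index.pop(key)
        self.free_slots[size].append(slot)
```

<p>A 50-byte block stored in a cache built with <code>SlabCacheSketch([64, 128], 2)</code> lands in a 64-byte buffer and wastes 14 bytes. That wasted space is the internal fragmentation the cache accepts in exchange for never fragmenting the memory around it.</p>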
<h2>Implementation</h2>
<p>While slab allocation does not create fragmentation, other parts of HBase still can. With slab allocation, the frequency of stop-the-world (STW) pauses may be reduced, but the worst-case pause time isn&#8217;t &#8211; the JVM can still decide to move our entire slab around if we happen to be really unlucky, contributing again to significant pauses. In order to prevent this, SlabCache allocates its memory using <em>direct ByteBuffers</em>.</p>
<p>Direct ByteBuffers, available in the java.nio package, are allocated outside of the normal Java heap &#8212; just like using malloc() in a C program. The garbage collector will not move memory allocated in this fashion &#8211; guaranteeing that a direct ByteBuffer will never contribute to the maximum garbage collection time. The ByteBuffer “wrapper” is then registered as an ordinary object; when it is collected, the underlying memory is released back to the system using free().</p>
<p>Reads are performed using a copy-on-read approach. Every time HBase does a read from SlabCache, the data is copied out of the SlabCache and onto the heap. While passing by reference would have been the more performant solution, that would have required some way of carefully tracking references to these objects. I decided against reference counting, as it opens up the potential for an entirely new class of bugs, making continuing work on HBase more difficult. Solutions involving finalizers or reference queues were also discarded, as neither of them guarantees timely operation. In the future, I may decide to revisit reference counting if necessary to increase read speed.</p>
<p>SlabCache operates as an L2 cache, replacing the FS cache in this role. The on-heap cache is maintained as the L1 cache. This solution allows us to use large amounts of memory with a substantial improvement in speed and consistency over the status quo, while at the same time ameliorating the downsides of the copy-on-read approach. Because the vast majority of our hits will come from the on-heap L1 cache, we do a minimum of copying data and creating new objects.</p>
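<p>The interaction between the two levels can be sketched in Python as follows. This is an illustrative model only: the class name <code>TwoLevelCacheSketch</code>, the choice to write every block to both levels on <code>put</code>, and the <code>l2_copies</code> counter are assumptions, and a plain <code>bytearray</code> stands in for off-heap memory.</p>

```python
from collections import OrderedDict


class TwoLevelCacheSketch:
    """Small LRU cache (L1) in front of a larger byte-buffer store (L2)."""

    def __init__(self, l1_capacity):
        self.l1_capacity = l1_capacity
        self.l1 = OrderedDict()  # key -> bytes, most recently used last
        self.l2 = {}             # key -> bytearray standing in for off-heap memory
        self.l2_copies = 0       # number of reads that had to copy out of L2

    def put(self, key, payload):
        """Cache payload in both levels (an assumed write policy)."""
        self.l2[key] = bytearray(payload)
        self._promote(key, bytes(payload))

    def get(self, key):
        """Serve from L1 when possible; otherwise copy the block out of L2."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        buf = self.l2.get(key)
        if buf is None:
            return None
        # Copy-on-read: the caller never holds a reference into L2 memory,
        # so L2 buffers can be reused without tracking who still reads them.
        self.l2_copies += 1
        value = bytes(buf)
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        """Insert into L1 and evict least recently used entries beyond capacity."""
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)
```

<p>Repeated reads of a hot key are served from L1 and leave <code>l2_copies</code> unchanged, so the copy is paid only when a block has fallen out of the first level.</p>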
<h2>Performance</h2>
<p>SlabCache operates at around 3-4x the performance of the file system cache, and also provides more consistent performance.</p>
<p>Performance comparisons of the 3 caches follow. In each test, each cache was configured so that it was the primary (L1) and only cache of HBase. YCSB was then run against HBase-trunk.</p>
<p><a href="https://www.cloudera.com/wp-content/uploads/2011/11/chart_1.png"><img class="alignnone size-full wp-image-9616" src="https://www.cloudera.com/wp-content/uploads/2011/11/chart_1.png" alt="" width="600" height="371" /></a></p>
<p>HBase in all cases was running in Standalone mode, compiled against 0.20-append branch. As HDFS has gotten faster since the last release, I&#8217;ve also provided tests with the RawLocalFS, in order to isolate the difference between accessing the Slab cache and accessing the FS cache by removing HDFS from the equation. In this mode, CRC is turned off, and the local filesystem (ext3) is used directly. Even given these optimal conditions, SlabCache still nets a considerable performance gain.</p>
<p>If you&#8217;d like to try out this code in trunk, simply set MaxDirectMemorySize in hbase-env.sh. This will automatically configure the cache to use 95% of the MaxDirectMemorySize, and set reasonable defaults for the Slab Allocator. If finer control is desired, you can change the SlabCache settings in hbase-site.xml to adjust off-heap memory usage and slab allocation sizing.</p>
<h2>Conclusion</h2>
<p>If you&#8217;re running into read performance walls with HBase, and have extra memory to spare, then please give this feature a try! This is due to be released in HBase-0.92 as an experimental feature, and will hopefully enable the more efficient usage of memory.</p>
<p>I had an amazing summer working on this project, and as an intern, I&#8217;m awed to see this feature work and be released publicly. If you found this post interesting, and would like to work on problems like this, check out the <a href="http://www.cloudera.com/company/careers/">careers</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>How I found Hadoop</title>
		<link>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 13:00:44 +0000</pubDate>
		<dc:creator>Omer</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[guest]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[Finding Hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=10019</guid>
		<description><![CDATA[This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program. A year ago I rolled my first Hadoop system into production. Since then, I&#8217;ve spoken to quite a few people who are eager to try Hadoop themselves [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.</em></p>
<p>A year ago I rolled my first Hadoop system into production. Since then, I&#8217;ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to <a href="http://www.meetup.com/hadoopsf/" target="_blank">Hadoop Meetups in San Francisco</a>, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.</p>
<p>I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.</p>
<p>As I contemplated how doomed I would be should traffic levels increase much more, I remember wondering to myself, &#8220;How does Google deal with all that data?&#8221; The answer came to me in the form of the Google File System (GFS) paper and later the MapReduce paper, both from Google. It dawned on me that because Google was forced to solve a much larger problem, they had come up with an elegant solution for a whole range of more modest data problems running on commodity hardware. But it wouldn&#8217;t be until 2010 that I would get to work with this technology firsthand.</p>
<p>As I wrote in an <a href="http://www.cloudera.com/blog/2011/04/adopting-apache-hadoop-in-the-federal-government/" target="_blank">earlier article</a>, I started re-architecting USASearch, the U.S. government&#8217;s search system, in 2009 based on a solution stack of free, open source software including Ruby on Rails, Solr, and MySQL. A wave of d&#233;j&#224; vu hit me as I started worrying about what to do with the growing mountain of data piling up in MySQL and our increasing need to analyze it in different ways. I had heard that a new company called Cloudera, founded by some big data people from Yahoo!, Google, and Facebook, was making Hadoop available for the masses in a reliable distribution, much in the same way that Red Hat did for Linux. Curiosity got the best of me and I bought the newly minted <em>Hadoop: The Definitive Guide</em> from O&#8217;Reilly. The most insightful part of the book to me was the very first sentence. It&#8217;s a quote from Grace Hopper: &#8220;In pioneer days, they used oxen for heavy pulling, and when one ox couldn&#8217;t budge a log, they didn&#8217;t try to grow a bigger ox.&#8221; I didn&#8217;t want to grow a bigger server; I wanted to harness a bunch of small servers together to work in unison. The more I learned the more curious I got, so I started reading more. And that&#8217;s when I hit my first roadblock.</p>
<p>I think people who have been working with Hadoop technologies for years and years sometimes forget just how rich and diverse the big data software ecosystem has become, and how daunting it can be to folks approaching it for the first time. When people at the Meetups say they are evaluating solutions to their data scaling problem, the answers they hear sound something like this: &#8220;Just use Hadoop Hive Pig Mahout Avro HBase Cassandra Oozie Sqoop Flume ZooKeeper Cascading NoSQL RCFile. Oh, almost forgot&#8230;cloud.&#8221;</p>
<p>The thought of wading through all of that just to learn about what I needed to learn about was a bit too overwhelming for me, so I put the whole matter aside for a few months. Over time, I started to dive into each of these projects to understand the primary use case, how active the developer community was and which organizations were using it in production. I converged on the idea of using Hive as a warehouse for our data. I opted for <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank" title="Cloudera's Distribution including Apache Hadoop">Cloudera&#8217;s distribution</a> since I wanted to reduce the risk of running into compatibility issues between all the various subsystems. Having tracked down anomalies in a highly multi-threaded and contentious distributed Java system before, I liked the idea of someone else taking on that problem for me.</p>
<p>At some point, I had read everything I could read and grew impatient to get my hands dirty, so I decided to just download <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads"><abbr title="Cloudera's Distribution including Apache Hadoop 3">CDH3</abbr></a> on my laptop and give it a try. The <a href="https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial">tutorial</a> instructions for the standalone version worked, and I successfully computed more digits of pi than I ever thought I&#8217;d need. After creating some sample data in Hive and running a few queries, I felt pretty confident that Hive would be the right tool for the job. I just needed to find somewhere to install and run HDFS (namenode, secondary namenode, and data nodes), Hadoop (jobtracker and tasktracker nodes), Hive, and Hue for a nice front end to it all.</p>
<p>I knew from my past experience how to stretch the limits of CPU, disk, IO, and memory on commodity servers, and I identified a few potential servers at our primary datacenter with resources I figured I could leverage. Once again I followed the tutorial instructions, this time for the fully distributed version of CDH3, and once again I started to compute pi. And that&#8217;s when I hit my second roadblock. It took me a few days to figure out that I had a problem with DNS. Each machine needs to be able to resolve every other machine&#8217;s name and IP in the cluster. Whether you do that via /etc/hosts or a local DNS server is up to you, but it needs to happen or the whole thing gets wedged. Once I got that sorted out, everything just started falling into place and I had Hive working in production within a few days. A week later, I started pulling out the MySQL jobs and deleting big tables, and that&#8217;s been the trend ever since.</p>
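<p>A check like the following, run on each machine before starting any daemons, might have saved those few days. It is only a sketch: the function name <code>find_unresolvable</code> and the host names in the usage note are hypothetical, and it tests forward (name to IP) resolution only.</p>

```python
import socket


def find_unresolvable(hostnames, resolve=socket.gethostbyname):
    """Return {hostname: error message} for every name that fails to resolve.

    `resolve` defaults to the system resolver, which consults /etc/hosts and
    DNS; it is a parameter so the check can be exercised without a network.
    """
    failures = {}
    for name in hostnames:
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[name] = str(exc)
    return failures
```

<p>Calling <code>find_unresolvable(["namenode01", "datanode01", "datanode02"])</code> on every node and getting an empty dictionary back each time is a reasonable sign that name resolution will not be what wedges the cluster.</p>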
<p>Over time, I&#8217;ve gone on to learn about using custom Ruby mappers in Hive, moving data back and forth between MySQL and Hive with Sqoop, and getting the data into HDFS in real-time with Flume. All of these components from the <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank" title="Cloudera's Distribution including Apache Hadoop">Cloudera distribution</a> are working nicely in our production environment now, and I sleep well at night knowing I have such a solid, deliberate plan for growth. My initial investment in learning about the Hadoop ecosystem is really paying dividends, but when I think about all those people at the Meetups stuck in evaluation mode, I feel their pain. Does it have to be such a struggle?</p>
<p>The big challenge in my opinion is not that any one piece of the puzzle is too difficult. Any reasonably smart (or in my case stubborn) engineer can set themselves on the task of learning about a new technology once they know that it needs to be learned. The challenge with the Hadoop ecosystem is that it presents the newbie with the meta-problem of figuring out which of these tools are appropriate for their use case at all, and whether or not to even consider the problem today versus deferring it until later. In a way Facebook has it easy, because when you are adding 15TB of data per day, that decision is pretty much made for you.</p>
<p>For all the companies sitting in the twilight between the gigabyte and the petabyte who don&#8217;t have Hadoop expertise in-house, there is a collection of free information to help guide people to the right solution space (<a href="https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial">Hadoop Tutorial</a>, <a href="http://www.cloudera.com/resources/White+Paper/">White Papers</a>). These days, when I talk to people who are evaluating solutions to their big data problems, my advice to them is to break down their problems into a few discrete use cases and then work on ferreting out the technologies that are designed for that use case. Get a proof of concept to demonstrate that the technology can address your use case and convince yourself and others that you&#8217;re on the right track. Work toward putting something simple into production. Lather, rinse, and repeat. I am still in that cycle myself, as these days I&#8217;m exploring <a href="http://hbase.apache.org/" target="_blank">HBase</a> and <a href="http://opentsdb.net/" target="_blank">OpenTSDB</a> to give me low-latency access to time series data and <a href="http://mahout.apache.org/" target="_blank">Mahout</a> to do frequent item set mining, but that&#8217;s another article for another day.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/how-i-found-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>My Internship at Cloudera</title>
		<link>http://www.cloudera.com/blog/2011/12/my-internship-at-cloudera/</link>
		<comments>http://www.cloudera.com/blog/2011/12/my-internship-at-cloudera/#comments</comments>
		<pubDate>Tue, 20 Dec 2011 13:00:14 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Cloudera Internship]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9989</guid>
		<description><![CDATA[David joined us as part of our intern program, and built the prototype for the distributed log search functionality that&#8217;s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we&#8217;re pleased to publish. The project My intern project was [...]]]></description>
			<content:encoded><![CDATA[<p><em>David joined us as part of our <a href="http://www.cloudera.com/company/careers/" title="Careers">intern program</a>, and built the prototype for the distributed log search functionality that&#8217;s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we&#8217;re pleased to publish.</em></p>
<h2>The project</h2>
<p>My intern project was to build a log searching tool, specialized for Apache Hadoop. My mini-app allows Hadoop cluster admins and operators to search their error logs across many machines, filtering by time range, by text in the log message, or by machine (the namenode, for example). The results are then ordered by time, and shown to the user.</p>
<p>This project was inspired by the extreme wizardry required to search logs with traditional tools, such as grep and ssh (or parallel ssh), especially since these tools do not order the results by time. Ordering by time is very important, as it lets you triage the sources of failures across your cluster, and figure out where it all started.</p>
<h2>How do I feel about my project in retrospect?</h2>
<p>I had a ton of independence when it came to building my app. I was part of the Enterprise Team, which wrote Cloudera Manager, a tool that lets you deploy Hadoop to a whole cluster within a few clicks. I wrote the REST API available on the individual machines of the cluster. I wrote the master server code that makes requests in parallel to each of the cluster machines, asking for their search results. I designed and wrote the UI for the app, in addition to conceiving ways to make life easier for users who interact with it (with some help from user-testing, of course).</p>
<p>A couple of the niceties I added to the search page include:</p>
<ul>
<li>Search without page refresh.</li>
<li>Saves the current search&#8217;s options to the URL bar, so that if you send the URL to a fellow admin, they can run the exact same search and see exactly what you&#8217;re seeing. Or you can save the URL and re-run the search at a later date.</li>
<li>A context view that does the same thing as the above, but lets you browse a single log file, with pagination (a single day&#8217;s log file can get as big as 1GB, so it wouldn&#8217;t be a great idea to send it all to the client at the same time).</li>
</ul>
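<p>The shareable-URL idea in the list above comes down to a round trip between a set of search options and a query string. A minimal Python sketch of that round trip follows; the function names, option names and URL are made up for illustration, and the real UI did this in the browser.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(base, options):
    """Encode the search options into a URL that can be saved or shared."""
    return base + "?" + urlencode(sorted(options.items()))


def options_from_url(url):
    """Recover the options from a URL produced by search_url."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}
```

<p>Because every option lives in the URL, anyone who opens the link runs the same search with the same filters.</p>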
<p>In this process I learned more about Python, the difficulty that Python&#8217;s built-in date capabilities can cause, and how it can be quite fun to run code distributed across hundreds or thousands of machines.</p>
<p>I also spent some time profiling the internal Cloudera log searching library (written by <a href="http://www.linkedin.com/pub/adam-warrington/2/295/70">Adam Warrington</a>) which is the workhorse of the REST API (the master server communicates with its minions over HTTP). We were able to cut the <em>worst</em>-case run time on sample data by ~88%, which made me happy. During the process, I learned that, when possible, it&#8217;s best to meta-program other people&#8217;s code by asking them to make it faster, as the process of learning and reading all the code they&#8217;ve written can take some time, especially when you only need to make what you hope is a small change. It&#8217;s really great to arrive at work and hear that someone else has just finished coding up the optimizations your app needed.</p>
<h2>Technical Portion: how the log search feature works.</h2>
<p>Whenever you run a search, the main page of the search UI makes a request to a JSON endpoint, asking for log search results from, say, yesterday on all datanodes in the cluster. This request reaches the master server (SCM), which knows all about the machines in the cluster. The master server has a number of threads which make requests in parallel to each of the applicable cluster machines, each of which exposes a JSON endpoint. Each individual cluster machine then runs some python code that searches the relevant log files and returns the result as JSON. The master server collates the results from each cluster machine, and returns this to the browser. The results are then displayed to the user in what can sometimes be a <em>very</em> long list.</p>
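<p>The fan-out and collation step described above can be sketched in Python roughly as follows. This is not the Cloudera Manager code: <code>search_cluster</code> and the <code>fetch</code> callback are hypothetical names, and <code>fetch</code> stands in for the HTTP request to one machine&#8217;s JSON endpoint. Each machine is assumed to return its events already sorted by time.</p>

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_cluster(hosts, fetch, max_workers=8):
    """Query every host in parallel and merge the results into time order.

    `fetch(host)` must return a list of event dicts sorted by their "time" key.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_host = list(pool.map(fetch, hosts))
    # Every per-host list is already sorted, so a k-way merge is enough;
    # there is no need to re-sort the combined results.
    return list(heapq.merge(*per_host, key=lambda event: event["time"]))
```

<p>The thread pool keeps one slow machine from serializing the whole search, and the merge yields the single time-ordered list that the UI displays.</p>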
<p>We decided it would be best not to maintain an index on the log files, as there can be many terabytes of data to sift through. For this reason, searches are done on demand by each individual cluster machine. Searches which include a time range are quite fast, as binary search is used to find the relevant time range, and then the first 20 results are returned. We also made an effort to optimize keyword searches similar to <code>grep helloworld</code>: we first scan each raw line for the word, and skip the line without parsing it into an event if it does not contain <code>helloworld</code>. We made this optimization because parsing each log event into date, message, and source was quite slow when searching large files.</p>
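<p>Both optimizations can be illustrated on an in-memory list of log lines. The sketch below makes several assumptions that are not from the actual library: each line starts with a fixed-width, sortable ISO timestamp followed by a source and a message, the lines are already in time order, and the names <code>parse_event</code> and <code>search_log</code> are invented. The real code would have to binary search within files on disk.</p>

```python
from bisect import bisect_left


def parse_event(line):
    """Split a 'timestamp source message' line into a dict (assumed layout)."""
    timestamp, source, message = line.split(" ", 2)
    return {"time": timestamp, "source": source, "message": message}


def search_log(lines, start, end, word=None, limit=20):
    """Return up to `limit` parsed events from the window [start, end).

    `start` and `end` are timestamp strings in the same format as the lines.
    """
    # Lines sort by their leading timestamp, so binary search finds the
    # window without scanning the whole log.
    lo = bisect_left(lines, start)
    hi = bisect_left(lines, end)
    results = []
    for line in lines[lo:hi]:
        # Cheap substring test first; only matching lines pay for parsing.
        if word is not None and word not in line:
            continue
        results.append(parse_event(line))
        if len(results) == limit:
            break
    return results
```

<p>The substring test runs on the raw line, so a line that cannot match is rejected before any of the date or field parsing happens.</p>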
<p>Because I wrote the three components that make the search work (UI, JSON route on master server, JSON route on cluster machines), I got a good overview of many aspects of the code base.</p>
<h2>A brief overview of the indirectly dev-related skills I learned at Cloudera</h2>
<p>I&#8217;ve learned git really well, which I totally love now. I can rebase, cherry-pick, :/search, and reflog like the best of them. My git skills could be considered quite fetching among certain branches of society.</p>
<p>While here I&#8217;ve also had the opportunity to really flesh out my <a href="https://github.com/dtrejo/dotfiles">dotfiles</a>, especially my <code>.gitconfig</code> and my <code>.profile</code> (aka <code>.bashrc</code>).</p>
<p>I also got a real feel for Dojo while I&#8217;ve been here, and I&#8217;d say that my next choice of javascript toolkit will be much better informed because of this.</p>
<p>Code reviews kept my code quality up, helping me catch little things like comments that were no longer relevant. I have yet to master the art of spotting and veto-ing changes that will break what I&#8217;ve written. Maybe next year!</p>
<h2>Dev environment</h2>
<p>From the web developer&#8217;s point of view, there are a couple of times when you&#8217;d normally need to do a couple of extra alt-tabs and/or refreshes. Of course, I&#8217;m pampered because I&#8217;ve never had to build anything that goes into production in a massive and/or distributable way, as we do here at Cloudera with our management tools. Also, I&#8217;ve traditionally done my work with web-servers that don&#8217;t require one to compile source code.</p>
<p>I found the following things quite annoying about the dev environment:</p>
<ol>
<li>static files, when changed, would not show their changes upon refresh</li>
<li>.less files, an advanced and superior alternative to CSS, would not<br />
auto-recompile upon change</li>
</ol>
<p>A solution to the first one was found by a coworker. I fixed the second problem by writing a little node script that watches less files and recompiles when any of them change.</p>
<p>While I was at it, I also made it easy for devs to use <a href="http://aboutcode.net/vogue/">vogue</a>, which reloads stylesheets whenever they&#8217;ve changed, without requiring a page refresh. This further improves development, as pages can get quite heavy when in development mode, where every javascript file is loaded individually, and it&#8217;s nice to have CSS changes automatically reflected in the UI.</p>
<h2 id="thanks_cloudera">Thanks Cloudera!</h2>
<p>That&#8217;s about all I have on my mind when it comes to my internship. I learned a <em>ton</em>, enjoyed the free lunches a lot, as well as the 30 inch monitors. These things make a big difference, and also make me feel way cooler than some of my friends who don&#8217;t get these things.</p>
<p>So long Cloudera and thanks for all the fish! Now I&#8217;m off to another planet for a year in the world of academia!</p>
<p><a href="http://dtrejo.com/">David Trejo</a><br />
Software Engineer Intern<br />
Cloudera Summer &#8216;11 Enterprise Team<br />
Brown University Computer Science &#8216;13</p>
<p><a href="http://www.cloudera.com/company/careers/">Go to Cloudera Careers ></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/my-internship-at-cloudera/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hadoop World 2011: A Glimpse into Development</title>
		<link>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/</link>
		<comments>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 13:00:42 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[careers]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[Cloudera's Service and Configuration Manager]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[Connector]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[oozie]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>
		<category><![CDATA[training]]></category>
		<category><![CDATA[Use Case]]></category>
		<category><![CDATA[ZooKeeper]]></category>
		<category><![CDATA[hadoop conference]]></category>
		<category><![CDATA[hadoop event]]></category>
		<category><![CDATA[hadoop world]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9240</guid>
		<description><![CDATA[The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hadoopworld.com/"><img style="float: left; padding-right: 20px;" title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/08/hw_white2.gif" alt="" /></a></p>
<p>The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.</p>
<h2 style="font-size: 14pt; color: #344152;"><a href="http://www.hadoopworld.com/tracks/development-developers/" target="_blank">Preview of Development Track Sessions</a></h2>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Building Web Analytics Processing on Hadoop at CBS Interactive</span></a><br />
 <em>Michael Sun, CBS Interactive</em></p>
<p><strong>Abstract:</strong> CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack&#8212;the Extraction, Transformation and Loading framework we built based on Python and streaming, which is under review for open-source release&#8212;Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive achieved robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far).</p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Gateway: Cluster Virtualization Framework</span></a><br />
<em>Konstantin Shvachko, eBay</em></p>
<p><strong>Abstract:</strong> Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks &#8212; as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users&#8217; workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">SHERPASURFING &#8211; Open Source Cyber Security Solution</span></a><br />
<em>Wayne Wheeles, Novii Design</em></p>
<p><strong>Abstract:</strong> Every day billions of packets, both benign and malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Cloudera&#8217;s Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission at scale and at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools</span></a><br />
<em>Arvind Prabhakar, Cloudera<br />
Guy Harrison, Quest Software</em></p>
<p><strong>Abstract:</strong> As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. We&#8217;ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we&#8217;ll deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc. </p>
<p><a href="http://www.hadoopworld.com/sessions/" target="_blank"><span style="color: #4aa02c; font-weight: bold; font-size: 12pt;">Next Generation Apache Hadoop MapReduce</span></a><br />
<em>Mahadev Konar, Hortonworks</em></p>
<p><strong>Abstract:</strong> The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of MapReduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1. </p>
<p><a href="http://www.hadoopworld.com/"><img title="Register for Hadoop World" src="https://www.cloudera.com/wp-content/uploads/2010/12/registernow.gif" alt="Register for Hadoop World" /></a></p>
<p>There are several <a href="http://www.hadoopworld.com/training/">training classes</a> and <a href="http://www.hadoopworld.com/training/">certification sessions</a> provided surrounding the Hadoop World conference. Don&#8217;t forget to register and become <a href="http://www.hadoopworld.com/training/">Cloudera Certified in Apache Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/hadoop-world-2011-a-glimpse-into-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Summer Internship at Cloudera</title>
		<link>http://www.cloudera.com/blog/2011/10/my-summer-internship-at-cloudera/</link>
		<comments>http://www.cloudera.com/blog/2011/10/my-summer-internship-at-cloudera/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 13:00:24 +0000</pubDate>
		<dc:creator>Omer</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[Cloudera careers]]></category>
		<category><![CDATA[Cloudera Internship]]></category>
		<category><![CDATA[cloudera jobs]]></category>
		<category><![CDATA[Hadoop Internship]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8961</guid>
		<description><![CDATA[This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011. When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.</em></p>
<p>When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.</p>
<p>My project was to develop a unified view of our customer data. The requirements were simple: pull in data from a variety of systems, group it by customer, and display it. The goal is that when someone at Cloudera needs to see all of the key information about our customers, it is available in one place. In addition, downloading and grouping data will make performing analysis much easier, allowing us to draw new insights about our business and our customers.</p>
<p>I started by writing a script for each data source to download the necessary data and insert it into an HBase table in raw form. Next I wrote a script for each data source that grouped the data by customer, possibly transformed the data (filtering, sorting, inserting child objects within their parents, etc) and inserted it into a separate HBase table where each row corresponds to a single customer, with a column family for each data source. Finally, I exposed the data on an internal website using Django, integrating the different data sources as much as possible.</p>
<p>One of the big challenges was turning the raw data into meaningful information. There were several discrepancies I needed to address with the data. As an example, companies have different names in different systems. Sometimes the difference is simply a matter of capitalization and/or spacing, but a greater challenge is that sometimes abbreviations were used in one system but not another, or one system ended the name with &#8220;, inc.&#8221; but another did not. I considered using fuzzy matching to solve all of these problems and realized that the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> between &#8220;Cloudera&#8221; and &#8220;Cloudera, inc.&#8221; is quite high, so I started looking at other forms of fuzzy matching and thinking of developing one that in particular favored long identical sub-strings. For example, I wanted my algorithm to see &#8220;Cloudera&#8221; and &#8220;Cloudera, inc.&#8221; as being highly similar for sharing the whole &#8220;Cloudera&#8221; part. As I contemplated embarking on a task to which I could have easily devoted the whole summer, I realized that I was heading down a rabbit hole. I took a step back and determined that trying to solve this problem in a fully automated way was not worth my time. It would have been time consuming, and it would have only made my problems worse; I still would have had to deal with names that should be merged but weren&#8217;t (since no scheme could perfectly determine if two names represent the same customer), but I also would have had to worry about names that shouldn&#8217;t have been merged but were. Why would I devote time to building a complex matching algorithm that doubled the number of problems I had to deal with?</p>
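<p>To make the comparison concrete, here is a minimal Python sketch. It is not code from the project: <code>levenshtein</code> is the textbook edit distance, and <code>longest_shared_run_score</code> is a hypothetical score, invented for this illustration, of the kind the paragraph above describes (one that rewards a long identical sub-string). It leans on <code>difflib.SequenceMatcher</code> from the Python standard library.</p>

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def longest_shared_run_score(a, b):
    """Hypothetical score: longest common substring length over the shorter name's length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


print(levenshtein("Cloudera", "Cloudera, inc."))               # 6
print(longest_shared_run_score("Cloudera", "Cloudera, inc."))  # 1.0
```

<p>Six edits against an eight-character name looks like a poor match to an edit-distance threshold, while the sub-string score treats the two names as identical. A score like this would still merge names that merely share a long prefix, which is the false-positive problem described above.</p>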
<p>Instead, I created an alias table in HBase. The key is the customer name, with white-space removed and letters lower-cased to catch the easiest cases. One of the columns contains a UUID that is used as the key for that customer throughout the rest of the system. When my transform scripts move data from the raw table to the table where each row is a customer, they use the alias table to determine into which row to insert the transformed data. When my code merges two customers, it merges the current data and makes all alias entries that were pointing to either row point to the newly merged row, so that when the scripts next load new data into the table, they put it directly into the correct location. This approach does require manual intervention (in practice, all schemes were going to), but at least it was simple. This was an important lesson from my internship; I learned that some problems aren&#8217;t worth solving.</p>
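<p>The alias scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project&#8217;s code: plain dictionaries stand in for the HBase tables, the class and method names are hypothetical, and giving the merged row a fresh UUID is a guess, since the post does not say which key the merged row keeps.</p>

```python
import uuid


def normalize(name):
    """Strip all whitespace and lower-case, as the alias key is described."""
    return "".join(name.split()).lower()


class AliasTable:
    """In-memory stand-in for the HBase alias and customer tables (an assumption)."""

    def __init__(self):
        self.aliases = {}    # maps normalized name to customer UUID
        self.customers = {}  # maps customer UUID to {data source: [records]}

    def resolve(self, name):
        """Return the customer UUID for a name, creating a new customer if unseen."""
        key = normalize(name)
        if key not in self.aliases:
            customer_id = str(uuid.uuid4())
            self.aliases[key] = customer_id
            self.customers[customer_id] = {}
        return self.aliases[key]

    def insert(self, name, source, record):
        """Route a record to its customer row, as a transform script would."""
        row = self.customers[self.resolve(name)]
        row.setdefault(source, []).append(record)

    def merge(self, id_a, id_b):
        """Manually merge two customers and repoint every alias at the merged row."""
        if id_a == id_b:
            return id_a
        merged_id = str(uuid.uuid4())  # assumption: the merged row gets a new key
        merged = {}
        for old_id in (id_a, id_b):
            for source, records in self.customers.pop(old_id).items():
                merged.setdefault(source, []).extend(records)
        self.customers[merged_id] = merged
        for key, customer_id in list(self.aliases.items()):
            if customer_id in (id_a, id_b):
                self.aliases[key] = merged_id
        return merged_id


table = AliasTable()
table.insert("Cloudera", "crm", {"owner": "alice"})
table.insert("cloud era", "support", {"ticket": 1})        # same key after normalizing
table.insert("Cloudera, inc.", "billing", {"invoice": 7})  # different key, needs a manual merge
merged = table.merge(table.resolve("Cloudera"), table.resolve("Cloudera, inc."))
table.insert("Cloudera, inc.", "billing", {"invoice": 8})  # now lands in the merged row
print(sorted(table.customers[merged]))
```

<p>After the merge, both spellings resolve to the same row, so later loads need no further manual work for that customer.</p>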
<p>Another major issue I had to tackle was cases where data was incorrect or incomplete. Our opportunity data, for example, had various fields that were not used when the system was first configured. For example, contract terms were always a set period of time. Some fields, such as the product quantity, were changed, so older records had a value but in different units than the units used in new records. For this tricky data, rather than simply reading the values directly, I wrote helper methods to return the value, sometimes trying 4 or 5 different ways to infer the actual value. For the contract end date, for example, I had to base the value on the close date about half the time. In this case the helper method returned a tuple of the value that it was trying to infer (the end date) and a Boolean value representing whether the value returned was explicitly specified or inferred from another field. In the web view I used the explicit or inferred flag to add an annotation in the UI indicating when the value was approximated. Users could then look elsewhere if precision was important. This annotation also exposes where source data is missing fields, which can help us update the source data.</p>
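<p>A helper of this shape might look like the following Python sketch. The field names (<code>contract_end_date</code>, <code>close_date</code>) and the 365-day default term are assumptions made for illustration; the post does not give the real field names or the actual contract length.</p>

```python
from datetime import date, timedelta

# Assumed default term; the post only says early contracts ran for a set period.
DEFAULT_TERM = timedelta(days=365)


def contract_end_date(opportunity):
    """Return (end_date, inferred); inferred is False only when the field was explicit."""
    explicit = opportunity.get("contract_end_date")
    if explicit is not None:
        return explicit, False
    close_date = opportunity.get("close_date")
    if close_date is not None:
        # Fall back to the close date plus the assumed fixed term.
        return close_date + DEFAULT_TERM, True
    return None, True


def render_end_date(opportunity):
    """Format the value for display, flagging approximations as the web view did."""
    end_date, inferred = contract_end_date(opportunity)
    if end_date is None:
        return "unknown"
    return end_date.isoformat() + (" (approximate)" if inferred else "")


print(render_end_date({"contract_end_date": date(2012, 6, 30)}))  # 2012-06-30
print(render_end_date({"close_date": date(2011, 1, 15)}))         # 2012-01-15 (approximate)
```

<p>Returning the flag alongside the value keeps the inference logic in one place and lets the display layer decide how to warn the reader.</p>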
<p>Businesses are always changing, so the data that a business decided to keep track of two years ago may not be the same data that makes sense to keep track of today. Many of the values I was trying to infer from data were from fields that we added to our system. The old data still lacked meaningful values because no one had gone back and fixed all of the past data. Each data source having a &#8220;name&#8221; field seems great until you realize that none of the names are the same, and having a field representing the exact information you want is great until you realize that it&#8217;s null in 40% of cases.</p>
<p>Another challenge was that I needed to interact with many different APIs, each with its own quirks, and similarly, I had to use various libraries to parse the different kinds of data and handle the transformations. I always expected APIs and libraries to be perfectly documented and to be designed with my use case in mind. This was rarely the reality, so these tasks frequently took much longer than I expected. I knew from past experience that building good software always takes longer than planned, but it seemed even more true this summer. I realized that one factor contributing to this was that my whole project was centered around touching as many different &#8220;things&#8221; as possible. Each time I integrated with a new library, API, or existing codebase, there was always an additional cost of figuring out how to approach it. Additionally, there was always a chance that some aspect of the new process would not work as advertised or not provide a direct, optimized way for me to do what I needed. Significantly underestimating the difficulty of every single piece of my project helped me improve my ability to make those estimations.</p>
<p>Overall, my internship at Cloudera was amazing. I got to work with very smart people building and shipping high quality software using HBase. I got to sit next to, eat lunch with, and hear internal talks by people working on an array of fascinating things, happy to share knowledge and advice. I saw how a software company operates and caught glimpses into how diverse companies&#8212;Cloudera&#8217;s customers&#8212;operate. At the beginning of my summer I&#8217;d thought my project was going to be a series of unexciting tasks, many of which I&#8217;ve done before. As it turned out I encountered some very interesting problems and a lot of good lessons to carry forward.</p>
<p>Find available opportunities via the <a href="http://www.cloudera.com/company/careers/">Cloudera Careers web page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/10/my-summer-internship-at-cloudera/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera in The Cube with Silicon Angle TV at Strata Conference 2011</title>
		<link>http://www.cloudera.com/blog/2011/02/cloudera-in-the-cube-with-silicon-angle-tv-at-strata-conference-2011/</link>
		<comments>http://www.cloudera.com/blog/2011/02/cloudera-in-the-cube-with-silicon-angle-tv-at-strata-conference-2011/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 23:13:47 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop video]]></category>
		<category><![CDATA[strata conference hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6491</guid>
		<description><![CDATA[The consensus from the Cloudera attendees of the O&#8217;Reilly Strata Conference last week was that the data-focused conference was nearly pitch perfect for the data scientist, practitioners and enthusiast who attended the event. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees and was well run [...]]]></description>
			<content:encoded><![CDATA[<p>The consensus from the Cloudera attendees of the O&#8217;Reilly Strata Conference last week was that the data-focused conference was nearly pitch perfect for the data scientists, practitioners and enthusiasts who attended the event. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees and was well run in general.</p>
<p>One of the cool activities happening at the conference was live streaming video brought to us from the good folks at <a href="http://www.siliconangle.tv">SiliconAngle</a>. Using a mobile production system called The Cube, Silicon Angle hosts John Furrier (<a href="http://twitter.com/#!/furrier">@furrier</a>) and Dave Vellante interviewed industry luminaries and up and comers while bringing their own perspective. After streaming live for nearly two days these hosts are still able to keep the energy high and the tone light.</p>
<p>In the videos below, John and Dave interview Amr Awadallah, CTO and Co-Founder of Cloudera (<a href="http://twitter.com/#!/awadallah">@awadallah</a>), and John Kreisa, VP Marketing at Cloudera (<a href="http://twitter.com/#!/marked_man">@marked_man</a>), and then Sarah Sproehnle, Director of Education at Cloudera. During the interviews they cover many different aspects of Cloudera and Apache Hadoop.</p>
<h2><u>Interview 1</u></h2>
<p><object type="application/x-shockwave-flash" height="385" width="640" id="clip_embed_player_flash" data="http://www.justin.tv/widgets/archive_embed_player.swf" bgcolor="#000000"><param name="movie" value="http://www.justin.tv/widgets/archive_embed_player.swf" /><param name="allowScriptAccess" value="always" /><param name="allowNetworking" value="all" /><param name="allowFullScreen" value="true" /><param name="flashvars" value="start_volume=25&#038;title=Strata Conference - It's not Big Data - Just Data - Interview with Cloudera About Hadoop&#038;channel=nicefishfilms&#038;archive_id=278977743&#038;consumer_key=4fuaMvjaiK4BDHOkwHgk1A" /></object></p>
<h2><u>Interview 2</u></h2>
<p><object type="application/x-shockwave-flash" height="385" width="640" id="clip_embed_player_flash" data="http://www.justin.tv/widgets/archive_embed_player.swf" bgcolor="#000000"><param name="movie" value="http://www.justin.tv/widgets/archive_embed_player.swf" /><param name="allowScriptAccess" value="always" /><param name="allowNetworking" value="all" /><param name="allowFullScreen" value="true" /><param name="flashvars" value="start_volume=25&#038;title=Sarah Sproehnle of Cloudera talks Hadoop.&#038;channel=nicefishfilms&#038;archive_id=279408039&#038;consumer_key=4fuaMvjaiK4BDHOkwHgk1A" /></object></p>
<style>
#columns {
}
#columns .column {
  position: relative;
  width: 310px;
  background: #FFF;
  padding: 10px;
  border: 1px solid #DDD;
}
#columns .column h2 {
  margin-top: 0;
}
</style>
<div id="columns" class="clearfix">
<div style="float:right" class="column right-column">
<h2><u>Interview 2</u></h2>
<p><b>Interviewee:</b><br />
Sarah Sproehnle, Cloudera Director of Education</p>
<p><b>Interviewers:</b><br />
John Furrier, SiliconAngle<br />
Dave Vellante, Wikibon.org</p>
<p><b>The items discussed include:</b></p>
<ul>
<li>Yahoo!&#8217;s Hadoop Distribution</li>
<li><a href="http://www.cloudera.com/hadoop-training">Cloudera Hadoop Training</a>: training demand, who&#8217;s training &#038; why</li>
<li>Cloudera Competition</li>
</ul>
</div>
<div style="float:left" class="column left-column">
<h2><u>Interview 1</u></h2>
<p><b>Interviewees:</b><br />
Amr Awadallah, Cloudera CTO &#038; Co-Founder<br />
John Kreisa, Cloudera VP Marketing</p>
<p><b>Interviewers:</b><br />
John Furrier, SiliconAngle<br />
Dave Vellante, Wikibon.org</p>
<p><b>The items discussed include:</b></p>
<ul>
<li>What is large-scale data?</li>
<li>Why adopt Hadoop?</li>
<li>Scalability of Hadoop</li>
<li>Reasons for Cloudera and Hadoop popularity</li>
<li>Hadoop distribution</li>
<li>Cloudera Competition</li>
<li>Advice to Entrepreneurs</li>
<li>Applications built on top of <a href="http://www.cloudera.com/">Cloudera&#8217;s Distribution for Apache Hadoop</a></li>
<li>HBase</li>
<li>&#8220;Use the right tool for the right job&#8221;</li>
<li>Benefits of Open Source preventing lock-in</li>
<li>Cloudera Customer, Tynt, use case</li>
<li>Hadoop industries</li>
<li>&#8220;Data Scientist&#8221; road map</li>
<li>Cloudera evangelism </li>
<li>Upcoming <a href="http://www.cloudera.com/company/events/">Cloudera events</a></li>
</ul>
</div>
<div class="fake" style="clear:both">&nbsp;</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/02/cloudera-in-the-cube-with-silicon-angle-tv-at-strata-conference-2011/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Top 10 Blog Posts of 2010</title>
		<link>http://www.cloudera.com/blog/2011/01/top-10-blog-posts-of-2010/</link>
		<comments>http://www.cloudera.com/blog/2011/01/top-10-blog-posts-of-2010/#comments</comments>
		<pubDate>Wed, 19 Jan 2011 14:00:27 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[cloudera blog]]></category>
		<category><![CDATA[cloudera jobs]]></category>
		<category><![CDATA[hadoop blog]]></category>
		<category><![CDATA[hadoop information]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6159</guid>
		<description><![CDATA[We blogged about 104 different topics in 2010 and we recently decided to take a look back and see what folks were most interested in reading. &#160;The topics that were featured ranged from Cloudera&#8217;s Distribution for Apache Hadoop technical updates (CDH3b3 being the most recent) to highlighting upcoming Hadoop related events and activities to sharing [...]]]></description>
			<content:encoded><![CDATA[<p>We blogged about 104 different topics in 2010 and we recently decided to take a look back and see what folks were most interested in reading. &#160;The topics that were featured ranged from <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution for Apache Hadoop</a> technical updates (<a href="http://www.cloudera.com/blog/2010/08/cdh3b2-release-recap/">CDH3b3 being the most recent</a>) to highlighting upcoming Hadoop related events and activities to sharing practical insights for implementing Hadoop. We also featured a number of guest blog posts.</p>
<p><strong>Here are the</strong> <a href="http://www.cloudera.com/blog/"><em><strong>top 10 blog posts from 2010</strong></em></a>:</p>
<ol style="margin-left:20px">
<li><a href="http://goo.gl/JMyY3">How to Get a Job at Cloudera</a><br />
<em>Cloudera is hiring around the clock, and this blog highlights the best course of action to increase your chances of becoming a Clouderan.</em></li>
<li><a href="http://goo.gl/irXtA">Why Europe&#8217;s Largest Ad Targeting Platform Uses Hadoop</a><br />
<em>&#8220;As data volumes increased and performance suffered, we recognized a new approach was needed (Hadoop).&#8221; &#8211;Richard Hutton, Nugg.ad CTO </em></li>
<li><a href="http://goo.gl/aTYOH">What&#8217;s New in CDH3b2 Flume</a><br />
<em>Flume, our data movement platform, was introduced to the world and into the open source environment. </em></li>
<li><a href="http://goo.gl/T3Jgb">What&#8217;s New in CDH3b2 Hue</a><br />
<em>Hue, a web UI for Hadoop, is a suite of web applications as well as a platform for building custom applications with a nice UI library.</em></li>
<li><a href="http://goo.gl/YN5EF">Natural Language Processing with Hadoop and Python</a><br />
<em>Data volumes are increasing naturally from text (blogs) and speech (YouTube videos) posing new questions for Natural Language Processing. This involves making sense of lots of data in different forms and extracting useful insights.</em></li>
<li><a href="http://goo.gl/tnsIa">How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store</a><br />
<em>Raytheon BBN Technologies built a cloud-based triple-store technology, known as SHARD, to address scalability issues in the processing and analysis of Semantic Web data.</em></li>
<li><a href="http://goo.gl/2T1lx">Cloudera&#8217;s Support Team Shares Some Basic Hardware Recommendations</a><br />
<em>The Cloudera support team discusses workload evaluation and the critical role it plays in hardware selection.</em></li>
<li><a href="http://goo.gl/ZVrDQ">Integrating Hive and HBase</a><br />
<em>Facebook explains integrating Hive and HBase to keep their warehouse up to date with the latest information published by users.</em></li>
<li><a href="http://goo.gl/vAnBe">Pushing the Limits of Distributed Processing</a><br />
<em>Google built a 100,000 node Hadoop cluster running on Nexus One mobile phone hardware and powered by Android. The environmental cost of this solution is 1/100<sup>th</sup> the equivalent of running it within their data center. (April Fools)</em></li>
<li><a href="http://goo.gl/xvI76">Using Flume to Collect Apache 2 Web Server Logs</a><br />
<em>This post presents the common use case of using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.</em></li>
</ol>
<p>Aside from <em>How to Get a Job at Cloudera, </em><a href="http://www.cloudera.com/"><span style="color:#505050">Cloudera</span></a> blog readers viewed posts related to <a href="http://www.cloudera.com/hadoop/">CDH and its components</a>, posts exemplifying possibilities with Hadoop in production, and posts highlighting integrations with <a href="http://www.cloudera.com/hadoop/"><span style="color:#505050">Hadoop</span></a>.</p>
<p>Looking forward we plan to continue to feature technical and non-technical topics, as well as guest posts from customers and the community, and plan to increase our number of published posts. If there is a topic you would like to learn more about, or you have a <a href="http://www.cloudera.com/hadoop/"><span style="color:#505050">Hadoop</span></a> story you would like to share we would love to hear your ideas. Email suggestions to <a href="mailto:community@cloudera.com">community@cloudera.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/top-10-blog-posts-of-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Fun &amp; Frightful Halloween Festivities</title>
		<link>http://www.cloudera.com/blog/2010/11/5226/</link>
		<comments>http://www.cloudera.com/blog/2010/11/5226/#comments</comments>
		<pubDate>Mon, 01 Nov 2010 22:22:46 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5226</guid>
		<description><![CDATA[Here at Cloudera we embraced the holiday spirit with the light heartedness that is Halloween by hosting several activities including an engineering hack-a-thon, a hack-a-pumpkin-a-thon, and a costume competition. Cloudera Corporate &#38; a chicken, aka Cloudera engineers at their finest. Hadoop came and joined the festivities dressed in a very realistic pumpkin costume. To keep [...]]]></description>
			<content:encoded><![CDATA[<p>Here at Cloudera we embraced the holiday spirit with the lightheartedness that is Halloween by hosting several activities including an engineering hack-a-thon, a hack-a-pumpkin-a-thon, and a costume competition.</p>
<p style="text-align: center;"><a href="http://www.cloudera.com/wp-content/uploads/2010/11/IMG_0141.jpg"><img class="size-full wp-image-5227  aligncenter" title="IMG_0141" src="http://www.cloudera.com/wp-content/uploads/2010/11/IMG_0141.jpg" alt="" width="450" align="aligncenter" /></a></p>
<p style="text-align: center;">Cloudera Corporate &amp; a chicken, aka Cloudera engineers at their finest.</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2010/11/IMG_0143.jpg"><img class="size-full wp-image-5230 alignnone" style="float: left; margin-right: 10px; margin-top: 5px;" title="IMG_0143" src="http://www.cloudera.com/wp-content/uploads/2010/11/IMG_0143.jpg" alt="Hadoop" width="100" /></a></p>
<p>Hadoop came and joined the festivities dressed in a very realistic pumpkin costume. To keep Hadoop&#8217;s pumpkin costume up to earthquake code specifications, several paper-clips were deemed necessary in the pumpkin&#8217;s structure.</p>
<p>The Cloudera fun continued into the weekend as part of the Cloudera team gathered October 30th in Hollister, CA to take part in the Northern California <a href="http://warriordash.com">Warrior Dash</a>.</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2010/11/WarriorDash-muddy.jpg"><img class="aligncenter size-full wp-image-5236" title="WarriorDash-muddy" src="http://www.cloudera.com/wp-content/uploads/2010/11/WarriorDash-muddy.jpg" alt="Post Warrior Dash" width="450" /></a></p>
<p>Cloudera has a very friendly and accepting atmosphere, filled with great people who are passionate about their work. If this sounds like an environment you would like to be a part of, there are numerous job openings here at headquarters (Palo Alto), along with rumors of a San Francisco office coming in the very near future. Check out the <a href="http://www.cloudera.com/company/careers/">careers</a> page to learn more.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/11/5226/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What is in our Kitchen?</title>
		<link>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/</link>
		<comments>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/#comments</comments>
		<pubDate>Tue, 21 Sep 2010 06:22:09 +0000</pubDate>
		<dc:creator>Chad Metcalf</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[kitchen]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4810</guid>
		<description><![CDATA[If there is one thing that chefs are proud of, it&#8217;s their kitchens. Whether cavernous top-of-the-line affairs or cramped New York apartments, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens.&#160; [...]]]></description>
			<content:encoded><![CDATA[<p>If there is one thing that chefs are proud of, it&#8217;s their kitchens. Whether <a href="http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/09/10/FDIP1F26JG.DTL">cavernous top-of-the-line affairs</a> or <a href="http://well.blogs.nytimes.com/2008/11/20/mark-bittmans-bad-kitchen/">cramped New York apartments</a>, kitchens are the place where raw ingredients are combined with talent and hard work to produce results. The only difference in the world of software is what you will find in our kitchens.&#160;<span id="more-4810"></span> In an <a rel="nofollow" href="http://news.cnet.com/8301-30684_3-10309375-265.html">interview</a> with CNET, Google&#8217;s Hal Varian attributed Google&#8217;s success to the &#8220;kitchen&#8221; in which their products are developed:<a href="/wp-content/uploads/2010/09/StockPots_0673.jpg"><img class="size-medium wp-image-4827" align="right" src="/wp-content/uploads/2010/09/StockPots_0673-300x198.jpg" alt="" style="margin:15px;margin-right:0" width="300" height="198" /></a></p>
<blockquote><p>&#8220;I also think we have a better kitchen. We&#8217;ve put a lot of effort into  building a really powerful infrastructure at Google, the development  environment at Google is very good.&#8221;</p></blockquote>
<p>The goal of the Kitchen team at Cloudera is to create a powerful  infrastructure for developing, building, testing,  shipping, and supporting our software. Kitchen contributes its expertise to every product Cloudera builds, while also building  out new infrastructure and tools to facilitate future development. Everyone on the Kitchen team writes software. </p>
<p>While the Kitchen team&#8217;s culture was initially inspired by Google&#8217;s infrastructure, we agree with Piaw Na, who recently <a href="http://piaw.blogspot.com/2010/04/infrastructure.html">provided some words of caution</a> for companies looking to follow this example:</p>
<blockquote><p> &#8220;In short, I think startups have to be very careful about building generic infrastructure just because that&#8217;s the way Google did things.&#8221;</p></blockquote>
<p>The Kitchen team builds the infrastructure that is needed to solve our company&#8217;s problems. For example, our build system must be capable of coalescing many disparate open source projects into a unified platform. If there is an existing open source tool or framework that meets our needs, we use it, improve it, and contribute our improvements back to the project rather than &#8220;rolling our own.&#8221;</p>
<p>We use many of the open source tools you might expect, such as <a href="http://hudson-ci.org/">Hudson</a> for continuous integration. Our Hudson instance manages tens of hosts running over seventy projects:</p>
<ul>
<li>Unit tests running on every commit, across multiple platforms, and flavors of Java or Python</li>
<li>Hadoop clusters running on EC2 using <a href="http://incubator.apache.org/projects/whirr.html">Apache Whirr</a></li>
<li>Various code improvement tools such as <a href="http://www.jcarder.org/">jcarder</a>, <a href="http://cobertura.sourceforge.net/">Cobertura</a>, <a href="http://www.atlassian.com/software/clover/">Clover</a>, <a href="http://findbugs.sourceforge.net/">FindBugs</a>, <a href="http://checkstyle.sourceforge.net/">CheckStyle</a> and others</li>
</ul>
<p>If a tool does not exist, the Kitchen team tries to leverage existing frameworks to build what is required. For example, our automated build and release system, which is at the heart of the <a href="../blog/2010/08/cdh3b2-release-recap/">Cloudera Distribution for Hadoop (CDH)</a> platform, is built on top of <a href="http://code.google.com/p/boto/">boto</a>. From a single git repository, we use <a href="http://github.com/cloudera/crepo">crepo</a> (another Kitchen project) to check out the latest source of each project within CDH. Then we build source artifacts for all of the projects, which get uploaded to S3. We then spin up an EC2 cluster to build everything for all supported CentOS, Ubuntu, and Debian releases, including both 32- and 64-bit architectures. The resulting packages are stored back in S3, and then staged to a fresh EC2 instance of <a href="http://archive.cloudera.com">archive.cloudera.com</a> for testing. Additional EC2 instances follow and run end-to-end package tests for each package that was built. We turn the crank nightly, not just for each release.</p>
<p>The Kitchen team is in the process of building a <a href="http://culturedcode.com/status/">status</a>, <a href="http://markcipolla.com/hudson-global-dashboard/">dashboard</a>, <a href="http://twitpic.com/fhjlw">radiator</a>, <a href="http://www.panic.com/blog/2010/03/the-panic-status-board/">single-pane-of-glass</a> to <a href="http://www.samsung.com/us/consumer/professional-displays/professional-displays/lcd/LH46MRTLBC/ZA/index.idx?pagetype=prd_detail">prominently display</a> Hudson&#8217;s status, nightly builds, JIRA stats, CDH download statistics, and many other metrics we use daily.</p>
<p>No software company is complete without a cluster or two. Kitchen maintains a development cluster, a long-lived CDH cluster, a security-enabled CDH cluster, and a &#8220;dog-food&#8221; cluster. We&#8217;re currently building out a <a title="Eucalyptus" href="http://www.eucalyptus.com/">Eucalyptus</a> cluster so we can also run our build and test infrastructure in-house. We have a large-scale cluster in the works, and we are busy building out our infrastructure to accommodate it.&#160; We use <a href="https://fedorahosted.org/cobbler/">Cobbler</a>, run <a href="http://ganglia.sourceforge.net/">Ganglia</a> (bias alert: we employ one of the original authors), and debate <a href="http://www.opscode.com/chef">Chef</a> and <a href="http://www.puppetlabs.com/">Puppet</a>.</p>
<p>Our Kitchen team is growing. If this sounds like a team you would like to be a part of, get in touch with me on <a title="Cloudera Twitter" href="http://twitter.com/metcalfc">Twitter</a> or IRC (#cloudera on freenode.net) or <a href="http://www.cloudera.com/company/careers/">apply directly</a>. Stay tuned for more blog posts about what&#8217;s cooking in our Kitchen.</p>
<p><em>Image courtesy of Chef Olive at <a href="http://kitchenonfire.com/">Kitchen On Fire</a> </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/09/what-is-in-our-kitchen/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to Get a Job at Cloudera</title>
		<link>http://www.cloudera.com/blog/2010/07/how-to-get-a-job-at-cloudera/</link>
		<comments>http://www.cloudera.com/blog/2010/07/how-to-get-a-job-at-cloudera/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 23:44:53 +0000</pubDate>
		<dc:creator>Mike Olson</dc:creator>
				<category><![CDATA[careers]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=4199</guid>
		<description><![CDATA[We&#8217;re doing a lot of hiring at Cloudera &#8212; we have jobs open in operations, sales, engineering and elsewhere. Hiring well is hard work. We spend a lot of time on it, and have learned a lot about the kind of people we want to bring in. One of the best ways for us to [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re doing a lot of hiring at Cloudera &#8212; we have <a href="http://www.cloudera.com/company/careers/">jobs open</a> in operations, sales, engineering and elsewhere. Hiring well is hard work. We spend a lot of time on it, and have learned a lot about the kind of people we want to bring in. One of the best ways for us to do a good job of hiring is to help you do a good job of applying for a job here.</p>
<p>I&#8217;ll begin the post, though, by telling you what doesn&#8217;t work. Several times a day, we get an unsolicited email or phone message from a contingency recruiter like this one:</p>
<blockquote><p><em>I specialize in the industry and wanted to contact you to let you know that I have a strong candidate for your [deleted] position, and wanted to know if you would like to review the resume that I have? My candidate is interested in interviewing as soon as possible.</em></p></blockquote>
<p>If you&#8217;re that candidate, bad news: Your resume goes straight to the bottom of the pile, and it&#8217;s a big pile.</p>
<p>Contingency recruiters get paid by the hiring company when the candidate they introduce gets the job. The amount they get paid varies a little bit, but is generally <b>a quarter to a third of the annual salary of the person they place</b>. That means that a software developer who earns $80K a year shows up with a $20,000 price tag, minimum, due and payable to the recruiter, up front. If you&#8217;re that software developer, you need to be not merely as good as the rest of the candidates we&#8217;re looking at. It&#8217;s not even enough to be a little bit better. You need to be so astonishingly good that you&#8217;re worth our writing a fat check to a third party on your very first day in the office. And, honestly, if you&#8217;re that good, how come we don&#8217;t know about you already?</p>
<p>Now, we do &#8212; occasionally, for a few positions that are especially tough to fill &#8212; work with contingency search firms. If a recruiter talks to you about Cloudera, you should ask: Do you have a signed engagement letter with Cloudera already? If the answer is yes, then you&#8217;re talking to someone we trust (but whom we&#8217;ll have to pay if we hire you, so we&#8217;d still prefer to hear from you directly). If the answer is no, or if they waffle, then you shouldn&#8217;t waste your time with them. They&#8217;re not going to help you get a job at Cloudera.</p>
<p>If you want a job with us, we absolutely want to talk to you. Here&#8217;s some advice on how to reach us on your own.</p>
<p>First, don&#8217;t just email us your resume with a cover letter. It&#8217;s not that we don&#8217;t read those; it&#8217;s that it&#8217;s very, very hard to stand out from the crowd that way. Every company that&#8217;s hiring and posts jobs on its web site gets a lot of resumes by email. Reading all of them is hard work, and (shame on us) we might be a little bit hurried or distracted when yours comes in.</p>
<p>It&#8217;s much better to come in by way of an introduction from someone we already know. We know all the people who work here very well, of course. If one of them sends your name and resume along to a hiring manager, you can bet that it gets special attention. It turns out that Cloudera people are easy to find: We speak at conferences, attend trade shows, hang out in IRC and browse public forums about topics that matter to the company. We blog here, and many of us post on Twitter regularly &#8212; see <a href="http://twitter.com/cloudera/cloudera">@cloudera/cloudera</a>. Check out our <a href="http://www.cloudera.com/company/events/">events</a> page. Look for us online or in person at shows. Reach out, person to person, in those places.</p>
<p>You should really try to engage with the Cloudera person, of course. Don&#8217;t just ask for an intro; talk to us about the topics that you know matter to us. Show us, first, that you know a little bit about what we&#8217;re doing. If you&#8217;re a developer especially, making contributions to the open source projects we work on, or building cool applications on Cloudera&#8217;s Distribution for Hadoop, is a great way to show your chops.</p>
<p>We&#8217;ve got a referral program in place &#8212; when an employee brings us a candidate we hire, that person gets some extra Cloudera stock (we would much rather grant equity to our employees than write checks to recruiters!). As a result, if you&#8217;re a good fit for a position and can establish a personal rapport with one of our current employees, you&#8217;ll have an enthusiastic champion who&#8217;ll make sure we take a hard look at you. Everybody wins.</p>
<p>Before you reach out, though, do your homework. Figure out what we do. We put a lot of work into the web site &#8212; read the stuff there. Understand our products and customers. Know the role you&#8217;re interested in and why you&#8217;re a good fit. It&#8217;s really surprising to me how many people send us unsolicited resumes asking if we have any jobs open that would match their backgrounds. If you can&#8217;t do simple homework, you&#8217;re just not the kind of person we&#8217;re going to hire.</p>
<p>If you can&#8217;t get to us directly, take a look at our customers, partners and investors. Do you have contacts there? Personal introductions really do get noticed, and your effort in making them happen demonstrates, all by itself, that you&#8217;re motivated and clever and willing to do a little extra work.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/07/how-to-get-a-job-at-cloudera/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>

