<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; hadoop</title>
	<atom:link href="http://www.cloudera.com/blog/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>HBase + Hadoop + Xceivers</title>
		<link>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/</link>
		<comments>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/#comments</comments>
		<pubDate>Wed, 14 Mar 2012 17:00:14 +0000</pubDate>
		<dc:creator>Lars George</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13470</guid>
		<description><![CDATA[Introduction Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called &#8220;dfs.datanode.max.xcievers&#8221;, and belongs to the HDFS subproject. It defines the number of server side threads and &#8211; to some extent &#8211; sockets used for data connections. Setting this number too low can [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called &#8220;dfs.datanode.max.xcievers&#8221;, and belongs to the HDFS subproject. It defines the number of server side threads and &#8211; to some extent &#8211; sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.</p>
<h2>The Problem</h2>
<p>Since HBase stores everything it needs inside HDFS, the hard upper boundary imposed by the &#8220;dfs.datanode.max.xcievers&#8221; configuration property can leave too few resources available to HBase, which manifests itself as IOExceptions on either side of the connection. Here is an example from the HBase mailing list [1], where the following messages were initially logged on the RegionServer side: </p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2008-11-11 19:55:52,451 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream<br />2008-11-11 19:55:52,451 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-5467014108758633036_595771<br />2008-11-11 19:55:58,455 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.<br />2008-11-11 19:55:58,455 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5467014108758633036_595771 bad datanode[0]<br />2008-11-11 19:55:58,482 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server shutdown</p>
<p style="padding-top:12px">Correlating this with the Hadoop DataNode logs revealed the following entry:</p>
<p style="font-family: 'Courier New', Courier, mono;font-size: small; background-color:#CEE9FF;">ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.10.10.53:50010,storageID=DS-1570581820-10.10.10.53-50010-1224117842339,infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 258 exceeds the limit of concurrent xcievers 256  </p>
<p style="padding-top:12px">In this example, the low value of &#8220;dfs.datanode.max.xcievers&#8221; for the DataNodes caused the entire RegionServer to shut down. This is a really bad situation. Unfortunately, there is no hard-and-fast rule that explains how to compute the required limit. It is commonly advised to raise the number from the default of 256 to something like 4096 (see [1], [2], [3], [4], and [5] for reference). This is done by adding this property to the hdfs-site.xml file of all DataNodes (note that it is misspelled): </p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&lt;property&gt;<br />    &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;<br />    &lt;value&gt;4096&lt;/value&gt;<br />&lt;/property&gt;</p>
<p style="padding-top:12px">Note: You will need to restart your DataNodes after making this change to the configuration file.</p>
<p>This should help with the above problem, but you might still want to know how this all fits together, and what HBase is doing with these resources. We will discuss this in the remainder of this post. But before we do, we need to be clear about why you cannot simply set this number very high, say 64K, and be done with it.</p>
<p>There is a reason for an upper boundary, and it is twofold. First, threads need their own stack, which means they occupy memory. For current servers this means 1MB per thread[6] by default. In other words, if you use up all 4096 DataXceiver threads, you need around 4GB of memory to accommodate their stacks. On a machine that also hosts a RegionServer, this cuts into the space you can assign to memstores and block caches, as well as all the other moving parts of the JVM. In a worst case scenario, you might run into an OutOfMemoryError, and the affected process is toast. You want to set this property to a reasonably high number, but not too high either.</p>
<p>Second, with this many threads active you will also see your CPU becoming increasingly loaded. There will be many context switches to handle all the concurrent work, which takes resources away from the real work. As with the concerns about memory, you do not want the number of threads to grow boundlessly, but want a reasonable upper boundary &#8211; and that is what &#8220;dfs.datanode.max.xcievers&#8221; is for.</p>
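<p>The memory side of this argument is simple arithmetic. The following is a minimal sketch, assuming the 1MB default stack size quoted above; the class and method names are made up for illustration and are not part of Hadoop.</p>

```java
// Back-of-the-envelope estimate of thread stack memory; not part of Hadoop.
final class XceiverStackMemory {
    // Assumption taken from the text: roughly 1MB of stack per thread by default.
    static final long DEFAULT_STACK_BYTES = 1024L * 1024L;

    /** Returns the worst-case stack memory in bytes if every xceiver thread is in use. */
    static long worstCaseStackBytes(int maxXcievers, long stackBytesPerThread) {
        return (long) maxXcievers * stackBytesPerThread;
    }

    public static void main(String[] args) {
        // The default, the commonly advised value, and the "64K" strawman.
        int[] candidates = {256, 4096, 65536};
        for (int max : candidates) {
            long megabytes = worstCaseStackBytes(max, DEFAULT_STACK_BYTES) / (1024L * 1024L);
            System.out.println("max.xcievers=" + max + " -> about " + megabytes + " MB of thread stacks");
        }
    }
}
```

<p>With these assumptions, 4096 threads come out at about 4GB and 64K threads at about 64GB, which is why an arbitrarily high limit is not a safe default. The actual footprint depends on the JVM's stack size setting and on how many threads are really in use at once.</p>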
<h2>Hadoop File System Details</h2>
<p>On the client side, the HDFS library provides an abstraction called Path. This class represents a file in a file system supported by Hadoop, which in turn is represented by the FileSystem class. There are a few concrete implementations of the abstract FileSystem class, one of which is DistributedFileSystem, representing HDFS. This class in turn wraps the actual DFSClient class that handles all interactions with the remote servers, i.e. the NameNode and the many DataNodes.</p>
<p>When a client, such as HBase, opens a file, it does so by, for example, calling the open() or create() methods of the FileSystem class. Here are their simplest incarnations:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  public FSDataInputStream open(Path f) throws IOException<br />  public FSDataOutputStream create(Path f) throws IOException</p>
<p style="padding-top:12px">The returned stream instance is what needs a server-side socket and thread, which are used to read and write blocks of data. They form part of the contract to exchange data between the client and server. Note that there are other, RPC-based protocols in use between the various machines, but for the purpose of this discussion they can be ignored.</p>
<p>The returned stream is backed by a specialized DFSInputStream or DFSOutputStream instance, which handles all of the interaction with the NameNode to figure out where the copies of the blocks reside, and the data communication per block per DataNode.</p>
<p>On the server side, the DataNode wraps an instance of DataXceiverServer, which is the actual class that reads the above configuration key and also throws the above exception when the limit is exceeded.</p>
<p>When the DataNode starts, it creates a thread group and starts the mentioned DataXceiverServer instance like so:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  this.threadGroup = new ThreadGroup(&quot;dataXceiverServer&quot;);<br />  this.dataXceiverServer = new Daemon(threadGroup,<br />      new DataXceiverServer(ss, conf, this));<br />  this.threadGroup.setDaemon(true); // auto destroy when empty </p>
<p style="padding-top:12px">Note that the DataXceiverServer thread is already taking up one spot of the thread group. The DataNode also has this internal method to retrieve the number of currently active threads in this group:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  /** Number of concurrent xceivers per node. */<br />  int getXceiverCount() {<br />    return threadGroup == null ? 0 : threadGroup.activeCount();<br />  }</p>
<p style="padding-top:12px">Reading and writing blocks, as initiated by the client, causes a connection to be made, which the DataXceiverServer thread wraps into a DataXceiver instance. During this hand-off, a thread is created and registered in the above thread group. So for every active read and write operation a new thread is tracked on the server side. If the count of threads in the group exceeds the configured maximum, then the said exception is thrown and recorded in the DataNode&#8217;s logs:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  if (curXceiverCount &gt; dataXceiverServer.maxXceiverCount) {<br />    throw new IOException(&quot;xceiverCount &quot; + curXceiverCount<br />                          + &quot; exceeds the limit of concurrent xcievers &quot;<br />                          + dataXceiverServer.maxXceiverCount);<br />  }</p>
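<p>The interplay of thread group, counter, and limit can be tried out in isolation. The following is a simplified stand-in, not the Hadoop implementation: the class XceiverLimitSketch and its methods accept(), holdConnection(), and releaseAll() are hypothetical, the check runs in the accepting thread rather than in the new worker, and no server thread occupies a slot of its own. Keep in mind that the Java documentation describes ThreadGroup.activeCount() as an estimate.</p>

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Simplified stand-in for the DataNode's thread accounting; all names here are
// hypothetical and not taken from the Hadoop source.
final class XceiverLimitSketch {
    private final ThreadGroup threadGroup = new ThreadGroup("dataXceiverServer");
    private final CountDownLatch release = new CountDownLatch(1);
    private final int maxXceiverCount;

    XceiverLimitSketch(int maxXceiverCount) {
        this.maxXceiverCount = maxXceiverCount;
    }

    /** Mirrors getXceiverCount(): the number of live threads in the group. */
    int getXceiverCount() {
        return threadGroup.activeCount();
    }

    /** Pretends to stream a block until releaseAll() is called or the thread is interrupted. */
    private void holdConnection() {
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts one worker thread per connection and fails once the limit is exceeded. */
    void accept(String client) throws IOException {
        Thread worker = new Thread(threadGroup, this::holdConnection,
                "DataXceiver for client " + client);
        worker.setDaemon(true);
        worker.start();
        int curXceiverCount = getXceiverCount();
        if (curXceiverCount > maxXceiverCount) {
            worker.interrupt(); // give the slot back before reporting the failure
            throw new IOException("xceiverCount " + curXceiverCount
                    + " exceeds the limit of concurrent xcievers " + maxXceiverCount);
        }
    }

    /** Lets all held connections finish. */
    void releaseAll() {
        release.countDown();
    }

    public static void main(String[] args) {
        XceiverLimitSketch node = new XceiverLimitSketch(3);
        try {
            for (int i = 1; i <= 5; i++) {
                node.accept("/127.0.0.1:" + (64000 + i));
                System.out.println("Number of active connections is: " + node.getXceiverCount());
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } finally {
            node.releaseAll();
        }
    }
}
```

<p>With a limit of three, the first three connections are accepted and the fourth one is rejected with the same kind of message that shows up in the DataNode log above.</p>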
<h2 style="padding-top:12px">Implications for Clients</h2>
<p>Now, the question is: how does the client reading and writing relate to the server-side threads? Before we go into the details though, let&#8217;s use the debug information that the DataXceiver class logs when it is created and closed:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">  LOG.debug(&quot;Number of active connections is: &quot; + datanode.getXceiverCount());<br />  &#8230;<br />  LOG.debug(datanode.dnRegistration + &quot;:Number of active connections is: &quot;<br />      + datanode.getXceiverCount());</p>
<p style="padding-top:12px">We then monitor what is logged on the DataNode while HBase starts. For simplicity&#8217;s sake this is done on a pseudo-distributed setup with a single DataNode and RegionServer instance. The following shows the top of the RegionServer&#8217;s status page.</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverScreen1.png"><img class="alignnone size-full wp-image-13480" src="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverScreen1.png" alt="" width="545" height="294" /></a> </p>
<p>The important part is in the &#8220;Metrics&#8221; section, where it says &#8220;storefiles=22&#8221;. So, assuming that HBase has at least that many files to handle, plus some extra files for the write-ahead log, we should see the above log message state that we have at least 22 &#8220;active connections&#8221;. Let&#8217;s start HBase and check the DataNode and RegionServer log files:</p>
<p>Command Line:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ bin/start-hbase.sh<br />&#8230;</p>
<p style="padding-top:12px">DataNode Log:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 13:01:35,309 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 1<br />2012-03-05 13:01:35,315 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 2<br />12/03/05 13:01:35 INFO regionserver.MemStoreFlusher: globalMemStoreLimit=396.7m, globalMemStoreLimitLowMark=347.1m, maxHeap=991.7m<br />12/03/05 13:01:39 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 60030<br />2012-03-05 13:01:40,003 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 1<br />12/03/05 13:01:40 INFO regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052<br />2012-03-05 13:01:40,882 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:40,884 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />2012-03-05 13:01:40,888 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />&#8230;<br />12/03/05 13:01:40 INFO regionserver.HRegion: Onlined -ROOT-,,0.70236052; next sequenceid=63083<br />2012-03-05 13:01:40,982 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:40,983 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request 
to open region: .META.,,1.1028785192<br />2012-03-05 13:01:41,026 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:41,027 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined .META.,,1.1028785192; next sequenceid=63082<br />2012-03-05 13:01:41,109 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 3<br />2012-03-05 13:01:41,114 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 4<br />2012-03-05 13:01:41,117 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 5<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open 16 region(s)<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open region: usertable,,1330944810191.62a312d67981c86c42b6bc02e6ec7e3f.<br />12/03/05 13:01:41 INFO regionserver.HRegionServer: Received request to open region: usertable,user1120311784,1330944810191.90d287473fe223f0ddc137020efda25d.<br />&#8230;</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 13:01:41,246 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,248 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />&#8230;<br />2012-03-05 13:01:41,257 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 10<br />2012-03-05 13:01:41,257 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 9<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user1120311784,1330944810191.90d287473fe223f0ddc137020efda25d.; next sequenceid=62917<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,,1330944810191.62a312d67981c86c42b6bc02e6ec7e3f.; next sequenceid=62916<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user1361265841,1330944811370.80663fcf291e3ce00080599964f406ba.; next sequenceid=62919<br />2012-03-05 13:01:41,474 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,491 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />2012-03-05 13:01:41,495 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 8<br />2012-03-05 13:01:41,508 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />&#8230;<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined 
usertable,user1964968041,1330944848231.dd89596e9129e1caa7e07f8a491c9734.; next sequenceid=62920<br />2012-03-05 13:01:41,618 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,621 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />&#8230;<br />2012-03-05 13:01:41,829 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user515290649,1330944849739.d23924dc9e9d5891f332c337977af83d.; next sequenceid=62926<br />2012-03-05 13:01:41,832 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 13:01:41,838 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 7<br />12/03/05 13:01:41 INFO regionserver.HRegion: Onlined usertable,user757669512,1330944850808.cd0d6f16d8ae9cf0c9277f5d6c6c6b9f.; next sequenceid=62929<br />&#8230;<br />2012-03-05 14:01:39,711 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 4<br />2012-03-05 22:48:41,945 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4<br />12/03/05 22:48:41 INFO regionserver.HRegion: Onlined usertable,user757669512,1330944850808.cd0d6f16d8ae9cf0c9277f5d6c6c6b9f.; next sequenceid=62929<br />2012-03-05 22:48:41,963 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 4</p>
<p style="padding-top:12px">You can see how the regions are opened one after the other, but what you also might notice is that the number of active connections never climbs to 22 &#8211; it barely even reaches 10. Why is that? To understand this better, we have to see how files in HDFS map to the server-side DataXceiver instances &#8211; and the actual threads they represent. </p>
<h2>Hadoop Deep Dive</h2>
<p>The aforementioned DFSInputStream and DFSOutputStream are really facades around the usual stream concepts. They wrap the client-server communication into these standard Java interfaces, while internally routing the traffic to a selected DataNode &#8211; which is the one that holds a copy of the current block. The stream is free to open and close these connections as needed. As a client reads a file in HDFS, the client library classes switch transparently from block to block, and therefore from DataNode to DataNode, so connections have to be opened and closed along the way. </p>
<p>The DFSInputStream has an instance of a DFSClient.BlockReader class that opens the connection to the DataNode. The stream instance calls blockSeekTo() for every call to read(), which takes care of opening the connection if there is none already. Once a block is completely read, the connection is closed. Closing the stream has the same effect, of course. </p>
<p>The DFSOutputStream has a similar helper class, the DataStreamer. It tracks the connection to the server, which is initiated by the nextBlockOutputStream() method. It has further internal classes that help with writing the block data out, which we omit here for the sake of brevity.</p>
<p>Both writing and reading blocks require a thread to hold the socket and intermediate data on the server side, wrapped in the DataXceiver instance. Depending on what your client is doing, you will see the number of connections fluctuate around the number of currently accessed files in HDFS.</p>
<p>Back to the HBase riddle above: the reason you do not see up to 22 (and more) connections during the start is that while the regions open, the only required data is the HFile&#8217;s info block. This block is read to gain vital details about each file, but then closed again. This means that the server-side resource is released in quick succession. The remaining four connections are harder to determine. You can use JStack to dump all threads on the DataNode, which in this example shows this entry:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&quot;DataXceiver for client /127.0.0.1:64281 [sending block blk_5532741233443227208_4201]&quot; daemon prio=5 tid=7fb96481d000 nid=0x1178b4000 runnable [1178b3000]<br />   java.lang.Thread.State: RUNNABLE<br />   &#8230;</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&quot;DataXceiver for client /127.0.0.1:64172 [receiving block blk_-2005512129579433420_4199 client=DFSClient_hb_rs_10.0.0.29,60020,1330984111693_1330984118810]&quot; daemon prio=5 tid=7fb966109000 nid=0x1169cb000 runnable [1169ca000]<br />   java.lang.Thread.State: RUNNABLE<br />   &#8230;</p>
<p style="padding-top:12px">These are the only DataXceiver entries (in this example), so the count in the thread group is a bit misleading. Recall that the DataXceiverServer daemon thread already accounts for one extra entry, which combined with the two above accounts for the three active connections &#8211; which in fact means three active threads. The reason the log states four instead is that it logs the count from an active thread that is about to finish. So, shortly after the count of four is logged, the count is actually one less, i.e. three, and hence matches our head count of active threads.</p>
<p>Also note that the internal helper classes, such as the PacketResponder, occupy another thread in the group while being active. The JStack output does indicate that fact, listing the thread as such:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&quot;PacketResponder 0 for Block blk_-2005512129579433420_4199&quot; daemon prio=5 tid=7fb96384d000 nid=0x116ace000 in Object.wait() [116acd000]<br />   java.lang.Thread.State: TIMED_WAITING (on object monitor)<br />     at java.lang.Object.wait(Native Method)<br />     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder \<br />       .lastDataNodeRun(BlockReceiver.java:779)<br />     - locked &lt;7bc79c030&gt; (a org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder)<br />     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)<br />     at java.lang.Thread.run(Thread.java:680)</p>
<p style="padding-top:12px">This thread is currently in the TIMED_WAITING state and is not considered active. That is why the count emitted by the DataXceiver log statements does not include this kind of thread. If it becomes active because the client is sending data, the active thread count will go up again. Another thing to note is that this thread does not need a separate connection, or socket, between the client and the server. The PacketResponder is just a thread on the server side to receive block data and stream it to the next DataNode in the write pipeline.</p>
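<p>Rather than reading a thread dump by eye, you can tally the entries by thread name. The following is a minimal sketch: the helper class JstackThreadTally is made up for illustration, and the sample input is shortened from the thread headers shown above (the stack frames and most header fields are left out).</p>

```java
// Counts thread headers in a jstack dump by thread-name prefix.
// The helper and the shortened sample are illustrative, not part of Hadoop.
final class JstackThreadTally {
    /** Counts lines that open a thread entry whose quoted name starts with namePrefix. */
    static int countThreads(String dump, String namePrefix) {
        String marker = "\"" + namePrefix;
        int count = 0;
        for (String line : dump.split("\n")) {
            if (line.startsWith(marker)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Thread headers shortened from the dump shown in this post.
        String dump = String.join("\n",
                "\"DataXceiver for client /127.0.0.1:64281 [sending block blk_5532741233443227208_4201]\" daemon",
                "   java.lang.Thread.State: RUNNABLE",
                "\"DataXceiver for client /127.0.0.1:64172 [receiving block blk_-2005512129579433420_4199]\" daemon",
                "   java.lang.Thread.State: RUNNABLE",
                "\"PacketResponder 0 for Block blk_-2005512129579433420_4199\" daemon",
                "   java.lang.Thread.State: TIMED_WAITING (on object monitor)");
        System.out.println("DataXceiver threads: " + countThreads(dump, "DataXceiver for client"));
        System.out.println("PacketResponder threads: " + countThreads(dump, "PacketResponder"));
    }
}
```

<p>For the sample above this reports two DataXceiver threads and one PacketResponder thread, matching the head count in the text. The thread name prefixes are the ones visible in this dump and may differ in other Hadoop versions.</p>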
<p>The Hadoop fsck command also has an option to report what files are currently open for writing:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ hadoop fsck /hbase -openforwrite<br />FSCK started by larsgeorge from /10.0.0.29 for path /hbase at Mon Mar 05 22:59:47 CET 2012<br />&#8230;&#8230;/hbase/.logs/10.0.0.29,60020,1330984111693/10.0.0.29%3A60020.1330984118842 0 bytes, 1 block(s), OPENFORWRITE: &#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;..Status: HEALTHY<br /> Total size:     2088783626 B<br /> Total dirs:     54<br /> Total files:    45<br /> &#8230;</p>
<p>This does not immediately relate to an occupied server-side thread, as these are allocated by block ID. But you can glean from it that there is one block open for writing. The Hadoop command has additional options to print out the actual files and the block IDs they are made up of:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">$ hadoop fsck /hbase -files -blocks<br />FSCK started by larsgeorge from /10.0.0.29 for path /hbase at Tue Mar 06 10:39:50 CET 2012</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&#8230;<br />/hbase/.META./1028785192/.tmp &lt;dir&gt;<br />/hbase/.META./1028785192/info &lt;dir&gt;<br />/hbase/.META./1028785192/info/4027596949915293355 36517 bytes, 1 block(s):  OK<br />0. blk_5532741233443227208_4201 len=36517 repl=1</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">&#8230;<br />Status: HEALTHY<br /> Total size:     2088788703 B<br /> Total dirs:     54<br /> Total files:     45 (Files currently being written: 1)<br /> Total blocks (validated):     64 (avg. block size 32637323 B) (Total open file blocks (not validated): 1)<br /> Minimally replicated blocks:     64 (100.0 %)<br /> &#8230;</p>
<p style="padding-top:12px">This gives you two things. First, the summary states that there is one open file block at the time the command ran &#8211; matching the count reported by the &#8220;-openforwrite&#8221; option above. Secondly, the list of blocks next to each file lets you match the thread name to the file that contains the block being accessed. In this example the block with the ID &#8220;blk_5532741233443227208_4201&#8221; is sent from the server to the client, here a RegionServer. This block belongs to the HBase .META. table, as shown by the output of the Hadoop fsck command. The combination of JStack and fsck can serve as a poor man&#8217;s replacement for lsof (a tool on the Linux command line to &#8220;list open files&#8221;).</p>
<p>The JStack output also reports that there is a DataXceiver thread, with an accompanying PacketResponder, for block ID &#8220;blk_-2005512129579433420_4199&#8221;, but this ID is missing from the list of blocks reported by fsck. This is because the block is not yet finished and therefore not available to readers. In other words, Hadoop fsck only reports on complete (or synced[7][8], for Hadoop versions that support this feature) blocks. </p>
<h2>Back to HBase</h2>
<p>Opening all the regions does not need as many resources on the server as you would have expected. If you scan the entire HBase table though, you force HBase to read all of the blocks in all HFiles: </p>
<p>HBase Shell:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">hbase(main):003:0&gt; scan &#039;usertable&#039;<br />&#8230;<br />1000000 row(s) in 1460.3120 seconds</p>
<p style="padding-top:12px">DataNode Log:</p>
<p style="font-family: 'Courier New', Courier, mono; background-color:#CEE9FF;">2012-03-05 14:42:20,580 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 6<br />2012-03-05 14:43:23,293 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 7<br />2012-03-05 14:43:23,299 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 8<br />&#8230;<br />2012-03-05 14:49:24,332 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 11<br />2012-03-05 14:49:24,332 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 10<br />2012-03-05 14:49:59,987 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 11<br />2012-03-05 14:51:12,603 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 12<br />2012-03-05 14:51:12,605 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 11<br />2012-03-05 14:51:46,473 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 12<br />&#8230;<br />2012-03-05 14:56:59,420 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 15<br />2012-03-05 14:57:31,722 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 16<br />2012-03-05 14:58:24,909 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, 
ipcPort=50020):Number of active connections is: 17<br />2012-03-05 14:58:24,910 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 16<br />&#8230;<br />2012-03-05 15:04:17,688 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21<br />2012-03-05 15:04:17,689 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 22<br />2012-03-05 15:04:54,545 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21<br />2012-03-05 15:05:55,901 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1423642448-10.0.0.64-50010-1321352233772, infoPort=50075, ipcPort=50020):Number of active connections is: 22<br />2012-03-05 15:05:55,901 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Number of active connections is: 21</p>
<p style="padding-top:12px">The number of active connections reaches the elusive 22 now. Note that this count already includes the server thread, so we are still a little short of what we could consider the theoretical maximum &#8211; based on the number of files HBase has to handle.</p>
<h2>What does that all mean?</h2>
<p>So, how many &#8220;xcievers (sic)&#8221; do you need? Given you only use HBase, you could simply monitor the above &#8220;storefiles&#8221; metric (which you get also through Ganglia or JMX) and add a few percent for intermediate and write-ahead log files. This should work for systems in motion. However, if you were to determine that number on an idle, fully compacted system and assume it is the maximum, you might find this number being too low once you start adding more store files during regular memstore flushes, i.e. as soon as you start to add data to the HBase tables. Or if you also use MapReduce on that same cluster, Flume log aggregation, and so on. You will need to account for those extra files, and, more importantly, open blocks for reading and writing. </p>
<p>Note again that the examples in this post are using a single DataNode, something you will not have on a real cluster. To that end, you will have to divide the total number of store files (as per the HBase metric) by the number of DataNodes you have. If you have, for example, a store file count of 1000, and your cluster has 10 DataNodes, then each DataNode has to serve roughly 100 of those files, and you should be OK with the default of 256 xceiver threads per DataNode.</p>
<p>The worst case would be the number of all active readers and writers, i.e. those that are currently sending or receiving data. But since this is hard to determine ahead of time, you might want to consider building in a decent reserve. Also, since the writing process needs an extra &#8211; although shorter lived &#8211; thread (for the PacketResponder) you have to account for that as well. So a reasonable, but rather simplistic formula could be:</p>
<p> <a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula1.png"><img class="alignnone  wp-image-13479" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula1.png" alt="" width="433" height="47" /></a></p>
<p>This formula takes into account that you need about two threads for an active writer and another for an active reader. This is then summed up and divided by the number of DataNodes, since you have to specify the &#8220;dfs.datanode.max.xcievers&#8221; per DataNode.</p>
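<p>As a rough sketch, the calculation described above can be written in Java as follows. The formula image itself is not reproduced here, so the weights (two threads per active writer, one per active reader) are taken from the prose. Rounding up, the class and method names, and the sample numbers are assumptions made for illustration only.</p>
<pre class="code">public class SimpleXceiverEstimate {
  // An active writer needs a DataXceiver plus the shorter-lived PacketResponder.
  static final int THREADS_PER_WRITER = 2;
  // An active reader needs a single thread.
  static final int THREADS_PER_READER = 1;

  // Estimated xceiver threads per DataNode (rounded up, an assumption).
  static int perDataNode(int activeWriters, int activeReaders, int dataNodes) {
    int total = activeWriters * THREADS_PER_WRITER
        + activeReaders * THREADS_PER_READER;
    return (int) Math.ceil((double) total / dataNodes);
  }

  public static void main(String[] args) {
    // Hypothetical cluster: 50 active writers, 1000 active readers, 10 DataNodes.
    System.out.println(perDataNode(50, 1000, 10)); // prints 110
  }
}</pre>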
<p>If you loop back to the HBase RegionServer screenshot above, you saw that there were 22 store files. These are immutable and will only be read, or in other words occupy one thread only. For all memstores that are flushed to disk you need two threads &#8211; but only until they are fully written. The files are finalized and closed for good, cleaning up any thread in the process. So these come and go based on your flush frequency. Same goes for compactions, they will read N files and write them into a single new one, then finalize the new file. As for the write-ahead logs, these will occupy a thread once you have started to add data to any table. There is a log file per server, meaning that you can only have twice as many active threads for these files as you have RegionServers.</p>
<p>For a pure HBase setup (HBase plus its own HDFS, with no other user), we can estimate the number of needed DataXceivers with the following formula:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula2.png"><img class="alignnone  wp-image-13478" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula2.png" alt="" width="782" height="47" /></a></p>
<p>Since you will be hard pressed to determine the <em>active</em> number of store files, flushes, and so on, it might be better to estimate the theoretical maximum instead. This maximum value takes into account that you can only have a single flush and compaction active per region at any time. The maximum number of logs you can have active matches the number of RegionServers, leading us to this formula:</p>
<p>  <a href="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula31.png"><img class="alignnone  wp-image-13572" src="http://www.cloudera.com/wp-content/uploads/2012/03/HadoopHBaseXceiverFormula31.png" alt="" width="581" height="49" /></a></p>
<p>Obviously, the number of store files will increase over time, and typically so will the number of regions. The same goes for the number of servers, so remember to adjust this value over time. In practice, you can add a buffer of, for example, 20%, as shown in the formula below &#8211; in an attempt to not force you to change the value too often.</p>
<p>On the other hand, if you keep the number of regions fixed per server[9], and rather split them manually, while adding new servers as you grow, you should be able to keep this configuration property stable for each server.</p>
<h2>Final Advice &amp; TL;DR</h2>
<p>Here is the final formula you want to use:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverFormula4.png"><img class="alignnone  wp-image-13570" src="http://www.cloudera.com/wp-content/uploads/2012/05/HadoopHBaseXceiverFormula4.png" alt="" width="611" height="47" /></a></p>
<p>It computes the maximum number of threads needed, based on your current HBase vitals (no. of store files, regions, and region servers). It also adds a fudge factor of 20% to give you room for growth. Keep an eye on the numbers on a regular basis and adjust the value as needed. You might want to use Nagios with appropriate checks to warn you when any of the vitals goes over a certain percentage of change.</p>
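<p>The following Java sketch shows one way to compute such a value. Since the formula is only available as an image, the per-term weights are reconstructed from the prose and should be read as assumptions: one thread per store file, two threads for at most one flush and two for at most one compaction per region, and two threads per RegionServer for its write-ahead log. The names and sample numbers are hypothetical.</p>
<pre class="code">public class MaxXceiverEstimate {
  // reserve is the growth buffer as a fraction, e.g. 0.2 for 20%.
  static int perDataNode(int storeFiles, int regions, int regionServers,
      int dataNodes, double reserve) {
    long readers = storeFiles;          // immutable store files: one thread each
    long flushes = 2L * regions;        // one flush per region, two threads while writing
    long compactions = 2L * regions;    // one compaction per region writing one new file
    long logs = 2L * regionServers;     // one write-ahead log per RegionServer
    double total = readers + flushes + compactions + logs;
    // Divide across DataNodes, add the reserve, and round up.
    return (int) Math.ceil(total / dataNodes * (1.0 + reserve));
  }

  public static void main(String[] args) {
    // Hypothetical vitals: 1000 store files, 100 regions,
    // 10 RegionServers, 10 DataNodes, 20% reserve.
    System.out.println(perDataNode(1000, 100, 10, 10, 0.2)); // prints 171
  }
}</pre>
<p>With these sample numbers the result stays below the default of 256, so the default would still be sufficient. Rerun the calculation whenever the vitals change noticeably.</p>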
<p>Note: Please make sure you also adjust the number of file handles your process is allowed to use accordingly[10]. This affects the number of sockets you can use, and if that number is too low (default is often 1024), you will get connection issues first. </p>
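<p>If you want to check the limit from within a JVM, a minimal sketch could look like the one below. It assumes a Unix-like system and a HotSpot-style JVM that exposes com.sun.management.UnixOperatingSystemMXBean, and it only reports the limit of the JVM running the snippet, not that of an already running DataNode. The class name is hypothetical.</p>
<pre class="code">import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FileHandleCheck {
  // Returns the file descriptor limit of this JVM process,
  // or -1 if the platform does not expose it.
  static long maxFileDescriptors() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof UnixOperatingSystemMXBean) {
      return ((UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
    }
    return -1L;
  }

  public static void main(String[] args) {
    System.out.println("Max file descriptors: " + maxFileDescriptors());
  }
}</pre>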
<p>Finally, the engineering devil on one of your shoulders should already have started to snicker about how horribly non-Erlang-y this is, and how you should use an event driven approach, possibly using Akka with Scala[11] &#8211; if you want to stay within the JVM world. Bear in mind though that the clever developers in the community share the same thoughts and have already started to discuss various approaches[12][13]. </p>
<h2>Links:</h2>
<ul>
<li>[1] <a href="http://old.nabble.com/Re%3A-xceiverCount-257-exceeds-the-limit-of-concurrent-xcievers-256-p20469958.html">http://old.nabble.com/Re%3A-xceiverCount-257-exceeds-the-limit-of-concurrent-xcievers-256-p20469958.html</a></li>
<li>[2] <a href="http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html">http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html</a></li>
<li>[3] <a href="https://issues.apache.org/jira/browse/HDFS-1861">https://issues.apache.org/jira/browse/HDFS-1861</a> &#8220;Rename dfs.datanode.max.xcievers and bump its default value&#8221;</li>
<li>[4] <a href="https://issues.apache.org/jira/browse/HDFS-1866">https://issues.apache.org/jira/browse/HDFS-1866</a> &#8220;Document dfs.datanode.max.transfer.threads in hdfs-default.xml&#8221;</li>
<li>[5] <a href="http://hbase.apache.org/book.html#dfs.datanode.max.xcievers">http://hbase.apache.org/book.html#dfs.datanode.max.xcievers</a></li>
<li>[6] <a href="http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom">http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom</a></li>
<li>[7] <a href="https://issues.apache.org/jira/browse/HDFS-200">https://issues.apache.org/jira/browse/HDFS-200</a> &#8220;In HDFS, sync() not yet guarantees data available to the new readers&#8221;</li>
<li>[8] <a href="https://issues.apache.org/jira/browse/HDFS-265">https://issues.apache.org/jira/browse/HDFS-265</a> &#8220;Revisit append&#8221;</li>
<li>[9] <a href="http://search-hadoop.com/m/CBBoV3z24H1">http://search-hadoop.com/m/CBBoV3z24H1</a> &#8220;HBase, mail # user &#8211; region size/count per regionserver&#8221;</li>
<li>[10] <a href="http://hbase.apache.org/book.html#ulimit">http://hbase.apache.org/book.html#ulimit</a> &#8220;ulimit and nproc&#8221;</li>
<li>[11] <a href="http://akka.io/">http://akka.io/</a> &#8220;Akka&#8221;</li>
<li>[12] <a href="https://issues.apache.org/jira/browse/HDFS-223">https://issues.apache.org/jira/browse/HDFS-223</a> &#8220;Asynchronous IO Handling in Hadoop and HDFS&#8221;</li>
<li>[13] <a href="https://issues.apache.org/jira/browse/HDFS-918">https://issues.apache.org/jira/browse/HDFS-918</a> &#8220;Use single Selector and small thread pool to replace many instances of BlockSender for reads&#8221;</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Indexing Files via Solr and Java MapReduce</title>
		<link>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/</link>
		<comments>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 13:00:25 +0000</pubDate>
		<dc:creator>Adam Smieszny</dc:creator>
				<category><![CDATA[CDH]]></category>
		<category><![CDATA[cloudera manager]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=13314</guid>
		<description><![CDATA[Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one. What follows is my bare-bones tutorial on getting Solr up and running to index each [...]]]></description>
			<content:encoded><![CDATA[<p>Several weeks ago, I set about to demonstrate the ease with which <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank">Solr</a> and <a title="Cloudera Distribution Including Apache Hadoop" href="http://www.cloudera.com/hadoop/" target="_blank">Map/Reduce</a> can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.</p>
<p>What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to <a title="Sematext - Solr experts." href="http://sematext.com/" target="_blank">Sematext</a> for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.</p>
<h2 style="font-size:13pt">First things first</h2>
<p>The way that I got started was by instantiating a new CentOS 6 Virtual Machine. You can pick a different flavor of Linux if that suits you; Hadoop <em>should</em> work fine on any (though advocated distros are SuSE, Ubuntu/Debian, RedHat/CentOS).</p>
<p>If you are fine with CentOS and want to skip some of the manual labor here, you can download a pre-loaded Virtual Machine from the <a title="Cloudera Downloads" href="https://ccp.cloudera.com/display/SUPPORT/Downloads" target="_blank">Cloudera Downloads section</a>, that already includes an installation of Sun Java 6u21 and CDH3u3. You can then skip ahead to installing Solr and downloading sample data as outlined below.</p>
<p>If you are starting with a new (virtual) machine, do the following: make sure to disable SELinux, if applicable, and enable sshd. On CentOS 6, that was done with the following commands:</p>
<pre class="code">[user@localhost ~]$ sudo chkconfig --levels 2345 sshd on
[user@localhost ~]$ sudo /etc/init.d/sshd start
[user@localhost ~]$ sudo vim /etc/selinux/config [set SELINUX=disabled]</pre>
<p style="padding-top:10px">On that machine, download and install:</p>
<ul>
<li><strong>Java</strong> &#8211; I&#8217;d recommend Java 6u26 as it has been tested with CDH<br /> <a href="http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html">http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html</a><br /> Oracle Java never seems to play nicely with the /etc/alternatives system (in my experience), so I force it to be the preferred JRE the old-fashioned way:</li>
</ul>
<pre class="code">[user@localhost ~]$ sudo rm /usr/bin/java
[user@localhost ~]$ sudo ln -s /usr/java/jdk1.6.0_26/jre/bin/java \
 /usr/bin/java</pre>
<ul style="padding-top:10px">
<li><strong>Solr</strong> &#8211; download and unzip/untar in whatever directory you like. For the purpose of this article, I&#8217;ll refer to it as &lt;solr_install_dir&gt;<br /> <a href="http://lucene.apache.org/solr/downloads.html">http://lucene.apache.org/solr/downloads.html</a></li>
<li><strong>Hadoop</strong> &#8211; I am, obviously, biased. But my recommendation would be to use Cloudera Manager (free up to 50 nodes) to set up your VM as a <a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html" target="_blank">pseudo-distributed cluster</a><br /> <a href="http://www.cloudera.com/products-services/tools/">http://www.cloudera.com/products-services/tools/</a></li>
<li><strong>Sample data</strong> &#8211; Complete works of William Shakespeare. I&#8217;d recommend unzipping into a single directory. I&#8217;ll refer to it as &lt;shakespeare&gt;<br /> <a href="http://www.ipl.org/div/shakespeare/">http://www.ipl.org/div/shakespeare/</a></li>
</ul>
<p>You can validate that all of the pieces are installed and running correctly by doing the following:</p>
<ul>
<li><strong>Java</strong></li>
</ul>
<pre class="code">[user@localhost ~]$ java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)</pre>
<ul style="padding-top:10px">
<li><strong>Solr</strong></li>
</ul>
<pre class="code">[user@localhost ~]$ cd &lt;solr_install_dir&gt;/example
[user@localhost ~]$ java -jar start.jar</pre>
<p style="padding-top:10px">The goal is to get up and running quickly here, so I am opting to use the Solr example configuration. Worth noting also that when run in this manner, the Solr server will be started with the default JVM heap size &#8211; which I believe to be the smaller of {1/4 system memory or 1GB}.</p>
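<p>If you want to see which maximum heap size a JVM actually picked on your machine, a small sketch like the following can help. It reports the value for the JVM running the snippet, so it only approximates what Solr gets when both are started with the same Java binary and without explicit heap flags. The class name is hypothetical.</p>
<pre class="code">public class HeapCheck {
  // Maximum heap the current JVM will attempt to use, in megabytes.
  static long maxHeapMegabytes() {
    return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
  }

  public static void main(String[] args) {
    System.out.println("Max heap (MB): " + maxHeapMegabytes());
  }
}</pre>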
<p>Now, you should be able to access the Solr administration GUI (one of the niceties of Solr!) via a web browser inside your VM with the address: http://localhost:8983/solr/admin</p>
<ul>
<li><strong>Hadoop</strong></li>
</ul>
<p>You can validate that Hadoop is installed and running successfully by navigating in your VM&#8217;s browser to: http://localhost:7180, logging in as admin/admin, and seeing the following:</p>
<p><a href="http://www.cloudera.com/wp-content/uploads/2012/03/healthy_hadoop_scm.png"><img class="alignnone size-full wp-image-13316" src="http://www.cloudera.com/wp-content/uploads/2012/03/healthy_hadoop_scm.png" alt="A Healthy Hadoop (pseudo)cluster" width="899" height="295" /></a></p>
<p>You probably want to export the client config XML files (can be done with a single click via Cloudera Manager &#8211; see the Generate Client Configuration button), copy them to /usr/lib/hadoop/conf, and then copy the sample text into HDFS:</p>
<pre class="code">[user@localhost ~]$ hadoop fs -put &lt;shakespeare&gt; shakespeare</pre>
<h2 style="padding-top:10px;font-size:13pt">Creating the indexing code</h2>
<p>I have some history with Lucene from a past life, so the high level functionality of Solr was familiar to me. In a nutshell, you index files within Java code by creating a SolrInputDocument, which represents a single entity to index &#8211; a file or document generally &#8211; and using the .addField() method to attach the fields that you&#8217;d later like to search.</p>
<p>The driver code for the indexer is very simple, in that it takes input file path(s) off the command line, and runs the mapper on the files that it finds. Note that it will accept a directory, and parse all of the files that it finds within.</p>
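<pre class="code">import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexDriver extends Configured implements Tool {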

  public static void main(String[] args) throws Exception {
    //TODO: Add some checks here to validate the input path
    int exitCode = ToolRunner.run(new Configuration(),
     new IndexDriver(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = <strong>new</strong> JobConf(getConf(), IndexDriver.<strong>class</strong>);
    conf.setJobName("Index Builder - Adam S @ Cloudera");
    conf.setSpeculativeExecution(<strong>false</strong>);

    // Set Input and Output paths
    FileInputFormat.<em>setInputPaths</em>(conf, <strong>new</strong> Path(args[0].toString()));
    FileOutputFormat.<em>setOutputPath</em>(conf, <strong>new</strong> Path(args[1].toString()));
    // Use TextInputFormat
    conf.setInputFormat(TextInputFormat.<strong>class</strong>);

    // Mapper has no output
    conf.setMapperClass(IndexMapper.<strong>class</strong>);
    conf.setMapOutputKeyClass(NullWritable.<strong>class</strong>);
    conf.setMapOutputValueClass(NullWritable.<strong>class</strong>);
    conf.setNumReduceTasks(0);
    JobClient.<em>runJob</em>(conf);
    <strong>return</strong> 0;
  }
}</pre>
<p style="padding-top:10px">The Map code is where things get more interesting. A couple notes before we proceed:</p>
<p><em>Solr servers may be used in 2 ways:</em></p>
<ol>
<li>Via embedding a Solr server object within your Java code using EmbeddedSolrServer</li>
<li>Via HTTP requests, using the class CommonsHttpSolrServer with a URL (in our case, http://localhost:8983/solr)</li>
</ol>
<p>In what follows, I elected to go with the <a title="JavaDoc for SUSS" href="http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html" target="_blank">StreamingUpdateSolrServer</a> &#8211; which is a subclass of CommonsHttpSolrServer. More comments on that towards the end.</p>
<p><em>I will assume now that the reader has some familiarity with the Map/Reduce programming paradigm.</em> The salient points for us here are that we will use a Map-only job to read through each file in the input that we provide, and index our chosen fields. Taking the path of least resistance, I used the fact that each line of text is its own Key/Value pair if we read the input as TextInputFormat, and I chose to index the following fields:</p>
<ol>
<li>As a unique identifier for each word, I concatenated the filename, line offset (conveniently provided to the Map code as the &#8220;Key&#8221; because we are using TextInputFormat), and the position on that line of the word</li>
<li>The word itself</li>
</ol>
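<p>To make the identifier scheme concrete, here is a small standalone sketch in plain Java, without Hadoop or Solr. It derives a bare file name from a path and builds one identifier per word from the file name, the line offset, and the word&#8217;s position on the line. The class name, method names, and the sample path are hypothetical.</p>
<pre class="code">import java.util.StringTokenizer;

public class WordIds {
  // Strip everything up to the last '/' to get the bare file name.
  static String baseName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  // One id per whitespace-separated word: "fileName lineOffset position".
  static String[] idsForLine(String fileName, long lineOffset, String line) {
    StringTokenizer st = new StringTokenizer(line);
    String[] ids = new String[st.countTokens()];
    int position = 0;
    while (st.hasMoreTokens()) {
      st.nextToken(); // the word itself would go into the "text" field
      ids[position] = fileName + " " + lineOffset + " " + position;
      position++;
    }
    return ids;
  }

  public static void main(String[] args) {
    String file = baseName("/user/demo/shakespeare/comedies/troilusandcressida");
    for (String id : idsForLine(file, 0L, "TROILUS AND CRESSIDA")) {
      System.out.println(id);
    }
  }
}</pre>
<p>For the sample line this prints the identifiers troilusandcressida 0 0, troilusandcressida 0 1 and troilusandcressida 0 2, one per word.</p>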
<p><em>The Solr server obeys field definitions (specifying field names, data types, uniqueness, etc.) as dictated by a schema file.</em> For this example, running Solr as indicated above, the schema is defined by &lt;solr_install_dir&gt;/example/solr/conf/schema.xml</p>
<p>Per my choice to index 2 distinct fields, the relevant fields in the schema are:</p>
<pre class="code">&lt;field name="id" type="string" indexed="true" \
 stored="true" required="true" /&gt;
&lt;field name="text" type="text_general" indexed="true" \
 stored="true" multiValued="true"/&gt;</pre>
<p style="padding-top:10px">Without further ado, then, the code looks like the following:</p>
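<pre class="code">import java.io.IOException;
import java.net.MalformedURLException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexMapper extends MapReduceBase implements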
 Mapper &lt;LongWritable, Text, NullWritable, NullWritable&gt; {
  <strong>private</strong> StreamingUpdateSolrServer server = null;
  <strong>private</strong> SolrInputDocument thisDoc = new SolrInputDocument();
  <strong>private</strong> String fileName;
  <strong>private</strong> StringTokenizer st = null;
  <strong>private</strong> int lineCounter = 0;

  @Override
  <strong>public</strong> <strong>void</strong> configure(JobConf job) {
    String url = "http://localhost:8983/solr";
    fileName = job.get("map.input.file").substring(
      (job.get("map.input.file")).lastIndexOf(
      System.getProperty("file.separator")) +1);
      <strong>try</strong> {
        server = <strong>new</strong> StreamingUpdateSolrServer(url, 100, 5);
      } <strong>catch</strong> (MalformedURLException e) {
        e.printStackTrace();
      }
  }

  @Override
  <strong>public</strong> <strong>void</strong> map(LongWritable key, Text val,
   OutputCollector &lt;NullWritable, NullWritable&gt; output,
   Reporter reporter) <strong>throws</strong> IOException {

    st = <strong>new</strong> StringTokenizer(val.toString());
    lineCounter = 0;
    <strong>while</strong> (st.hasMoreTokens()) {
      thisDoc = <strong>new</strong> SolrInputDocument();
      thisDoc.addField("id", fileName + " "
       + key.toString() + " " + lineCounter++);
      thisDoc.addField("text", st.nextToken());
      <strong>try</strong> {
        server.add(thisDoc);
      } <strong>catch</strong> (SolrServerException e) {
        e.printStackTrace();
      } <strong>catch</strong> (IOException e) {
        e.printStackTrace();
      }
    }
  }

  @Override
  <strong>public</strong> <strong>void</strong> close() <strong>throws</strong> IOException {
    <strong>try</strong> {
      server.commit();
    } <strong>catch</strong> (SolrServerException e) {
      e.printStackTrace();
    }
  }
}</pre>
<p style="padding-top:10px">Compile the code how you see fit (I am old school and still use ant), and the job is ready to run!</p>
<p>To index all of the comedies, you can run the job with the compiled jar file as follows. Note that you must tell hadoop to include an additional Solr jar at runtime:</p>
<pre class="code">[user@localhost SolrTest]$ hadoop jar solrtest.jar \
 -libjars &lt;solr_install_dir&gt;/dist/apache-solr-solrj-3.5.0.jar \
 shakespeare/comedies shakespeare_output</pre>
<p style="padding-top:10px">If you then query the Solr server via the web GUI at http://localhost:8983/solr/admin (the default search is *:*, which works well for a quick test), you should see something like the following:</p>
<pre class="code">&lt;response&gt;
 &lt;lst name="responseHeader"&gt;
  &lt;int name="status"&gt;0&lt;/int&gt;
  &lt;int name="QTime"&gt;35&lt;/int&gt;
  &lt;lst name="params"&gt;
    &lt;str name="indent"&gt;on&lt;/str&gt;
    &lt;str name="start"&gt;0&lt;/str&gt;
    &lt;str name="q"&gt;*:*&lt;/str&gt;
    &lt;str name="version"&gt;2.2&lt;/str&gt;
    &lt;str name="rows"&gt;10&lt;/str&gt;
  &lt;/lst&gt;
 &lt;/lst&gt;
&lt;result name="response" numFound="377452" start="0"&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 0&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;TROILUS&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 1&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;AND&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;
&lt;doc&gt;
 &lt;str name="id"&gt;troilusandcressida 0 2&lt;/str&gt;
 &lt;arr name="text"&gt;
  &lt;str&gt;CRESSIDA&lt;/str&gt;
 &lt;/arr&gt;
&lt;/doc&gt;</pre>
<p>&#8230;</p>
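<p>The same query can also be sent to Solr&#8217;s select handler over HTTP instead of through the admin GUI. The sketch below only builds the request URL, so that it runs without a Solr server. It assumes the example server at http://localhost:8983/solr and passes just the q parameter, leaving start, rows, and indent at their defaults. The class and method names are hypothetical.</p>
<pre class="code">import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
  // Build a select URL with a URL-encoded query string.
  static String selectUrl(String baseUrl, String query)
      throws UnsupportedEncodingException {
    return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
  }

  public static void main(String[] args) throws UnsupportedEncodingException {
    // Match-all query, as used by the admin GUI by default.
    System.out.println(selectUrl("http://localhost:8983/solr", "*:*"));
    // prints http://localhost:8983/solr/select?q=*%3A*
  }
}</pre>
<p>Opening the printed URL in the VM&#8217;s browser, or fetching it with any HTTP client, should return an XML response like the one shown above.</p>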
<h2 style="font-size:13pt">Further Tuning/Investigation Opportunities</h2>
<p><em>Performance Implications of StreamingUpdateSolrServer &#8211; possibility of using EmbeddedSolrServer:</em> What are the optimal tuning parameters for number of threads and batch size when using StreamingUpdateSolrServer? More investigation could be done here. It is also possible to use an EmbeddedSolrServer (per the Rackspace case study in <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1330367237&amp;sr=1-1" target="_blank" title="Hadoop: The Definitive Guide">Hadoop: The Definitive Guide</a>), though it does add some maintenance overhead to create the indexes in a distributed fashion and then later re-combine. I opted to use the StreamingUpdateSolrServer because I believe that it is simpler to get up and running in a small test environment.</p>
<p><em>How to minimize the memory requirements in the Map code:</em> I haven&#8217;t been a full time Java developer in many years, so there are almost certainly things that I&#8217;m missing on how to minimize the memory overhead of the objects used in the Map code. Since this is called for each line in the input, it is critical to make this code as lean as possible. One tip that I came across on this topic is to use (mutable) org.apache.hadoop.io.Text objects rather than (immutable) Strings. I avoided creating any new String objects in this example Map code, but the point is worth noting for other exercises.</p>
<h2 style="font-size:13pt">Resources that I found useful</h2>
<ul>
<li>A great primer that accomplished the indexing via Cascading:<br /> <a href="http://architects.dzone.com/articles/solr-hadoop-big-data-love">http://architects.dzone.com/articles/solr-hadoop-big-data-love</a></li>
<li>Solr Tutorial:<br /> <a href="http://lucene.apache.org/solr/tutorial.html">http://lucene.apache.org/solr/tutorial.html</a></li>
<li>Some sample code for adding, updating, deleting documents on this wiki:<br /> <a href="http://wiki.apache.org/solr/Solrj">http://wiki.apache.org/solr/Solrj</a></li>
<li><a href="http://www.cloudera.com/company/careers/" title="Cloudera Careers">My outstanding coworkers at Cloudera!</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Cloudera Manager 3.7 released</title>
		<link>http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/</link>
		<comments>http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/#comments</comments>
		<pubDate>Tue, 13 Dec 2011 13:00:19 +0000</pubDate>
		<dc:creator>Aparna Ramani</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[cloudera enterprise]]></category>
		<category><![CDATA[cloudera manager]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[installation]]></category>
		<category><![CDATA[scm express]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9906</guid>
		<description><![CDATA[Aparna Ramani is the Director of Engineering for Cloudera Enterprise. Cloudera Manager 3.7, a major new version of Cloudera&#8217;s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription. Cloudera Manager 3.7 includes [...]]]></description>
			<content:encoded><![CDATA[<p><em>Aparna Ramani is the Director of Engineering for Cloudera Enterprise.</em></p>
<p>Cloudera Manager 3.7, a major new version of Cloudera&#8217;s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription.</p>
<p>Cloudera Manager 3.7 includes several new features and enhancements:</p>
<ul>
<li><strong>Automated Hadoop Deployment</strong> &#8211; Cloudera Manager 3.7 allows you to install the complete Hadoop stack in minutes. We&#8217;ve now upgraded Cloudera Manager with the easy installation we first introduced in version 3.6 of SCM Express. (SCM Express is now replaced by Cloudera Manager Free Edition.)</li>
<li><strong>Centralized Management UI </strong>- Version 3.5 of the Cloudera Management Suite included distinct modules for Resource Management, Activity Monitoring and Service and Configuration Management. In Cloudera Manager 3.7, all of these feature sets are now integrated into one centralized browser-based administration console.</li>
<li><strong>Service &amp; Configuration Management</strong> -&#160;We added several new configuration wizards to guide you in properly configuring HDFS and HBase host deployments, adding new hosts on demand, and adding/restarting services as needed. Cloudera Manager 3.7 now also manages Oozie and Hue.</li>
<li><strong>Service Monitoring </strong>&#8211;&#160;Cloudera Manager monitors the health of your key Hadoop services&#8212;HDFS, HBase, MapReduce&#8212;and displays alerts on suspicious or bad health. For example, to determine the health of HDFS, Cloudera Manager measures the percentage of corrupt, missing, or under-replicated blocks. Cloudera Manager also checks if the NameNode is swapping memory or spending too much time in Garbage Collection, and whether HDFS has enough free space. Trends in relevant metrics can be visualized through time-series charts.</li>
<li><strong>Log Search </strong>&#8211;&#160;You can search through all logs for Hadoop services across the whole cluster. You can also view results filtered by service, role, host, search phrase and log event severity.</li>
<li><strong>Events and Alerts</strong> &#8211;<strong> </strong>Cloudera Manager proactively reports on important events such as the change in a service&#8217;s health, detection of a log message of appropriate severity, or the slowness (or failure) of a job. Cloudera Manager aggregates the events for easy filtering and viewing, and you can configure Cloudera Manager to send email alerts.</li>
<li><strong>Global Time Control</strong> &#8211; You can view the state of your system for any time period in the past. Combined with health state, events and log information, this feature serves as a powerful diagnostic tool. </li>
<li><strong>Role-based Administration</strong> -&#160;Cloudera Manager 3.7 supports two types of users: admin users, who can change configs and execute commands and workflows; and read-only users, who can only monitor the system. </li>
<li><strong>Configuration versioning and Audit trails &#8211; </strong>You can view a complete history of configuration&#160;changes with user annotations. You can roll back to previous configuration states.</li>
<li><strong>Activity Monitoring</strong> &#8211; The Activity Monitoring feature includes several performance and scale improvements.</li>
<li><strong>Operational Reports</strong> &#8211;&#160;The &#8216;Resource Manager&#8217; feature in the Cloudera Management Suite 3.5 is now in Cloudera Manager&#8217;s &#8216;Reports&#8217; feature. You can visualize disk usage by user, group, and directory; you can track MapReduce activity on the cluster by job, or by user. </li>
<li><strong>Support Integration</strong> &#8211;&#160;We&#8217;ve improved the Cloudera support experience by adding a feature that lets you send a snapshot of your cluster state to our support team for expedited resolution.</li>
<li><strong>Cloudera Manager Free Edition and 1-click Upgrade</strong> &#8211; The Free Edition of Cloudera Manager includes a subset of the features described above. After you install Cloudera Manager Free Edition, you can easily upgrade to the Enterprise edition by entering a license key. Your data will be preserved as the Cloudera Manager wizard guides you through the upgrade.</li>
</ul>
<p>You can download the new Cloudera Manager 3.7 at: <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads">https://ccp.cloudera.com/display/SUPPORT/Downloads</a> . Check it out. We look forward to your feedback.</p>
<p>P.S. : We&#8217;re hiring! Visit: <a href="http://www.cloudera.com/company/careers">http://www.cloudera.com/company/careers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache Flume &#8211; Architecture of Flume NG</title>
		<link>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/</link>
		<comments>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/#comments</comments>
		<pubDate>Fri, 09 Dec 2011 19:22:27 +0000</pubDate>
		<dc:creator>Arvind Prabhakar</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[flume-ng]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=9884</guid>
		<description><![CDATA[This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can [...]]]></description>
			<content:encoded><![CDATA[<p><em>This blog was originally posted on the Apache Blog: <a href="https://blogs.apache.org/flume/entry/flume_ng_architecture" target="_blank">https://blogs.apache.org/flume/entry/flume_ng_architecture</a></em></p>
<p>Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at <a href="http://incubator.apache.org/flume">http://incubator.apache.org/flume</a>. <em>Flume NG</em> is the work on a new major revision of Flume and is the subject of this post.</p>
<p>Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume was adopted more widely, it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue <a href="https://issues.apache.org/jira/browse/FLUME-728">FLUME-728</a>. This work currently resides on a separate branch named flume-728, and is informally referred to as Flume NG. At the time of writing this post, Flume NG had gone through two internal milestones &#8211; <em>NG Alpha 1</em> and <em>NG Alpha 2</em> &#8211; and a formal incubator release of Flume NG is in the works.</p>
<p>At a high level, Flume NG uses single-hop message delivery guarantees to provide end-to-end reliability for the system. To accomplish this, certain new concepts have been incorporated into its design, while certain other existing concepts have been either redefined, reused or dropped completely.</p>
<p>In this blog post, I will describe the fundamental concepts incorporated in Flume NG and talk about its high-level architecture. This is the first in a series of blog posts by the Flume team that will go into further details of its design and implementation.</p>
<h2>Core Concepts</h2>
<p>The purpose of Flume is to provide a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. The architecture of Flume NG is based on a few concepts that together help achieve this objective. Some of these concepts existed in the previous implementation, but have changed drastically. Here is a summary of the concepts that Flume NG introduces, redefines, or reuses from the earlier implementation:</p>
<ul>
<li><strong>Event:</strong> A byte payload with optional string headers that represents the unit of data that Flume can transport from its point of origination to its final destination.</li>
<li><strong>Flow:</strong> Movement of events from the point of origin to their final destination is considered a data flow, or simply flow. This is not a rigorous definition and is used only at a high level for description purposes. </li>
<li><strong>Client:</strong> An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. For example, Flume Log4j Appender is a client.</li>
<li><strong>Agent: </strong>An independent process that hosts Flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination. </li>
<li><strong>Source:</strong> An interface implementation that can consume events delivered to it via a specific mechanism. For example, an Avro source is a source implementation that can be used to receive Avro events from clients or other agents in the flow. When a source receives an event, it hands it over to one or more channels.</li>
<li><strong>Channel:</strong> A transient store for events. Events are delivered to the channel by sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport. An example of a channel is the JDBC channel, which uses a file-system backed embedded database to persist the events until they are removed by a sink. Channels play an important role in ensuring durability of the flows.</li>
<li><strong>Sink: </strong>An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event&#8217;s final destination. Sinks that transmit the event to its final destination are also known as terminal sinks. The Flume HDFS sink is an example of a terminal sink, whereas the Flume Avro sink is an example of a regular sink that can transmit messages to other agents that are running an Avro source.</li>
</ul>
<p>These concepts help in simplifying the architecture, implementation, configuration and deployment of Flume.</p>
<h2>Flow Pipeline</h2>
<p>A flow in Flume NG starts from the client. The client transmits the event to its next-hop destination. This destination is an agent. More precisely, the destination is a source operating within the agent. The source receiving this event will then deliver it to one or more channels. The channels that receive the event are drained by one or more sinks operating within the same agent. If the sink is a regular sink, it will forward the event to its next-hop destination, which will be another agent. If instead it is a terminal sink, it will forward the event to its final destination. Channels allow the decoupling of sources from sinks using the familiar producer-consumer model of data exchange. This allows sources and sinks to have different performance and runtime characteristics and yet be able to effectively use the physical resources available to the system.</p>
<p>Figure 1 below shows how the various components interact with each other within a flow pipeline.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels." /></p>
<p style="text-align: center"><strong><em>Figure 1:</em></strong><em> Schematic showing logical components in a flow. The arrows represent the direction in which events travel across the system. This also illustrates how flows can fan-out by having one source write the event out to multiple channels.</em></p>
<p>By configuring a source to deliver the event to more than one channel, flows can fan-out to more than one destination. This is illustrated in Figure 1 where the source within the operating Agent writes the event out to two channels &#8211; Channel 1 and Channel 2.</p>
<p>Conversely, flows can be converged by having multiple sources operating within the same agent write to the same channel. An example of the physical layout of a converging flow is shown in Figure 2 below.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/268bf8db-43c7-497b-a0ef-63c482371eef" alt="A simple converging flow on Flume NG." width="500" height="343" /></p>
<p style="text-align: center"><em><strong>Figure 2:</strong> A simple converging flow on Flume NG.</em></p>
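<p>To make the pieces of a flow concrete, here is a minimal Python sketch of one agent whose source fans out to two channels, each drained by its own sink. It is an illustration only and not Flume NG code: the class names <code>Event</code>, <code>Channel</code>, <code>Source</code> and <code>Sink</code> mirror the concepts described above, but their methods (<code>put</code>, <code>take</code>, <code>receive</code>, <code>drain</code>) are assumptions made for this example, and the channels are plain in-memory queues with no durability.</p>
<pre class="code">from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Unit of data: a byte payload plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """Transient store: holds events until a sink removes them."""
    def __init__(self, name):
        self.name = name
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._events.popleft() if self._events else None


class Source:
    """Receives events and hands each one to one or more channels."""
    def __init__(self, channels):
        self.channels = list(channels)

    def receive(self, event):
        for channel in self.channels:  # more than one channel means fan-out
            channel.put(event)


class Sink:
    """Drains one channel and forwards events to a destination callable."""
    def __init__(self, channel, deliver):
        self.channel = channel
        self.deliver = deliver

    def drain(self):
        count = 0
        while True:
            event = self.channel.take()
            if event is None:
                return count
            self.deliver(event)
            count += 1


# One agent: a source fanning out to two channels, each drained by a sink.
ch1, ch2 = Channel("channel-1"), Channel("channel-2")
source = Source([ch1, ch2])
store, next_hop = [], []
terminal_sink = Sink(ch1, store.append)     # stands in for a terminal sink
regular_sink = Sink(ch2, next_hop.append)   # stands in for a next-hop sink

source.receive(Event(b"log line", {"host": "web01"}))
terminal_sink.drain()
regular_sink.drain()
</pre>
<p>After the run, the same event has reached both destinations through separate channels, which is the fan-out of Figure 1. Pointing a second <code>Source</code> at <code>ch1</code> would give the converging layout of Figure 2.</p>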
<h2>Reliability and Failure Handling</h2>
<p>Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. In order for the sending agent to commit its transaction, it must receive a success indication from the receiving agent. The receiving agent only returns a success indication if its own transaction commits properly first. This ensures guaranteed delivery semantics between the hops that the flow makes. Figure 3 below shows a sequence diagram that illustrates the relative scope and duration of the transactions operating within the two interacting agents.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/a15d9347-da9e-4824-b45f-6c00f0720590" alt="Transactional exchange of events between agents." width="500" height="329" /></p>
<p style="text-align: center"><em><strong>Figure 3:</strong> Transactional exchange of events between agents.</em></p>
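<p>The two-transaction handoff can be sketched roughly as follows. This is a simplified, single-process Python model written for this post, not the Flume NG implementation: the <code>Transaction</code> class, the <code>send_one</code> and <code>receive</code> functions and the <code>fail</code> flag are all hypothetical names, and a real hop would involve a network call between the two agents.</p>
<pre class="code">from collections import deque


class Channel:
    def __init__(self):
        self.events = deque()


class Transaction:
    """Tracks takes and puts so they can be committed or rolled back."""
    def __init__(self, channel):
        self.channel = channel
        self.taken = []
        self.staged = []

    def take(self):
        event = self.channel.events.popleft()
        self.taken.append(event)
        return event

    def put(self, event):
        self.staged.append(event)

    def commit(self):
        self.channel.events.extend(self.staged)
        self.taken, self.staged = [], []

    def rollback(self):
        # Return taken events to the front of the channel, in original order.
        self.channel.events.extendleft(reversed(self.taken))
        self.taken, self.staged = [], []


def receive(channel, event, fail=False):
    """Receiving agent: returns True only after its own commit succeeds."""
    tx = Transaction(channel)
    try:
        if fail:
            raise IOError("simulated failure on the receiving agent")
        tx.put(event)
        tx.commit()
        return True
    except IOError:
        tx.rollback()
        return False


def send_one(sender_channel, receiver_channel, fail=False):
    """Sending agent: commits only after the receiver reports success."""
    tx = Transaction(sender_channel)
    event = tx.take()
    if receive(receiver_channel, event, fail):
        tx.commit()
        return True
    tx.rollback()  # the event stays buffered on the sending agent
    return False


agent1, agent2 = Channel(), Channel()
agent1.events.append(b"event-1")
first_try = send_one(agent1, agent2, fail=True)   # hop fails, event is kept
second_try = send_one(agent1, agent2)             # hop succeeds
</pre>
<p>In this model a failed hop leaves the event in the sending agent&#8217;s channel, so it can be retried later; the event leaves the sender only once the receiver has committed it.</p>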
<p>This mechanism also forms the basis for failure handling in Flume NG. When a flow that passes through many different agents encounters a communication failure on any leg of the flow, the affected events start getting buffered at the last unaffected agent in the flow. If the failure is not resolved in time, this may lead to the failure of the last unaffected agent, which would then force the agent before it to start buffering the events. Eventually, if the failure occurs when the client transmits the event to its first-hop destination, the failure will be reported back to the client, which can then allow the application generating the events to take appropriate action.</p>
<p>On the other hand, if the failure is resolved before the first-hop agent fails, the buffered events in the various agents will start draining towards their destination. Eventually the flow will be restored to its original characteristic throughput levels. Figure 4 below illustrates a scenario where a flow comprising two intermediary agents between the client and the central store goes through a transient failure. The failure occurs between agent 2 and the central store, resulting in the events getting buffered at agent 2 itself. Once the failing link has been restored to normal, the buffered events drain out to the central store and the flow is restored to its original throughput characteristics.</p>
<p style="text-align: center"><img class="aligncenter" src="https://blogs.apache.org/flume/mediaresource/ac9d1c83-1089-4730-9546-fe8de509b34c" alt="Failure handling in flows. " width="500" height="352" /></p>
<p style="text-align: center"><em><strong>Figure 4: </strong>Failure handling in flows. In (a) the flow is normal and events can travel from the client to the central store. In (b) a communication failure occurs between Agent 2 and the event store resulting in events being buffered on Agent 2. In (c) the cause of failure was addressed and the flow was restored and any events buffered in Agent 2 were drained to the store.</em></p>
<h2>Wrapping up</h2>
<p>In this post I described the various concepts that are a part of Flume NG and its high-level architecture. This is the first of a series of posts from the Flume team that will highlight the design and implementation of this system. In the meantime, if you need any more information, please feel free to drop an email on the project&#8217;s user or developer lists, or alternatively file the appropriate JIRA issues. Your contribution in any form is welcome on the project.</p>
<h2>Links:</h2>
<p>Project Website: <a target="_blank" href="http://incubator.apache.org/flume/">http://incubator.apache.org/flume/</a></p>
<p>Flume NG Getting Started Guide: <a target="_blank" href="https://cwiki.apache.org/confluence/display/FLUME/Getting+Started">https://cwiki.apache.org/confluence/display/FLUME/Getting+Started</a></p>
<p>Mailing Lists: <a target="_blank" href="http://incubator.apache.org/flume/mail-lists.html">http://incubator.apache.org/flume/mail-lists.html</a></p>
<p>Issue Tracking: <a target="_blank" href="https://issues.apache.org/jira/browse/FLUME">https://issues.apache.org/jira/browse/FLUME</a></p>
<p>IRC Channel: #flume on irc.freenode.net</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hoop &#8211; Hadoop HDFS over HTTP</title>
		<link>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/</link>
		<comments>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/#comments</comments>
		<pubDate>Wed, 20 Jul 2011 20:44:21 +0000</pubDate>
		<dc:creator>Alejandro Abdelnur</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8306</guid>
		<description><![CDATA[What is Hoop? Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S. Hoop can be used to: Access HDFS using HTTP REST. Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues). Access data in a HDFS cluster behind a firewall. The Hoop server [...]]]></description>
			<content:encoded><![CDATA[<h2>What is Hoop?</h2>
<p>Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.</p>
<p>Hoop can be used to:</p>
<div style="margin-left: 20px">
<ul>
<li>Access HDFS using HTTP REST.</li>
<li>Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).</li>
<li>Access data in an HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system that is allowed to go through the firewall.</li>
</ul>
</div>
<p>Hoop has a Hoop client and a Hoop server component:</p>
<div style="margin-left: 20px">
<ul>
<li>The Hoop server component is a REST HTTP gateway to HDFS supporting all file system operations. It can be accessed using standard HTTP tools (e.g. curl and wget), HTTP libraries from different programming languages (e.g. Perl, JavaScript) as well as using the Hoop client. The Hoop server component is a standard Java web-application and it has been implemented using Jersey (JAX-RS).</li>
<li>The Hoop client component is an implementation of Hadoop FileSystem client that allows using the familiar Hadoop filesystem API to access HDFS data through a Hoop server. </li>
</ul>
</div>
<h2>Hoop and Hadoop HDFS Proxy</h2>
<p>Hoop server is a full rewrite of <a href="http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html" target="_about">Hadoop HDFS Proxy</a>. Although it is similar to Hadoop HDFS Proxy (runs in a servlet-container, provides a REST API, pluggable authentication and authorization), Hoop server addresses many of Hadoop HDFS Proxy&#8217;s shortcomings by providing:</p>
<div style="margin-left: 20px">
<ul>
<li>Support for all HDFS operations (read, write, status).</li>
<li>Cleaner HTTP REST API.</li>
<li>JSON format for status data (files status, operations status, error messages).</li>
<li>Kerberos HTTP SPNEGO client/server authentication and pseudo authentication out of the box (using <a href="http://cloudera.github.com/alfredo/docs/latest/index.html">Alfredo</a>).</li>
<li>Hadoop proxy-user support.</li>
<li>Tools such as DistCP could run on either cluster.</li>
</ul>
</div>
<h2>Accessing HDFS files via Hoop using the Unix &#8216;curl&#8217; command</h2>
<p>Assuming Hoop is running on http://hoopbar:14000, the following examples show how the Unix &#8216;curl&#8217; command can be used to access data in HDFS via Hoop using pseudo authentication.</p>
<p>Getting the home directory:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000?op=homedir&amp;user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
{"homeDir":"http:\/\/hoopbar:14000\/user\/babu"}
$</pre>
<p style="padding-top: 8px">Reading a file:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000/user/babu/hello.txt?user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Hello World!
$</pre>
<p style="padding-top: 8px">Writing a file:</p>
<pre class="code" style="padding-left: 30px">$ curl -i -X POST "http://hoopbar:14000/user/babu/data.txt?op=create" --data-binary @mydata.txt --header "content-type: application/octet-stream"
HTTP/1.1 200 OK
Location: http://hoopbar:14000/user/babu/data.txt
Content-Type: application/json
Content-Length: 0
$</pre>
<p style="padding-top: 8px">Listing the contents of a directory:</p>
<pre class="code" style="padding-left: 30px">$ curl -i "http://hoopbar:14000/user/babu?op=list&amp;user.name=babu"
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked

[
  {
    "path" : "http:\/\/hoopbar:14000\/user\/babu\/data.txt",
    "isDir" : false,
    "len" : 966,
    "owner" : "babu",
    "group" : "supergroup",
    "permission" : "-rw-r--r--",
    "accessTime" : 1310671662423,
    "modificationTime" : 1310671662423,
    "blockSize" : 67108864,
    "replication" : 3
  }
]
$</pre>
<p style="padding-top: 8px">Click this link for more details about the <a href="http://cloudera.github.com/hoop/docs/latest/HttpRestApi.html" target="_about">Hoop HTTP REST API</a>.</p>
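<p>The same requests can be issued from any HTTP library. The Python sketch below builds Hoop URLs in the path-then-query form used by the write example above and decodes a JSON directory listing. It is a rough illustration rather than an official client: the helper names <code>hoop_url</code> and <code>list_dir</code> are made up for this example, the host and port are the ones assumed in the curl examples, and only pseudo authentication via <code>user.name</code> is covered.</p>
<pre class="code">import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

HOOP_BASE = "http://hoopbar:14000"  # host and port assumed, as in the curl examples


def hoop_url(path, user, op=None):
    """Build a Hoop URL: HDFS path first, then the query string."""
    params = {}
    if op is not None:
        params["op"] = op
    params["user.name"] = user  # pseudo authentication
    return HOOP_BASE + quote(path) + "?" + urlencode(params)


def list_dir(path, user):
    """Fetch a directory listing and decode the JSON body (needs a live server)."""
    with urlopen(hoop_url(path, user, op="list")) as response:
        return json.loads(response.read().decode("utf-8"))


# Building URLs needs no server; list_dir() is only defined here, not called.
list_url = hoop_url("/user/babu", "babu", op="list")
read_url = hoop_url("/user/babu/hello.txt", "babu")
</pre>
<p>Against a running Hoop server, <code>list_dir("/user/babu", "babu")</code> would be expected to return a list of dictionaries with keys such as <code>path</code>, <code>isDir</code> and <code>len</code>, matching the listing shown above.</p>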
<h2>Getting Hoop</h2>
<p>Hoop is distributed under the Apache License 2.0.</p>
<p>The source code is available at <a href="http://github.com/cloudera/hoop" target="_about">http://github.com/cloudera/hoop</a>.</p>
<p>Instructions on how to build, install and configure Hoop server and the rest of the&#160;documentation are available at&#160;<a href="http://cloudera.github.com/hoop" target="_about">http://cloudera.github.com/hoop</a>.</p>
<h2>Contributing Hoop to Apache Hadoop</h2>
<p>The goal is to contribute Hoop to Apache Hadoop as the next generation of Hadoop HDFS proxy. We are just waiting on the Mavenization of Hadoop Common and Hadoop HDFS which will make integration easier.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Data Interoperability with Apache Avro</title>
		<link>http://www.cloudera.com/blog/2011/07/avro-data-interop/</link>
		<comments>http://www.cloudera.com/blog/2011/07/avro-data-interop/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 19:13:37 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Avro]]></category>
		<category><![CDATA[Flume]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[sqoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=8075</guid>
		<description><![CDATA[The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components. Data collected by Flume might be analyzed by Pig and Hive scripts. Data imported with Sqoop might be processed by [...]]]></description>
			<content:encoded><![CDATA[<p>The ecosystem around Apache Hadoop has grown at a tremendous rate. Folks now can use many different pieces of software to process their large data sets, and most choose to use several of these components.  Data collected by Flume might be analyzed by Pig and Hive scripts.  Data imported with Sqoop might be processed by a MapReduce program.  To facilitate these and other scenarios, data produced by each component must be readily consumed by other components.</p>
<h1>Data Interoperability</h1>
<p>One might address this data interoperability in a variety of manners, including the following:</p>
<ul>
<li>Each system might be extended to read all the formats generated by the other systems.  In the limit, this approach is not practical, since one cannot easily anticipate all of the formats new systems might generate.</li>
<li>A library of data conversion programs could be assembled. This would unfortunately add a processing step, to convert the data between formats, slowing processing pipelines.  Note however that many data conversion libraries operate by converting data into and out of a <em>lingua franca</em> format, using a single format as a pivot point. &#160;This hints at a third possibility.</li>
<li>Enable each system to read and write a common format. &#160;Some systems might use other formats internally for performance, but whenever data is meant to be accessible to other systems a common format is used.</li>
</ul>
<p>In practice all of these strategies will be used to some extent.  However, the last strategy, a common format, seems to offer the most efficient path both in terms of engineering effort and processing time.  This article will focus on the use of Avro&#8217;s data file format as such a common format.</p>
<h1>Avro</h1>
<p>Apache&#160;<a href="http://avro.apache.org/">Avro</a> is a data serialization format.  Avro shares many features with Google&#8217;s Protocol Buffers and Apache Thrift, including:</p>
<ul>
<li>Rich data types.</li>
<li>Fast, compact serialization.</li>
<li>Support for many programming languages.</li>
<li>Datatype evolution, also known as&#160;<em>versioning.</em></li>
</ul>
<p>Avro additionally provides some other features that are especially useful when storing data, namely:</p>
<ul>
<li>Avro defines a standard file format.  Avro data files are self-describing, containing the full schema for the data in the file.  Thus users can exchange Avro data files without also having to separately communicate metadata. &#160;Once an Avro data file is written, one will always be able to read it, with full datatype information, without relying on any external software or metadata repository. &#160;Avro data files also support compression, using Gzip or <a href="http://code.google.com/p/snappy/">Snappy</a> codecs. </li>
<li>Avro&#8217;s serialization is more compact.  Avro avoids storing a field identifier with each field value.  For some datasets this savings can be significant. </li>
<li>Avro implementations permit one to dynamically define new datatypes and to easily process previously unseen datatypes, without generation and loading of code.  This provides natural support for script and query languages. </li>
<li>Avro datatypes can define their sort-order, facilitating use of Avro data in MapReduce or ordered key/value stores. </li>
</ul>
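<p>Avro schemas are themselves written in JSON, which is what makes it practical to embed the full schema in every data file. As a small illustration, the Python sketch below builds a record schema using only the standard <code>json</code> module. The record name <code>PageView</code>, its namespace and its fields are invented for this example; the <code>order</code> attribute is how a field declares its role in sorting, as mentioned in the last point above.</p>
<pre class="code">import json

# Hypothetical record schema for illustration; Avro schemas are plain JSON.
schema = {
    "type": "record",
    "name": "PageView",
    "namespace": "example.logs",
    "fields": [
        # "order" controls how this field participates in record sorting.
        {"name": "timestamp", "type": "long", "order": "descending"},
        {"name": "url", "type": "string"},
        # A union with "null" plus a default makes the field optional.
        {"name": "referrer", "type": ["null", "string"], "default": None},
        # Fields marked "ignore" are skipped when records are compared.
        {"name": "session", "type": "bytes", "order": "ignore"},
    ],
}

schema_json = json.dumps(schema, indent=2)
field_names = [f["name"] for f in schema["fields"]]
</pre>
<p>Because field names and types live in the schema rather than alongside each value, the serialized records only need to carry the values themselves, in the order the schema lists them.</p>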
<h1>Avro as a Common Format</h1>
<p>Most of the major ecosystem components already support, or will soon support, reading and writing Avro data files:</p>
<ul>
<li>MapReduce: I added support for Java MapReduce programs, <a href="http://s.apache.org/o6">included</a> in Avro 1.4 and greater.</li>
<li><a href="http://hadoop.apache.org/common/docs/current/streaming.html">Streaming</a>: Tom White from Cloudera has added support for Hadoop Streaming programs to Avro (<a href="https://issues.apache.org/jira/browse/AVRO-808">AVRO-808</a> &amp;&#160;<a href="https://issues.apache.org/jira/browse/AVRO-830">AVRO-830</a>).</li>
<li><a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/">Flume</a> 0.9.2 and above support collecting data in Avro&#8217;s format (<a href="https://issues.apache.org/jira/browse/FLUME-133">FLUME-133</a>), contributed by Jon Hsieh of Cloudera. &#160;Note also that Flume has recently been accepted into the Apache Incubator and will soon be known as Apache Flume.</li>
<li><a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> 1.3 can import data as Avro data files in HDFS from a relational database (<a href="https://issues.cloudera.org/browse/SQOOP-207">SQOOP-207</a>), contributed by Tom White of Cloudera. &#160;Sqoop has also recently been accepted into the Apache Incubator.</li>
<li><a href="http://pig.apache.org/">Pig</a> release 0.9 will be able to read and write Avro data files (<a href="https://issues.apache.org/jira/browse/PIG-1748">PIG-1748</a>), thanks to Lin Guo and Jakob Homan at LinkedIn. </li>
<li><a href="http://hive.apache.org/">Hive</a> support for reading and writing Avro data files has been <a href="https://github.com/jghoman/haivvreo#readme">posted</a> by Jakob Homan of LinkedIn, and should hopefully be included in Hive 0.9 (<a href="https://issues.apache.org/jira/browse/HIVE-895">HIVE-895</a>). </li>
<li><a href="http://incubator.apache.org/hcatalog/">HCatalog</a> input and output drivers have been contributed by Tom White of Cloudera (<a href="https://issues.apache.org/jira/browse/HCATALOG-49">HCATALOG-49</a>).</li>
<li>Thiruvalluvan M. G.&#160;from Yahoo! is working on a column-major format for Avro, which would accelerate Hive and Pig queries (<a href="https://issues.apache.org/jira/browse/AVRO-806">AVRO-806</a>).</li>
</ul>
<p>For folks who are currently using Protocol Buffers or Thrift to store data, some tools for conversion are planned:</p>
<ul>
<li>Raghu Angadi from Twitter is working on tools that will let folks read and write their Thrift-defined data structures as Avro format data (<a href="https://issues.apache.org/jira/browse/AVRO-804">AVRO-804</a>).</li>
<li>We also hope to soon add tools to convert between Protocol Buffers and Avro (<a href="https://issues.apache.org/jira/browse/AVRO-805">AVRO-805</a>).</li>
</ul>
<p>At Cloudera we&#8217;re committed to helping Avro become a common format for the Hadoop ecosystem. &#160;It&#8217;s great to see so many other companies and individuals also investing in Avro.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/07/avro-data-interop/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Some News Related to the Apache Hadoop Project</title>
		<link>http://www.cloudera.com/blog/2011/02/some-news-related-to-the-apache-hadoop-project/</link>
		<comments>http://www.cloudera.com/blog/2011/02/some-news-related-to-the-apache-hadoop-project/#comments</comments>
		<pubDate>Wed, 02 Feb 2011 17:44:30 +0000</pubDate>
		<dc:creator>Charles Zedlewski</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[cloudera's distribution for hadoop]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=6315</guid>
		<description><![CDATA[In an announcement on its blog, Yahoo! recently announced plans to stop distributing its own version of Hadoop, and instead to re-focus on improving Apache&#8217;s Hadoop releases. This is great news. Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others, but may not [...]]]></description>
			<content:encoded><![CDATA[<p>In an <a href="http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/" target="_blank">announcement on its blog</a>, Yahoo! recently announced plans to stop distributing its own version of Hadoop, and instead to re-focus on improving Apache&#8217;s Hadoop releases. This is great news. Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others, but may not yet be collectively available in a single Apache release. Different teams working on enhancements have made their changes to distinct branches off of old releases. Collecting that work into a single source code package and building a system with the best quality and feature set has been hard work.</p>
<p>New users of Hadoop have generally found this assembly work to be too much trouble. To solve that problem, Cloudera currently distributes a patched version of Apache Hadoop, assembling work from Yahoo!, Cloudera, Facebook and others that has been committed to the Apache project, but not necessarily collectively available in one Apache release.</p>
<p>The Apache Hadoop project contains MapReduce, HDFS and Common. Cloudera packages these along with a number of complementary open-source projects &#8212; Apache HBase, Apache Pig, Apache Hive, Apache Zookeeper, Oozie, Flume, Hue, and others &#8212; that provide useful services for data management, access and use. Right now, HDFS, MapReduce and Common &#8212; the Apache Hadoop packages &#8212; are the only packages that we have to ship with a large collection of patches.</p>
<p>You can think of Apache Hadoop as similar to the Linux kernel: the heart of a larger system. In that case, Cloudera acts like Red Hat or Canonical, providing a complete platform that includes both the kernel and the most popular higher-level packages. We assemble &amp; test the combined components, package them for easy installation, certify the integration of complementary systems and provide a predictable release schedule so users can plan upgrades and updates. Cloudera&#8217;s Distribution for Apache Hadoop is this larger package. It exists to make the power of Hadoop easily available to a larger audience of users.</p>
<p>We thank Yahoo! for its renewed efforts to make Apache Hadoop releases the very best versions of Hadoop. A more robust and powerful kernel makes the entire ecosystem stronger. One of the strengths of the Apache Hadoop ecosystem has been the collective contributions of many organizations and individuals, which have added up to hundreds of person-years of engineering investment. That investment dwarfs what any single organization or proprietary vendor could muster, and this explains the strength and sophistication of the overall system.</p>
<p>Yahoo!&#8217;s commitment to open source development of Hadoop dates to the creation of the project. By concentrating its efforts on the Apache repository, Yahoo! makes a meaningful contribution to everyone in the Apache Hadoop community. &#160;We very much hope that the larger Hadoop community will continue to work in the same way, working together to create excellent Apache releases that everyone can use. Certainly, Cloudera and our customers will benefit from high-quality releases from Apache that require minimal patching for production deployment. We believe that everyone else will, too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/02/some-news-related-to-the-apache-hadoop-project/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Setting up CDH3 Hadoop on my new Macbook Pro</title>
		<link>http://www.cloudera.com/blog/2011/01/setting-up-cdh3-hadoop-on-my-new-macbook-pro/</link>
		<comments>http://www.cloudera.com/blog/2011/01/setting-up-cdh3-hadoop-on-my-new-macbook-pro/#comments</comments>
		<pubDate>Mon, 10 Jan 2011 14:00:33 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[community]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdh3b2]]></category>
		<category><![CDATA[mac]]></category>
		<category><![CDATA[mac osx]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5957</guid>
		<description><![CDATA[This is a guest re-post courtesy of Arun Jacob, Data Architect at Disney, prior to that he was an engineer at RichRelevance and Evri. For the last couple of years, Arun has been focused on data mining/information extraction, using a mix of custom and open source technologies. A New Machine I&#8217;m fortunate enough to have [...]]]></description>
			<content:encoded><![CDATA[<p><em><span style="color: #797979;">This is a guest re-post courtesy of Arun Jacob, Data Architect at Disney; prior to that he was an engineer at RichRelevance and Evri. For the last couple of years, Arun has been focused on data mining/information extraction, using a mix of custom and open source technologies.</span></em></p>
<h2><span style="font-size: large;">A New Machine </span></h2>
<p>
<div style="margin: 0px;">I&#8217;m fortunate enough to have recently received a Macbook Pro, 2.8 GHz Intel dual core, with 8GB RAM. This is the third time I&#8217;ve turned a vanilla mac into a ninja coding machine, and following my design principle of &#8220;first time = coincidence, second time = annoying, third time = pattern&#8221;, I&#8217;ve decided to write down the details for the next time.</div>
</p>
<h2><span style="font-size: large;">Baseline</span></h2>
<p>This section details the pre-hadoop installs I did.</p>
<p><strong>Java</strong></p>
<p>Previously I was running on Leopard, i.e. 10.5, and had to install <a href="http://landonf.bikemonkey.org/static/soylatte/">soylatte</a> to get the latest version of Java. In Snow Leopard, Java JDK 1.6.0_22 is installed by default. That&#8217;s good enough for me, for now.</p>
<p><strong>Gcc, etc</strong>.</p>
<p>In order to get these on the box, I had to <a href="http://developer.apple.com/technologies/xcode.html">install Xcode</a>, making sure to check the UNIX development tools option.</p>
<p><strong>MacPorts</strong></p>
<p>I installed <a href="http://www.macports.org/">MacPorts</a> in case I needed to upgrade any native libs or tools.</p>
<p><strong>Eclipse</strong></p>
<p>I downloaded the <a href="http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/helios/SR1/eclipse-jee-helios-SR1-macosx-cocoa-x86_64.tar.gz">64 bit Java EE version of Helios</a>.</p>
<p><strong>Tomcat</strong></p>
<p>Tomcat is part of my daily fun, and <a href="http://www.malisphoto.com/tips/tomcatonosx.html">these instructions to install tomcat6</a> were helpful. One thing to note is that in order to access the tomcat manager panel, you also need to specify the following in <code>tomcat-users.xml</code>:</p>
<p>
<pre class="code">&lt;role rolename="manager"/&gt;
</pre>
<p> <br />
prior to defining </p>
<pre class="code">&lt;user username="admin" password="password" roles="standard,manager,admin"/&gt;
</pre>
</p>
<p>Also, I run tomcat standalone (no httpd), so the mod_jk install part didn&#8217;t apply. Finally, I chose not to daemonize tomcat because this is a dev box, not a server, and the instructions for compiling and using <a href="http://commons.apache.org/daemon/jsvc.html">jsvc</a> for 64-bit sounded iffy at best.</p>
<h2><span style="font-size: large;">Hadoop</span></h2>
<p>I use the <a href="http://www.cloudera.com/hadoop/">CDH</a> distro. The install was amazingly easy, and their support rocks. Unfortunately, they don&#8217;t have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo-distributed Mac node. This is what I want my Mac to have (for starters):</p>
<ol>
<li><b>1.</b> distinct processes for namenode, job tracker node, and datanode/task tracker nodes.</li>
<li><b>2.</b> formatted HDFS</li>
<li><b>3.</b> Pig 0.8.0</li>
</ol>
<p>I&#8217;m not going to try to auto start hadoop because (again) this is a dev box, and start-all.sh should handle bringing up the JVMs around namenode, job tracker, datanode/tasktracker.</p>
<p>I am installing CDH3 because I&#8217;ve been running it in <a href="https://wiki.cloudera.com/display/DOC/CDH3+Deployment+in+Pseudo-Distributed+Mode">pseudo-mode</a> on my Ubuntu dev box for the last month and have had no issues with it. Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.</p>
<p>All of the CDH3 tarballs can be found at http://archive.cloudera.com/cdh/3/, and damn, that&#8217;s a lot of tarballs.</p>
<p>I downloaded <a href="http://archive.cloudera.com/cdh/3/hadoop20-0.20.2+737.releasenotes.html">hadoop 0.20.2+737</a>, which is (currently) the latest version out there. Because this is my new dev box, I decided to forgo the usual security-motivated setup of the hadoop user. When this decision comes back to bite me, I&#8217;ll be sure to update this post. In fact, for ease of permissions etc., I decided to install under my home dir, in a CDH3 dir, so I could group all CDH3-related installs together. I symlinked the hadoop-0.20.2+737 dir to hadoop, and I&#8217;ll update it if CDH3 updates their version of hadoop.</p>
<p>After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.</p>
<p><strong>Pseudo Mode Config</strong></p>
<p>I&#8217;m going to set up Hadoop in pseudo-distributed mode, just like I have on my Ubuntu box. Unlike Debian/Red Hat CDH distros, where this is an apt-get or yum command, I need to set up conf files on my own.</p>
<p>Fortunately the example-confs subdir of the Hadoop install has a conf.pseudo subdir. I needed to modify the following in core-site.xml:</p>
<p>
<pre class="code">&nbsp;&lt;property&gt; 
&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;hadoop.tmp.dir&lt;/name&gt; 
&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;<i><b>changed_to_a_valid_dir_I_own</b></i>&lt;/value&gt; 
&nbsp;&lt;/property&gt;</pre>
<p> <br />
and the following in hdfs-site.xml:</p>
<pre class="code">&nbsp;&lt;property&gt; 
&nbsp;&nbsp;&nbsp;&nbsp; &lt;!-- specify this so that running 'hadoop namenode -format' formats the right dir --&gt; 
&nbsp;&nbsp;&nbsp;&nbsp; &lt;name&gt;dfs.name.dir&lt;/name&gt; 
&nbsp;&nbsp;&nbsp;&nbsp; &lt;value&gt;<i><b>changed_to_a_different_dir_I_own</b></i>&lt;/value&gt; 
&nbsp; &lt;/property&gt; </pre>
</p>
<p>Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf:</p>
<p>
<pre class="code">mv ./conf install-conf
ln -sf ./example-confs/conf.pseudo conf
</pre>
</p>
<h2><span style="font-size: large;">Pig</span></h2>
<p>Installing Pig is as simple as downloading the tar, setting the path up, and going, sort of. The first time I ran pig, it tried to connect to the default install location of hadoop, /usr/lib/hadoop-0.20/. I made sure to set HADOOP_HOME to point to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).</p>
<h2><span style="font-size: large;">More To Come</span></h2>
<p>This pseudo-distributed node install was relatively painless. I&#8217;m going to continue to install Hadoop/HDFS based tools that may need more (HBase) or less (Hive) configuration, and update in successive posts.</p>
<div class="post-footer">
<div class="post-footer-line post-footer-line-1"><span class="post-author vcard"><em><br />
 Written by<br />
 <span class="fn">Arun Jacob</span><br />
 </em></span></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/setting-up-cdh3-hadoop-on-my-new-macbook-pro/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Configuring Security Features in CDH3</title>
		<link>http://www.cloudera.com/blog/2011/01/configuring-security-features-in-cdh3/</link>
		<comments>http://www.cloudera.com/blog/2011/01/configuring-security-features-in-cdh3/#comments</comments>
		<pubDate>Fri, 07 Jan 2011 14:00:56 +0000</pubDate>
		<dc:creator>Jon Zuanich</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[apache hadoop]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdh security]]></category>
		<category><![CDATA[cdh3b2]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[cloudera's distribution for hadoop]]></category>
		<category><![CDATA[hadoop security]]></category>
		<category><![CDATA[kerbos]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5948</guid>
		<description><![CDATA[Post written by Cloudera Software Engineer Aaron T. Myers. Apache Hadoop has had methods of doing user authorization for some time. The Hadoop Distributed File System (HDFS) has a permissions model similar to Unix to control file and directory access, and MapReduce has access control lists (ACLs) per job queue to control which users may [...]]]></description>
			<content:encoded><![CDATA[<p><strong><em>Post written by Cloudera Software Engineer Aaron T. Myers.</em></strong></p>
<p>Apache Hadoop has had methods of doing user authorization for some time. The Hadoop Distributed File System (HDFS) has a permissions model similar to Unix to control file and directory access, and MapReduce has access control lists (ACLs) per job queue to control which users may submit jobs. These authorization schemes allow <a href="http://www.cloudera.com/hadoop/"><span style="color: #505050;">Hadoop</span></a> users and administrators to specify exactly who may access Hadoop&#8217;s resources. However, until recently, these mechanisms relied on a fundamentally insecure method of identifying the user who is interacting with Hadoop. That is, Hadoop had no way of performing reliable authentication. This limitation meant that any authorization system built on top of Hadoop, while helpful to prevent accidental unwanted access, could do nothing to prevent malicious users from accessing other users&#8217; data.</p>
<p>Prior to the availability of Hadoop&#8217;s security features, the only way an organization could meet the requirement for data access protection was to run multiple distinct <a href="http://www.cloudera.com/hadoop/"><span style="color: #505050;">Hadoop</span></a> clusters, and to segregate the groups who have network access to these clusters. This has obvious cost effectiveness implications, but, more importantly, limits the flexibility an organization has with respect to data storage options. One of the inherent powers of Hadoop is the ability to store and correlate all of an organization&#8217;s data. This is impossible if one must a priori relegate data to multiple distinct clusters based on security requirements. Furthermore, because of some organizations&#8217; internal security policies, certain types of data could not be stored in Hadoop at all.</p>
<p>While this was acceptable for many of the first organizations to leverage Hadoop, the increase in Hadoop&#8217;s popularity and penetration into traditional enterprises necessitated the addition of better authentication mechanisms.</p>
<p>Among many of the new features introduced as part of <a href="http://www.cloudera.com/blog/2010/10/cdh3-beta-3-now-available/">CDH3 Beta 3</a>, Hadoop now has the ability to provide strong authentication guarantees. The core Hadoop security work was done almost completely by Yahoo! and subsequently contributed to Apache Hadoop. Rather than create an ad hoc Hadoop-specific authentication scheme, Hadoop&#8217;s authentication system leverages Kerberos. Kerberos is an industry-standard authentication system developed by MIT which has been in existence since 1989. There are multiple open source implementations of Kerberos, including one produced and maintained by MIT itself. Kerberos is also the authentication system underpinning many proprietary identity management systems commonly found in enterprise environments, including Microsoft&#8217;s Active Directory. Hadoop&#8217;s support of Kerberos enables organizations to seamlessly integrate the new authentication features of Hadoop with their existing authentication and single sign-on systems.</p>
<p>All of the <a href="http://www.cloudera.com/community/">components of CDH3 Beta 3</a> now have support for interacting with secure Hadoop clusters, and many have incorporated additional security features which were previously impossible or impractical to implement with the security limitations inherent in Hadoop itself. Because of the complexity of integrating with multiple third-party authentication systems, configuring Hadoop and its associated components to use these systems is non-trivial.</p>
<p><a href="http://www.cloudera.com/"><span style="color: #505050;">Cloudera</span></a> is pleased to announce the general availability of Cloudera&#8217;s <a href="https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide">&#8220;CDH3 Security Guide&#8221;</a>. In this comprehensive guide, you&#8217;ll find <a href="https://wiki.cloudera.com/display/DOC/Configuring+Hadoop+Security+in+CDH3+Beta+3">instructions for enabling the security features of Hadoop itself</a>, as well as for configuring all of the other components of CDH to be able to interact with a Hadoop cluster with security enabled. You&#8217;ll also find a <a href="https://wiki.cloudera.com/display/DOC/Appendix+A+-+Troubleshooting">troubleshooting guide</a> for debugging common errors encountered when configuring a secure Hadoop environment, as well as details for <a href="https://wiki.cloudera.com/display/DOC/Integrating+Hadoop+Security+with+Active+Directory">configuring Hadoop&#8217;s authentication mechanism to use Active Directory</a>. Please email <a href="mailto:cdh-user@cloudera.org">cdh-user@cloudera.org</a> if you have any questions or encounter any issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2011/01/configuring-security-features-in-cdh3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New Features in Apache Pig 0.8</title>
		<link>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/</link>
		<comments>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/#comments</comments>
		<pubDate>Tue, 21 Dec 2010 14:34:58 +0000</pubDate>
		<dc:creator>John Kreisa</dc:creator>
				<category><![CDATA[pig]]></category>
		<category><![CDATA[#cdh3]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[CDH]]></category>
		<category><![CDATA[cdh3b2]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/?p=5691</guid>
		<description><![CDATA[This is a guest post contributed by Dmitriy Ryaboy (@squarecog) and was originally published in his blog on December 19th. We thought&#160;the information&#160;was&#160;valuable enough&#160;that it was worth&#160;reposting to spread the word even further.&#160; The Pig 0.8 release includes a large number of bug fixes and optimizations, but at the core it is a feature release. [...]]]></description>
			<content:encoded><![CDATA[<h2 id="post-104">This is a guest post contributed by Dmitriy Ryaboy (@squarecog) and was originally published in his <a href="http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/" target="_blank">blog</a> on December 19th. We thought the information was valuable enough that it was worth reposting to spread the word even further.</h2>
<div>
<p>The Pig 0.8 release includes a large number of bug fixes and optimizations, but at the core it is a feature release. It&#8217;s been in the works for almost a full year (most of the work on 0.7 was completed by January of 2010, although it took a while to actually get the release out), and the amount of time spent on 0.8 really shows.</p>
<p>I <a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/">meant</a> to describe these in detail in a series of posts, but it seems blogging regularly is not my forte. This release is so chock-full of great new features, however, that I feel compelled to at least list them. So, behold, in no particular order, a non-exhaustive list of new features I am excited about in Pig 0.8:</p>
<li><strong>Support for UDFs in scripting languages</strong></li>
<p>This is exactly what it sounds like &#8212; if your favorite language has a JVM implementation, it can be used to create Pig UDFs.</p>
<p>Pig now ships with support for UDFs in Jython, but other languages can be supported by implementing a few interfaces. Details about the Pig UDFs in Python can be found here: <a href="http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs">http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs</a></p>
<p>This is the outcome of <a href="http://issues.apache.org/jira/browse/PIG-928">PIG-928</a>; it was quite a pleasure to watch this develop over time &#8212; while most Pig tickets wind up getting worked on by at most one or two people, this turned into a collaboration of quite a few developers, many of them new to the project &#8212; Kishore Gopalakrishna&#8217;s patch was the initial conversation starter, which was then hacked on or merged into similar work by Woody Anderson, Arnab Nandi, Julien Le Dem, Ashutosh Chauhan and Aniket Mokashi (Aniket deserves an extra shout-out for patiently working to incorporate everyone&#8217;s feedback and pushing the patch through the last mile).</p>
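<p>As a rough illustration of what such a UDF can look like, here is a minimal Jython sketch. The <code>@outputSchema</code> decorator is the one described in the UDF documentation linked above; the function name <code>normalize_color</code>, the file name <code>myudfs.py</code>, and the fallback decorator (included only so the file also runs under plain Python) are assumptions made for this example.</p>
<pre># myudfs.py (hypothetical file name)
# Pig supplies the outputSchema decorator when it loads the script.
# The fallback below is an assumption of this sketch so that the file
# can also be imported and tested outside of Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("eye_color:chararray")
def normalize_color(raw):
    # Null fields arrive as None; pass them through unchanged.
    if raw is None:
        return None
    return raw.strip().lower()</pre>
<p>Assuming that file name, the script would be registered with something like <code>REGISTER 'myudfs.py' USING jython AS myfuncs;</code> and called as <code>myfuncs.normalize_color(eye_color)</code>.</p>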
<li><strong>PigUnit</strong></li>
<p>A contribution by Romain Rigaux, PigUnit is exactly what it sounds like &#8212; a tool that simplifies the Pig users&#8217; lives by giving them a simple way to unit test Pig scripts.</p>
<p>The documentation at <a href="http://pig.apache.org/docs/r0.8.0/pigunit.html">http://pig.apache.org/docs/r0.8.0/pigunit.html</a> and the code at <a href="http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java?view=markup">http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java?view=markup</a> speak for themselves as far as usage goes.</p>
<li><strong>PigStats</strong></li>
<p>Pig can now provide much better visibility into what is going on inside a Pig job than it ever did before, thanks to extensive work by Richard Ding (see <a href="http://issues.apache.org/jira/browse/PIG-1333">PIG-1333</a> and <a href="http://issues.apache.org/jira/browse/PIG-1478">PIG-1478</a>). This feature comes in three parts:</p>
<p>1. Script statistics.<br />
This is the most easily visible change. At the end of running a script, Pig will output a table with some basic statistics regarding the jobs that it ran. It looks something like this:</p>
<p>Job Stats (time in seconds):</p>
<table>
<tbody>
<tr>
<td>JobId</td>
<td>Maps</td>
<td>Reduces</td>
<td>Max<br />
Map<br />
Time</td>
<td>Min<br />
Map<br />
Time</td>
<td>Avg<br />
Map<br />
Time</td>
<td>Max<br />
Reduce<br />
Time</td>
<td>Min<br />
Reduce<br />
Time</td>
<td>Avg<br />
Reduce<br />
Time</td>
<td>Alias</td>
<td>Feature</td>
<td>Outputs</td>
</tr>
<tr>
<td>job_xxx</td>
<td>1654</td>
<td>218</td>
<td>84</td>
<td>6</td>
<td>14</td>
<td>107</td>
<td>87</td>
<td>99</td>
<td>counted_data,<br />
data,<br />
grouped_data</td>
<td>GROUP_BY,<br />
COMBINER</td>
<td>&#160;</td>
</tr>
<tr>
<td>job_xxx</td>
<td>2</td>
<td>1</td>
<td>9</td>
<td>6</td>
<td>7</td>
<td>13</td>
<td>13</td>
<td>13</td>
<td>ordered_data</td>
<td>SAMPLER</td>
<td>&#160;</td>
</tr>
<tr>
<td>job_xxx</td>
<td>2</td>
<td>1</td>
<td>26</td>
<td>18</td>
<td>22</td>
<td>31</td>
<td>31</td>
<td>31</td>
<td>ordered_data</td>
<td>ORDER_BY</td>
<td>hdfs://tmp/out,</td>
</tr>
</tbody>
</table>
<p>This is extremely useful when debugging slow jobs, as you can immediately identify which stages of your script are slow, and correlate the slow Map-Reduce jobs with the actual Pig operators and relations in your script &#8212; something that was not trivial before (folks often resorted to setting parallelism to slightly different numbers for every join and group just to figure out which job was doing what. No more of this!)</p>
<p>2. Data in Job XML</p>
<p>Pig now inserts several interesting properties into the Hadoop jobs that it generates, including the relations being generated, Pig features being used, and ids of parent Hadoop jobs. This is quite helpful when monitoring a cluster, and is also handy when examining job history using the HadoopJobHistoryLoader, now part of piggybank (use Pig to mine your job history!).</p>
<p>3. PigRunner API</p>
<p>The same information that is printed out when Pig runs the script from a command line is available if one uses the Java API to start Pig jobs. If you start a script using <code>PigRunner.run(String[] args, PigProgressNotificationListener listener)</code>, you will get as a result a <a href="http://pig.apache.org/docs/r0.8.0/api/org/apache/pig/tools/pigstats/PigStats.html">PigStats</a> object that gives you access to the job hierarchy, the Hadoop counters from each job, and so on. You can implement the optional <a href="http://pig.apache.org/docs/r0.8.0/api/org/apache/pig/tools/pigstats/PigProgressNotificationListener.html">PigProgressNotificationListener</a> if you want to watch the job as it progresses; the listener will be notified as different component jobs start and finish.</p>
<p>Documentation of the API, new properties in the Job XML, and more, is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Pig+Statistics">http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Pig+Statistics</a></p>
<li><strong>Scalar values</strong></li>
<p>It&#8217;s very common to need a calculated statistic as an input to other calculations. For example, consider a data set that consists of people and their eye color; we want to calculate the fraction of the total population that has a given eye color. The script looks something like this:</p>
<pre>people = LOAD '/data/people' using PigStorage()
  AS (person_id:long, eye_color:chararray);
num_people = FOREACH (group people all)
  GENERATE COUNT(people) AS total;
eye_color_fractions = FOREACH ( GROUP people BY eye_color )
  GENERATE
    group as eye_color,
    COUNT(people) / num_people.total AS fraction;</pre>
<p>&#160;</p>
<p>Pretty straightforward, except it does not work. What&#8217;s happening in the above code is that we are referencing the relation <code>num_people</code> from inside another relation, <code>eye_color_fractions</code>, and this doesn&#8217;t really make sense if Pig does not know that <code>num_people</code> only has one row.</p>
<p>In the past you had to do something hacky like joining the two relations on a constant to replicate the total into each row, and then generate the division. Needless to say, this was not entirely satisfactory. In <a href="http://issues.apache.org/jira/browse/PIG-1434">PIG-1434</a> Aniket Mokashi tackled this, implementing an elegant solution that hides all of these details from the user &#8212; you can now simply cast a single-row relation as a scalar, and use it as desired. The above script becomes:</p>
<pre>people = LOAD '/data/people' using PigStorage()
  AS (person_id:long, eye_color:chararray);
num_people = FOREACH (group people all)
  GENERATE COUNT(people) AS total;
eye_color_fractions = FOREACH ( GROUP people BY eye_color )
  GENERATE
    group as eye_color,
    COUNT(people) / <strong>(double)</strong> num_people.total AS fraction;</pre>
<p>&#160;</p>
<p>This makes the casting explicit, but Pig is now smart enough to do this implicitly as well. A runtime exception is generated if the relation being used as a scalar contains more than one tuple.</p>
<p>More documentation of this feature is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars">http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars</a></p>
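<p>To make the intent of the script concrete, here is the same calculation as a small Python sketch. The sample rows are made up for illustration, and the sketch uses floating-point division so that the fractions are not truncated to whole numbers. The single-row <code>num_people</code> relation corresponds to the one number <code>total</code> that every group divides by.</p>
<pre>from collections import Counter

# Hypothetical sample rows of (person_id, eye_color).
people = [(1, "brown"), (2, "blue"), (3, "brown"), (4, "green")]

# num_people is a single-row result; treating it as a scalar means
# every group below divides by this one number.
total = len(people)

# GROUP people BY eye_color, then COUNT(people) per group.
counts = Counter(color for _, color in people)

# Divide each group count by the scalar total.
eye_color_fractions = dict(
    (color, count / float(total)) for color, count in counts.items()
)

print(eye_color_fractions)</pre>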
<li><strong>Monitored UDFs</strong></li>
<p>A new annotation has been added, <code>@MonitoredUDF</code>, which makes Pig spawn a watcher thread that kills an execution that is taking too long and returns a default value instead. This comes in handy when dealing with certain operations like complex regular expressions. More documentation is available at <a href="http://pig.apache.org/docs/r0.8.0/udf.html#Monitoring+long-running+UDFs">http://pig.apache.org/docs/r0.8.0/udf.html#Monitoring+long-running+UDFs</a></p>
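<p>The watcher idea can be sketched in a few lines of Python. This is only a loose analogy and not how Pig implements the annotation: the helper <code>call_with_timeout</code> and the stand-in function <code>slow_parse</code> are hypothetical names, and unlike Pig the sketch merely stops waiting for the slow call rather than killing it.</p>
<pre>import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(func, arg, seconds, default):
    # Run func(arg) on a worker thread and stop waiting after the limit.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, arg)
    try:
        return future.result(timeout=seconds)
    except TimeoutError:
        # The call took too long: hand back the default value instead.
        return default
    finally:
        # Do not block on the worker; Python cannot kill a running thread.
        pool.shutdown(wait=False)


def slow_parse(text):
    time.sleep(0.5)  # stands in for a pathological regular expression
    return len(text)


print(call_with_timeout(slow_parse, "abc", 0.05, -1))</pre>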
<li><strong>Automatic merge of small files</strong></li>
<p>This is a simple one, but useful &#8212; when running Pig over many small files, instead of creating a map task per file (paying the overhead of scheduling and running a task for a computation that might only take a few seconds), we can merge the inputs and create a few map tasks that are a bit more hefty.</p>
<p>Two properties control this behavior: <code>pig.maxCombinedSplitSize</code> controls the maximum size of the resulting split, and <code>pig.splitCombination</code> controls whether or not the feature is activated in the first place (it is on by default).</p>
<p>This work is documented in the ticket <a href="http://issues.apache.org/jira/browse/PIG-1518">PIG-1518</a>; there are additional details in the release notes attached to the ticket.</p>
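<p>A back-of-the-envelope estimate shows why this matters. The Python sketch below compares one map task per file with the best case after combination; the 1 MB file size and the 128 MB cap are made-up numbers, the function name is hypothetical, and the combined figure is only a lower bound, since the real packing is decided by Pig and may depend on more than sizes.</p>
<pre>import math

MB = 1024 * 1024


def map_task_estimate(file_sizes, max_combined_split_size):
    # Without combination: one map task per input file.
    without = len(file_sizes)
    # With combination: at best, total bytes divided by the split cap,
    # rounded up. Treat this as a lower bound, not a prediction.
    total = sum(file_sizes)
    combined = int(math.ceil(total / float(max_combined_split_size)))
    return without, combined


# 10,000 files of 1 MB each, combined into splits of at most 128 MB.
print(map_task_estimate([MB] * 10000, 128 * MB))</pre>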
<li><strong>Generic UDFs</strong></li>
<p>I <a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/">wrote about this one</a> before &#8212; a small feature that allows you to invoke static Java methods as Pig UDFs without needing to wrap them in custom code.</p>
<p>The official documentation is available at <a href="http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Dynamic+Invokers">http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Dynamic+Invokers</a></p>
<li><strong>Safeguards against missing PARALLEL keyword</strong></li>
<p>One of the more common mistakes people make when writing Pig scripts is forgetting to specify parallelism for operators that need it. Previously, omitting it meant a parallelism of 1, which can lead to extremely inefficient jobs. A patch by Jeff Zhang in <a href="http://issues.apache.org/jira/browse/PIG-1249">PIG-1249</a> changes this behavior to instead use a simple heuristic: if parallelism is not specified, derive the number of reducers by taking <code>MIN(max_reducers, total_input_size / bytes_per_reducer)</code>. Max number of reducers is controlled by the property <code>pig.exec.reducers.max</code> (default 999) and bytes per reducer are controlled by <code>pig.exec.reducers.bytes.per.reducer</code> (default 1GB).</p>
<p>This is a safeguard, not a panacea. It only works with file-based input, and it estimates the number of reducers from the input size rather than the size of the intermediate data, so if you have a highly selective filter, or you are grouping a large dataset by a low-cardinality field, it will produce a bad number. Still, it&#8217;s a nice safeguard against dramatic misconfigurations.</p>
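<p>The heuristic is easy to try out on paper. The Python sketch below applies the formula with the two defaults quoted above; the function name is hypothetical, and rounding the division up and never going below one reducer are assumptions of this sketch rather than something stated in the formula.</p>
<pre>import math

GB = 1024 ** 3


def estimate_reducers(total_input_bytes, max_reducers=999, bytes_per_reducer=GB):
    # total_input_size / bytes_per_reducer, rounded up (an assumption
    # of this sketch) and never below one reducer (also an assumption).
    wanted = int(math.ceil(total_input_bytes / float(bytes_per_reducer)))
    # Cap the result at the configured maximum.
    return min(max_reducers, max(1, wanted))


print(estimate_reducers(50 * GB))    # 50 GB of input
print(estimate_reducers(5000 * GB))  # far more input than the cap allows for</pre>
<p>Note that only the input size goes in, which is exactly why a selective filter or a low-cardinality group key can make the estimate a poor fit.</p>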
<blockquote><p>When porting to Apache Pig 0.8, remember to audit your scripts for parallelized operators that do not specify the <code>PARALLEL</code> keyword &#8212; if the intent is to use a single reducer, make this intent explicit by specifying <code>PARALLEL 1</code>.</p>
</blockquote>
<li><strong>HBaseStorage</strong></li>
<p>HBaseStorage has been shored up in Pig 0.8. It can now read data stored as bytes instead of requiring all numbers to be converted to Strings, and it accepts a number of options: limit the number of rows returned, push down filters on HBase keys, etc. It can also now be used to write to HBase in addition to reading from it. Details about the options, etc, can be found in the Release Notes section of <a href="http://issues.apache.org/jira/browse/PIG-1205">PIG-1205</a>.</p>
<p>Note that at the moment this only works with the HBase 0.20.{4,5,6} releases, and does not work with 0.89+. There is a patch in <a href="http://issues.apache.org/jira/browse/PIG-1680">PIG-1680</a> that you can apply if you need 0.89 and 0.90 compatibility; it is not applied to the main codebase yet, as it is not backwards compatible.</p>
<p>We are very interested in help making this Storage engine more featureful; please feel free to jump in and contribute!</p>
<li><strong>Support for custom Map-Reduce jobs in the flow</strong></li>
<p>Although we try to make these a rarity, sometimes cases come up in which a custom Map-Reduce job fits the bill better than Pig. Weaving a Map-Reduce job into the middle of a Pig workflow was awkward before &#8212; you had to use something like Oozie or Azkaban, or write your own workflow application. Pig 0.8 introduces a simple &#8220;MAPREDUCE&#8221; operator which allows you to invoke an opaque MR job in the middle of the flow, and continue with Pig:</p>
<pre>text = load 'WordcountInput.txt';
wordcount = MAPREDUCE 'wordcount.jar'
  STORE text INTO 'inputDir'
  LOAD 'outputDir' AS (word:chararray, count: int)
  `org.myorg.WordCount inputDir outputDir`;</pre>
<p>&#160;</p>
<p>Details are available on the wiki page: <a href="http://wiki.apache.org/pig/NativeMapReduce">http://wiki.apache.org/pig/NativeMapReduce</a></p>
<p>The ticket for this one has been open for a while, since Pig 0.2 days, and it&#8217;s nice to see it finally implemented. Thumbs up to Aniket Mokashi for this one.</p>
<li><strong>Custom Partitioners</strong></li>
<p>This feature, also implemented by the amazingly productive Aniket Mokashi, is a bit of a power-user thing (and another ancient ticket, PIG-282). It allows the Pig script author to control the function used to distribute map output among reducers. By default, Pig uses a hash partitioner, but sometimes a custom algorithm is required when the script author knows something specific about the reduce key distribution. When that is the case, a user can now specify the Hadoop Partitioner to swap in instead of the default:</p>
<p><code>B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; </code></p>
<p>More specific documentation can be found in the Release Notes section of <a href="http://issues.apache.org/jira/browse/PIG-282">PIG-282</a></p>
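<p>To show the kind of decision a partitioner makes, here is a small Python sketch contrasting a hash-based rule with a custom one. It is an illustration only: in Pig the class named after <code>PARTITION BY</code> is a Hadoop Partitioner, and the function names, the hot key <code>"unknown"</code>, and the use of CRC32 as the hash are all assumptions of this example. The custom rule needs at least two reducers.</p>
<pre>import zlib


def hash_partition(key, num_reducers):
    # Default-style behaviour: spread keys by a hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def hot_key_partition(key, num_reducers, hot_key="unknown"):
    # Hypothetical custom rule: give one known heavy key a reducer of
    # its own and hash everything else over the remaining reducers.
    if key == hot_key:
        return num_reducers - 1
    return zlib.crc32(key.encode("utf-8")) % (num_reducers - 1)


for key in ["brown", "blue", "unknown"]:
    print(key, hash_partition(key, 4), hot_key_partition(key, 4))</pre>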
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2010/12/new-features-in-apache-pig-0-8/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

