<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Configuration Parameters: What can you just ignore?</title>
	<atom:link href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Fri, 10 Feb 2012 20:11:24 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Alan</title>
		<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/comment-page-1/#comment-11815</link>
		<dc:creator>Alan</dc:creator>
		<pubDate>Thu, 06 May 2010 18:57:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=380#comment-11815</guid>
		<description>I cannot get the ulimit change to be permanent in Ubuntu 9.10.  I edited the /etc/security/limits.conf file to contain &quot; hard nofile 16384&quot; then logged off and logged back on.  But &quot;ulimit -a&quot; still shows a nofile limit of 1024.  Any suggestions on how to make this permanent?  (Executing &quot;ulimit -n 16384&quot; did work within its terminal window.)</description>
		<content:encoded><![CDATA[<p>I cannot get the ulimit change to be permanent in Ubuntu 9.10.  I edited the /etc/security/limits.conf file to contain &#8220; hard nofile 16384&#8221; then logged off and logged back on.  But &#8220;ulimit -a&#8221; still shows a nofile limit of 1024.  Any suggestions on how to make this permanent?  (Executing &#8220;ulimit -n 16384&#8221; did work within its terminal window.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: 7 Tips for Improving MapReduce Performance &#187; Cloudera Hadoop &#38; Big Data Blog</title>
		<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/comment-page-1/#comment-8402</link>
		<dc:creator>7 Tips for Improving MapReduce Performance &#187; Cloudera Hadoop &#38; Big Data Blog</dc:creator>
		<pubDate>Thu, 17 Dec 2009 19:19:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=380#comment-8402</guid>
		<description>[...] is to make sure your cluster configuration has been tuned. For starters, check out our earlier blog post on configuration parameters. In addition to those knobs in the Hadoop configuration, here are a few more checklist items you [...]</description>
		<content:encoded><![CDATA[<p>[...] is to make sure your cluster configuration has been tuned. For starters, check out our earlier blog post on configuration parameters. In addition to those knobs in the Hadoop configuration, here are a few more checklist items you [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: aaron</title>
		<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/comment-page-1/#comment-686</link>
		<dc:creator>aaron</dc:creator>
		<pubDate>Mon, 06 Apr 2009 17:33:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=380#comment-686</guid>
		<description>Hi Abhishek,

That&#039;s an interesting problem. I&#039;m not particularly certain of the answer. Looking through the configuration, no settings jump out at me in terms of controlling the combiner process. The mapred.inmem.merge.threshold parameter defaults to 1000. So I think that after 1000 intermediate files get created, they should be merged before presentation to the combiner. I&#039;m not sure why so many millions of files are being created.

When you say that the mapper runs &quot;a billion iterations,&quot; does that mean that for every input record to the mapper, you generate a billion (k, v) pairs as output? How many input records were you processing? I&#039;m not sure Hadoop was really designed for a 1,000,000,000:1 fanout ratio. Most MapReduce jobs characteristically have less data output than input.

You might have better luck getting a solution from the Hadoop Core mailing list; sign up at http://hadoop.apache.org/core/mailing_lists.html. Not only are there more of us there who can help, but the list format is also much better suited to the back-and-forth required to diagnose these sorts of issues.

Regards,
- Aaron Kimball</description>
		<content:encoded><![CDATA[<p>Hi Abhishek,</p>
<p>That&#8217;s an interesting problem. I&#8217;m not particularly certain of the answer. Looking through the configuration, no settings jump out at me in terms of controlling the combiner process. The mapred.inmem.merge.threshold parameter defaults to 1000. So I think that after 1000 intermediate files get created, they should be merged before presentation to the combiner. I&#8217;m not sure why so many millions of files are being created.</p>
<p>When you say that the mapper runs &#8220;a billion iterations,&#8221; does that mean that for every input record to the mapper, you generate a billion (k, v) pairs as output? How many input records were you processing? I&#8217;m not sure Hadoop was really designed for a 1,000,000,000:1 fanout ratio. Most MapReduce jobs characteristically have less data output than input.</p>
<p>You might have better luck getting a solution from the Hadoop Core mailing list; sign up at <a href="http://hadoop.apache.org/core/mailing_lists.html" rel="nofollow">http://hadoop.apache.org/core/mailing_lists.html</a>. Not only are there more of us there who can help, but the list format is also much better suited to the back-and-forth required to diagnose these sorts of issues.</p>
<p>Regards,<br />
- Aaron Kimball</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Abhishek Verma</title>
		<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/comment-page-1/#comment-664</link>
		<dc:creator>Abhishek Verma</dc:creator>
		<pubDate>Sat, 04 Apr 2009 05:16:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=380#comment-664</guid>
		<description>I created a Hadoop job which ran a billion iterations per mapper and ran it on a Hadoop cluster of 62 dual quad-core nodes. I was also using the combiner optimization to decrease the intermediate data. A million iterations ran in under a minute, but the billion iterations ran for more than 40 hours. At the end of it, I killed the job, but the cleanup seemed to be taking forever.

The framework made more than 3 million files on each node in the same directory, and all the inode tables were fragmented. In fact, just counting the files took 20 minutes.

I am wondering if there is a parameter that can be tweaked so that the intermediate map outputs are spilled infrequently and appended to existing tmp files instead of creating new ones. Note that I did not exceed the default (1024) open fd limit at any point.

Or does the Hadoop framework need to be changed in order to do this?</description>
		<content:encoded><![CDATA[<p>I created a Hadoop job which ran a billion iterations per mapper and ran it on a Hadoop cluster of 62 dual quad-core nodes. I was also using the combiner optimization to decrease the intermediate data. A million iterations ran in under a minute, but the billion iterations ran for more than 40 hours. At the end of it, I killed the job, but the cleanup seemed to be taking forever.</p>
<p>The framework made more than 3 million files on each node in the same directory, and all the inode tables were fragmented. In fact, just counting the files took 20 minutes.</p>
<p>I am wondering if there is a parameter that can be tweaked so that the intermediate map outputs are spilled infrequently and appended to existing tmp files instead of creating new ones. Note that I did not exceed the default (1024) open fd limit at any point.</p>
<p>Or does the Hadoop framework need to be changed in order to do this?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: andy.edmonds.be &#8250; links for 2009-03-31</title>
		<link>http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/comment-page-1/#comment-615</link>
		<dc:creator>andy.edmonds.be &#8250; links for 2009-03-31</dc:creator>
		<pubDate>Wed, 01 Apr 2009 00:34:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=380#comment-615</guid>
		<description>[...] Cloudera Hadoop &amp; Big Data Blog &#187; Blog Archive &#187; Configuration Parameters: What can you just i... (tags: hadoop configuration performance tuning) [...]</description>
		<content:encoded><![CDATA[<p>[...] Cloudera Hadoop &amp; Big Data Blog &#187; Blog Archive &#187; Configuration Parameters: What can you just i&#8230; (tags: hadoop configuration performance tuning) [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

