<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera &#187; scheduling</title>
	<atom:link href="http://www.cloudera.com/blog/tag/scheduling/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudera.com</link>
	<description>Hadoop and Cloudera&#039;s Products and Services</description>
	<lastBuildDate>Thu, 24 May 2012 17:53:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Upcoming Functionality in &#8220;Fair Scheduler 2.0&#8221;</title>
		<link>http://www.cloudera.com/blog/2009/04/upcoming-functionality-in-fair-scheduler-20/</link>
		<comments>http://www.cloudera.com/blog/2009/04/upcoming-functionality-in-fair-scheduler-20/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 16:27:19 +0000</pubDate>
		<dc:creator>Amr Awadallah</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[resource management]]></category>
		<category><![CDATA[scheduling]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=400</guid>
		<description><![CDATA[(guest blog post by Matei Zaharia) As Hadoop clusters grow in size and data volume, it becomes more and more useful to share them between multiple users and to isolate these users. If User 1 is running a ten-hour machine learning job, for example, this should not prevent User 2 from running a 2-minute [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><em>(guest blog post by </em><a href="http://www.linkedin.com/in/mateizaharia"><em>Matei Zaharia</em></a><em>)</em></p></blockquote>
<p>As Hadoop clusters grow in size and data volume, it becomes more and more useful to share them between multiple users and to isolate these users. If User 1 is running a ten-hour machine learning job, for example, this should not prevent User 2 from running a 2-minute <a href="http://hadoop.apache.org/hive">Hive</a> query. In November, I <a href="http://www.cloudera.com/blog/2008/11/23/job-scheduling-in-hadoop/">blogged</a> about how Hadoop 0.19 supports pluggable job schedulers, and how we worked with Facebook to implement a Fair Scheduler for Hadoop using this new functionality. The Fair Scheduler gives each user a configurable share of the cluster while that user has running jobs, but assigns these resources to other users when the user is inactive.</p>
<p>Since last fall, the Fair Scheduler has been picked up by Hadoop users outside Facebook, including the Google/IBM academic Hadoop cluster. It&#8217;s also received extensive testing and patches from Yahoo!. Furthermore, we&#8217;ve included the Fair Scheduler in <a href="http://www.cloudera.com/hadoop">Cloudera&#8217;s Distribution for Hadoop</a>, where it is integrated right into the JobTracker management UI. Through production experience, testing, and feedback from users, we&#8217;ve made a lot of improvements to the Fair Scheduler, some of which are available now and others of which will come out in the next major version, which I&#8217;m calling &#8220;Fair Scheduler 2.0&#8221;. Here is a summary of the upcoming functionality:</p>
<ol>
<li>Fair sharing has changed from giving equal shares to each job to giving equal shares to each user. This means that users who submit many jobs don&#8217;t get an advantage over users running only a few jobs. It&#8217;s also possible to give different weights to different users.</li>
<li>The Fair Scheduler now supports killing tasks from other users&#8217; jobs if those jobs do not give up slots on their own. For each pool (by default there is one pool per user, but one can also have specially named pools), there&#8217;s a configurable timeout after which the pool can kill other jobs&#8217; tasks so that its own jobs can start running. This means that it&#8217;s possible to provide &#8220;service guarantees&#8221; for production jobs that are sharing a cluster with experimental queries.</li>
<li>The scheduler can now assign multiple tasks per heartbeat, which is important for maintaining high utilization in large clusters.</li>
<li>A technique called <a href="https://issues.apache.org/jira/browse/HADOOP-4667">delay scheduling</a> increases data locality for small jobs, improving performance in a data warehouse workload with many small jobs such as Facebook&#8217;s.</li>
<li>The internal logic has been simplified so that the scheduler can support different scheduling policies within each pool, and in particular we plan to support FIFO pools. Many users have requested FIFO pools because they want to be able to queue up batch workflows on the same cluster that&#8217;s running more interactive jobs.</li>
<li>Many bug fixes and performance improvements were contributed or suggested by a team stress-testing the scheduler at Yahoo!.</li>
<li>The same team has also contributed Forrest web-based documentation for the fair scheduler (to be available in Hadoop 0.20).</li>
</ol>
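<p>To make the first item concrete, here is a minimal Python sketch that contrasts per-job sharing with per-user sharing. It is an illustration only: the function names, the <code>jobs_by_user</code> mapping, and the default weight of 1.0 are assumptions made for this example and are not taken from the scheduler&#8217;s code.</p>

```python
def per_job_shares(jobs_by_user, total_slots):
    """Old behaviour (sketch): every job gets an equal share, so a user's
    total share grows with the number of jobs that user submits."""
    total_jobs = sum(jobs_by_user.values())
    return {
        user: total_slots * jobs / total_jobs
        for user, jobs in jobs_by_user.items()
        if jobs > 0
    }


def per_user_shares(jobs_by_user, total_slots, weights=None):
    """New behaviour (sketch): every active user gets a share proportional
    to a weight, regardless of how many jobs the user is running.

    `weights` is a hypothetical {user: weight} mapping; users that are
    missing from it are assumed to have weight 1.0.
    """
    weights = weights or {}
    active = [user for user, jobs in jobs_by_user.items() if jobs > 0]
    total_weight = sum(weights.get(user, 1.0) for user in active)
    return {
        user: total_slots * weights.get(user, 1.0) / total_weight
        for user in active
    }


if __name__ == "__main__":
    jobs = {"alice": 9, "bob": 1}
    print(per_job_shares(jobs, 100))
    print(per_user_shares(jobs, 100))
    print(per_user_shares(jobs, 100, weights={"bob": 3.0}))
```

<p>In this sketch, a user running nine jobs next to a user running one job would hold nine tenths of the slots under per-job sharing, but only half of them under per-user sharing with equal weights.</p>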
<p>As a grad student and the original developer of the Fair Scheduler, I&#8217;ve had a great experience interacting with the Hadoop community to improve the scheduler. The fact that production experience at Facebook, large-scale testing at Yahoo!, and wishes from other users are being combined into this single piece of software is a testament to the strength of Hadoop&#8217;s open-source model. The next release of the Fair Scheduler (likely in Hadoop 0.21, although we will also release back-ports to older Hadoop versions) will make it easier to manage multi-user clusters, give FIFO scheduling to users who desire it, improve performance and reduce the need for manual intervention with misbehaving jobs. You can also be sure that we&#8217;ll continue supporting the scheduler in <a href="http://www.cloudera.com/hadoop">Cloudera&#8217;s Distribution for Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2009/04/upcoming-functionality-in-fair-scheduler-20/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Job Scheduling in Hadoop</title>
		<link>http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/</link>
		<comments>http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/#comments</comments>
		<pubDate>Sun, 23 Nov 2008 21:33:52 +0000</pubDate>
		<dc:creator>Amr Awadallah</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[administration]]></category>
		<category><![CDATA[scheduling]]></category>

		<guid isPermaLink="false">http://www.cloudera.com/blog/?p=103</guid>
		<description><![CDATA[(guest blog post by Matei Zaharia) When Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><em>(guest blog post by </em><a href="http://www.linkedin.com/in/mateizaharia"><em>Matei Zaharia</em></a><em>)</em></p></blockquote>
<p>When Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users. The benefits of sharing are tremendous: with all the data in one place, users can run queries that they may never have been able to execute otherwise, and costs go down because system utilization is higher than it would be with a separate Hadoop cluster for each group. However, sharing requires support from the Hadoop job scheduler to provide guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users.</p>
<p>This July, the scheduler in Hadoop <a href="https://issues.apache.org/jira/browse/HADOOP-3412">became a pluggable component</a> and opened the door for innovation in this space. The result was two schedulers for multi-user workloads: the <a href="https://issues.apache.org/jira/browse/HADOOP-3746">Fair Scheduler</a>, developed at Facebook, and the <a href="https://issues.apache.org/jira/browse/HADOOP-3445">Capacity Scheduler</a>, developed at Yahoo.</p>
<p>The Fair Scheduler arose out of Facebook&#8217;s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through <a href="http://wiki.apache.org/hadoop/Hive">Hive</a> (Facebook&#8217;s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook&#8217;s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:</p>
<ul>
<li>Jobs are placed into named &#8220;pools&#8221; based on a configurable attribute such as user name or Unix group, or by explicitly tagging a job as belonging to a particular pool through its jobconf.</li>
<li>Each pool can have a &#8220;guaranteed capacity&#8221; that is specified through a config file, which gives a minimum number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets at least this many slots, but if it has no jobs, the slots can be used by other pools.</li>
<li>Excess capacity that is not going toward a pool&#8217;s minimum is allocated between jobs using fair sharing. Fair sharing ensures that over time, each job receives roughly the same amount of resources. This means that shorter jobs will finish quickly, while longer jobs are guaranteed not to get starved.</li>
</ul>
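<p>The three concepts above can be sketched in a few lines of Python. This is a simplified, one-shot model rather than the scheduler&#8217;s actual logic, which assigns slots incrementally as tasks finish. The <code>Pool</code> class, its field names, and the <code>allocate_slots</code> function are hypothetical names introduced for this example.</p>

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical model of a pool; not one of the scheduler's real classes."""
    name: str
    min_slots: int      # guaranteed capacity from the config file
    running_jobs: int   # number of jobs currently in the pool


def allocate_slots(pools, total_slots):
    """Return {pool name: slots} for a single scheduling round.

    Step 1: every pool that has jobs receives its guaranteed minimum.
    Step 2: the excess is divided evenly between all jobs, and each pool
            collects the shares of its own jobs.
    """
    active = [p for p in pools if p.running_jobs > 0]
    shares = {p.name: 0 for p in pools}

    # Step 1: guaranteed capacity, only for pools that have jobs.
    remaining = total_slots
    for pool in active:
        granted = min(pool.min_slots, remaining)
        shares[pool.name] = granted
        remaining -= granted

    # Step 2: fair sharing of the excess, one equal share per job.
    total_jobs = sum(p.running_jobs for p in active)
    if total_jobs == 0:
        return shares
    per_job, leftover = divmod(remaining, total_jobs)
    for pool in active:
        shares[pool.name] += per_job * pool.running_jobs

    # Hand out the indivisible leftover slots one per job, in pool order.
    for pool in active:
        if leftover == 0:
            break
        extra = min(leftover, pool.running_jobs)
        shares[pool.name] += extra
        leftover -= extra
    return shares


if __name__ == "__main__":
    pools = [
        Pool("production", min_slots=40, running_jobs=1),
        Pool("alice", min_slots=0, running_jobs=2),
        Pool("idle", min_slots=20, running_jobs=0),
    ]
    print(allocate_slots(pools, total_slots=100))
```

<p>In this example the idle pool holds no slots even though it has a guaranteed capacity of 20, because it has no jobs. The production pool receives its 40 guaranteed slots plus one per-job share of the excess, and the remaining slots go to the two jobs in the other pool.</p>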
<p>The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weight the shares of different jobs. There is currently no support for preemption of long tasks, but this is being added in <a href="https://issues.apache.org/jira/browse/HADOOP-4665">HADOOP-4665</a>, which will allow you to set how long each pool will wait before preempting other jobs&#8217; tasks to reach its guaranteed capacity.</p>
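<p>The per-pool wait time before preemption can be pictured with the following Python sketch. It is only one plausible reading of the behaviour described above, not the code from HADOOP-4665: the <code>PoolState</code> class, its fields, and the <code>slots_to_preempt</code> function are assumptions made for illustration.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolState:
    """Hypothetical bookkeeping for one pool (names are illustrative)."""
    min_slots: int                 # guaranteed capacity of the pool
    running_slots: int             # slots the pool currently occupies
    has_pending_tasks: bool        # whether the pool could use more slots
    preemption_timeout: float      # seconds the pool waits before preempting
    starved_since: Optional[float] = None  # when the pool fell below its minimum


def slots_to_preempt(pool, now):
    """Return how many slots the pool may reclaim from other jobs at time `now`.

    The pool only preempts after it has stayed below its guaranteed
    capacity, with work waiting, for at least `preemption_timeout` seconds.
    """
    below_min = pool.has_pending_tasks and pool.running_slots < pool.min_slots
    if not below_min:
        pool.starved_since = None   # the pool is satisfied, so reset the clock
        return 0
    if pool.starved_since is None:
        pool.starved_since = now    # start timing the shortfall
    if now - pool.starved_since < pool.preemption_timeout:
        return 0                    # still within the grace period
    return pool.min_slots - pool.running_slots


if __name__ == "__main__":
    pool = PoolState(min_slots=10, running_slots=4,
                     has_pending_tasks=True, preemption_timeout=60.0)
    for t in (0.0, 30.0, 60.0):
        print(t, slots_to_preempt(pool, now=t))
```

<p>Resetting the timer whenever the pool is satisfied means that a brief shortfall, which other jobs resolve on their own by finishing tasks, never triggers preemption.</p>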
<p>The Fair Scheduler has been in production use at Facebook since August. You can find it in the Hadoop trunk code under src/contrib/fairscheduler, and there are also versions of the scheduler for Hadoop 0.17 and Hadoop 0.18 on its <a href="https://issues.apache.org/jira/browse/HADOOP-3746">JIRA page</a>. All of these versions come with a <a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/fairscheduler/README">README file</a>, located under src/contrib/fairscheduler, that explains how to set up the scheduler.</p>
<p>The <a href="https://issues.apache.org/jira/browse/HADOOP-3445">Capacity Scheduler</a> from Yahoo offers similar functionality to the Fair Scheduler but follows a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect &#8211; you can place a limit on the percentage of running tasks per user, so that users share a cluster equally. In other words, the Capacity Scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues&#8217; tasks if it is below its configured capacity. Documentation for the scheduler can be built as described in its <a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/capacity-scheduler/README">README file</a> under src/contrib/capacity-scheduler in the Hadoop trunk SVN.</p>
<p>Now that the Fair Scheduler and Capacity Scheduler are available, there has been increased focus on other aspects of multi-user Hadoop clusters, such as isolating users and improving performance for the short interactive jobs seen in these environments. This has led to some exciting scheduling-related patches you can expect to see in future Hadoop releases:</p>
<ul>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-4487">HADOOP-4487</a>, which adds a number of security features to isolate users.</li>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-3136">HADOOP-3136</a>, which lets the scheduler launch multiple tasks per heartbeat, improving &#8220;ramp-up time&#8221;.</li>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-4664">HADOOP-4664</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-4513">4513</a> and <a href="https://issues.apache.org/jira/browse/HADOOP-4372">4372</a>, which parallelize job initialization to launch small jobs faster.</li>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-2014">HADOOP-2014</a>, which chooses input blocks from overloaded racks when launching non-local maps.</li>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-3759">HADOOP-3759</a> and <a href="https://issues.apache.org/jira/browse/HADOOP-657">657</a>, which take into account tasks&#8217; memory and disk space requirements to prevent oversubscribing nodes.</li>
<li><a href="https://issues.apache.org/jira/browse/HADOOP-4667">HADOOP-4667</a>, which improves locality for small jobs in the fair scheduler by letting it look at multiple jobs to select a local task.</li>
</ul>
<p>With the recent progress on scheduling, Hadoop is quickly growing to support the kind of multi-user data warehouse seen at Facebook: short interactive jobs, large batch jobs, and guaranteed-capacity production jobs sharing a cluster and delivering results quickly while maintaining high throughput. With a job scheduler that protects production jobs, users can try interesting R&amp;D experiments on your data set and gain valuable insights without worrying about affecting mission-critical jobs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

