In this installment, we provide insight into how the Fair Scheduler works, and why it works the way it does.
In Part 3 of this series, you got a quick introduction to Fair Scheduler, one of the scheduler choices in Apache Hadoop YARN (and the one recommended by Cloudera). In Part 4, we will cover most of the queue properties, some examples of their use, as well as their limitations.
By default, the Fair Scheduler uses a single default queue. Creating additional queues and setting the appropriate properties allows for more fine-grained control of how applications are run on the cluster.
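For illustration, here is a minimal sketch of a Fair Scheduler allocation file defining two extra queues (the queue names here are hypothetical; the file's location is given by the yarn.scheduler.fair.allocation.file property in yarn-site.xml):

<?xml version="1.0"?>
<allocations>
  <!-- Each queue defined here becomes a child of the implicit root queue. -->
  <queue name="production">
  </queue>
  <queue name="development">
  </queue>
</allocations>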
FairShare is simply an amount of YARN cluster resources. We will use the notation <memory: 100GB, vcores: 25> to indicate a FairShare of 100GB of memory and 25 vcores.
<weight>Numerical Value 0.0 or greater</weight>
Notes:
- The default weight for a queue is 1.0.
- A queue's FairShare is proportional to its weight relative to the weights of its sibling queues.
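For example, a sketch of two hypothetical queues splitting the cluster 3:1 by weight:

<allocations>
  <queue name="large_jobs">
    <weight>3.0</weight> <!-- entitled to 3/4 of the cluster's resources -->
  </queue>
  <queue name="small_jobs">
    <weight>1.0</weight> <!-- entitled to 1/4 of the cluster's resources -->
  </queue>
</allocations>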
There are two types of resource constraints:
<minResources>20000 mb, 10 vcores</minResources>
<maxResources>2000000 mb, 1000 vcores</maxResources>
Notes:
- minResources is a floor: when the queue has demand, the scheduler prioritizes it until its minimum is satisfied.
- maxResources is a hard ceiling: the queue is never allocated more than this amount, even if the rest of the cluster sits idle.
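For example, a sketch of a hypothetical queue guaranteed a small floor of resources but capped well below the full cluster:

<queue name="etl">
  <minResources>20000 mb, 10 vcores</minResources>   <!-- floor: scheduler favors this queue until met -->
  <maxResources>400000 mb, 100 vcores</maxResources> <!-- hard ceiling: never exceeded -->
</queue>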
There are two ways to limit applications on the queue:
<maxRunningApps>10</maxRunningApps>
<maxAMShare>0.3</maxAMShare>
Notes:
- maxRunningApps caps how many applications in the queue can run concurrently; additional applications wait until a running one completes.
- maxAMShare limits the fraction of the queue's FairShare that can be consumed by ApplicationMasters (the default is 0.5), preventing a queue full of ApplicationMasters from starving the actual work.
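For example, a sketch of a hypothetical queue that runs at most ten applications at once and reserves most of its FairShare for the work itself rather than ApplicationMasters:

<queue name="ad_hoc">
  <maxRunningApps>10</maxRunningApps> <!-- the 11th application waits until one finishes -->
  <maxAMShare>0.3</maxAMShare>        <!-- at most 30% of the queue's FairShare for ApplicationMasters -->
</queue>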
These properties are used to limit who can submit applications to the queue and who can administer (i.e., kill) applications on the queue.
<aclSubmitApps>user1,user2,user3,... group1,group2,...</aclSubmitApps>
<aclAdministerApps>userA,userB,userC,... groupA,groupB,...</aclAdministerApps>
Note: If yarn.acl.enable is set to true in yarn-site.xml, then the yarn.admin.acl property will also be considered a list of valid administrators in addition to the aclAdministerApps queue property.
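For example, a sketch of a hypothetical queue that only user alice and members of the data_eng group can submit to, and only user ops_admin can administer (users come first in the list, then a single space, then groups):

<queue name="pipeline">
  <aclSubmitApps>alice data_eng</aclSubmitApps>
  <aclAdministerApps>ops_admin</aclAdministerApps>
</queue>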
Once you have configured the queues, a queue-placement policy tells the Fair Scheduler which queue each incoming application should be placed in. A policy of the form:
<queuePlacementPolicy>
  <Rule #1>
  <Rule #2>
  <Rule #3>
  ...
</queuePlacementPolicy>
will match against “Rule #1” and, if that fails, match against “Rule #2”, and so on until a rule matches. There is a predefined set of rules from which to choose, and their behavior can be modeled as a flowchart. The last rule must be a terminal rule: either the default rule or the reject rule.
If no queue-placement policy is set, then the Fair Scheduler will use a default rule based on the yarn.scheduler.fair.user-as-default-queue and yarn.scheduler.fair.allow-undeclared-pools properties in yarn-site.xml.
All the queue-placement policy rules allow for an XML attribute called create, which can be set to “true” or “false”. For example:
<rule name="specified" create="false">
If the create attribute is true, the rule is allowed to create a queue with the name determined by that rule if it doesn't already exist. If the create attribute is false, the rule is not allowed to create the queue, and queue placement moves on to the next rule.
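Putting this together, here is a sketch of a policy in which an application requesting a nonexistent queue falls through to a per-user queue:

<queuePlacementPolicy>
  <!-- Use the queue named at submission time, but do not create new queues. -->
  <rule name="specified" create="false" />
  <!-- Otherwise, place the application in a queue named after the user, creating it if needed. -->
  <rule name="user" create="true" />
  <!-- Terminal rule: anything left over goes to the default queue. -->
  <rule name="default" />
</queuePlacementPolicy>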
This section gives an example for each type of queue-placement policy available and provides a flowchart for how the scheduler determines which application goes to which queue (or whether the application creates a new queue).
<rule name="nestedUserQueue" create=”true”>
<!-- Put one of the other Queue Placement Policies here. -->
</rule>
<rule name="nestedUserQueue" create=”true”>
<rule name="primaryGroup" create="false" />
</rule>
Summary: Define a default queue.
Example rule in XML: <rule name="default" queue="default" />
Sidebar: Application Scheduling Status
As discussed in Part 1 of this series, an application comprises one or more tasks, with each task running in a container. For the purposes of this blog post, an application can be in one of four states:
Assume that we have a YARN cluster with total resources <memory: 800GB, vcores: 200> and two queues: root.busy (weight: 1.0) and root.sometimes_busy (weight: 3.0). Based on these weights, the FairShare of the busy queue is <memory: 200GB, vcores: 50> and the FairShare of the sometimes_busy queue is <memory: 600GB, vcores: 150>. There are generally four scenarios of interest:
So, how can Scenario A be avoided?
One solution would be to set maxResources on the busy queue. Suppose the maxResources attribute is set on the busy queue to the equivalent of 25% of the cluster. Since maxResources is a hard limit, the applications in the busy queue will be limited to the 25% total at all times. So, in the case where 100% of the cluster could be used, the cluster utilization is actually closer to 35% (10% for the sometimes_busy queue and 25% for the busy queue).
Scenario A will improve noticeably since there are free resources on the cluster that can only go to applications in the sometimes_busy queue, but the average cluster utilization would likely be low.
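As a sketch, on the <memory: 800GB, vcores: 200> cluster above, that 25% cap on the busy queue could be expressed as follows (25% of the cluster is <memory: 200GB, vcores: 50>, i.e. 204800 mb):

<queue name="busy">
  <weight>1.0</weight>
  <!-- Hard cap at 25% of the cluster: 204800 mb = 200GB, 50 vcores. -->
  <maxResources>204800 mb, 50 vcores</maxResources>
</queue>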
Going forward, we will refer to Instantaneous FairShare as simply “FairShare.”
Given these new definitions, the previous scenarios can be phrased as follows:
In Scenario A, you can see the imbalance between each queue's Allocation and its FairShare. Balance is slowly restored as containers are released by the busy queue and allocated to the sometimes_busy queue.
By turning on preemption, the Fair Scheduler can kill containers in the busy queue and reallocate those resources to the sometimes_busy queue much more quickly.
To turn on preemption, set this property in yarn-site.xml:
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
Then, in your Fair Scheduler allocation file, preemption can be configured on a queue via fairSharePreemptionThreshold and fairSharePreemptionTimeout, as shown in the example below. The fairSharePreemptionTimeout is the number of seconds the queue stays below its fairSharePreemptionThreshold before it will try to preempt containers to take resources from other queues.
<allocations>
  <queue name="busy">
    <weight>1.0</weight>
  </queue>
  <queue name="sometimes_busy">
    <weight>3.0</weight>
    <fairSharePreemptionThreshold>0.50</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
  </queue>
  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="reject" />
  </queuePlacementPolicy>
</allocations>
Recall that the FairShare of the sometimes_busy queue is <memory: 600GB, vcores: 150>. The two new properties tell the Fair Scheduler that the sometimes_busy queue will wait 60 seconds before beginning preemption. If, in that time, it has not received at least 50% of its FairShare (<memory: 300GB, vcores: 75>), the Fair Scheduler can begin killing containers in the busy queue and allocating the freed resources to the sometimes_busy queue.
A couple of things to note:
(Note: We will not cover minResources and minSharePreemptionTimeout on a queue; FairShare preemption is the currently recommended approach.)
To get more information about Fair Scheduler, take a look at the online documentation (Apache Hadoop and CDH versions are available).
Part 5 will provide specific examples, such as job prioritization, and show how to use queues and their properties to handle such situations.
Ray Chiang is a Software Engineer at Cloudera.
Dennis Dawson is a Technical Writer at Cloudera.