In Apache Hadoop YARN 3.x (YARN for short), switching to Capacity Scheduler has considerable benefits and only a few drawbacks. To bring these benefits to users who are currently on Fair Scheduler, we created a tool together with the upstream YARN community to help with the migration process.
What can we gain by switching to Capacity Scheduler? A couple of examples:
Also, with the release of CDP, Cloudera’s vision is to support Capacity Scheduler as the default scheduler for YARN and phase out Fair Scheduler. Supporting two schedulers simultaneously poses problems: not only does it require more support and engineering effort, but it also needs extra testing as well as more complicated test cases and test suites due to feature gaps.
After a long and careful analysis, we decided to choose Capacity Scheduler as the default scheduler. We put together a document that compares the features of Capacity Scheduler and Fair Scheduler under YARN-9698 (Direct link).
If you are new to this topic, you can get familiar with Capacity Scheduler by reading the following article: YARN – The Capacity Scheduler
In this blog post, we will:
Please note that although we tested the tool with various Fair Scheduler and YARN site configurations, it’s a new addition to Apache Hadoop. Manual inspection and review of the generated output files are highly recommended.
The converter itself is a CLI application that is part of the yarn command. To invoke the tool, you need to use the yarn fs2cs command with various command-line arguments.
The tool generates two files as output: a capacity-scheduler.xml and a yarn-site.xml. Note that the site configuration is just a delta: it contains only the new settings for the Capacity Scheduler, meaning that you have to manually copy-paste these values into your existing site configuration. Leaving the existing Fair Scheduler properties is unlikely to cause any harm or malfunction, but we recommend removing them to avoid confusion.
The generated properties can also go to the standard output instead of the aforementioned files.
The tool is officially part of the CDH to CDP upgrade, which is documented here.
The basic usage is:
yarn fs2cs -y /path/to/yarn-site.xml [-f /path/to/fair-scheduler.xml] {-o /output/path/ | -p} [-t] [-s] [-d]
Switches listed between square brackets [] are optional. Curly braces {} mean that the switch is mandatory and you have to pick exactly one of the alternatives.
You can also use their long forms:
yarn fs2cs --yarnsiteconfig /path/to/yarn-site.xml [--fsconfig /path/to/fair-scheduler.xml] {--output-directory /output/path/ | --print} [--no-terminal-rule-check] [--skip-verification] [--dry-run]
Example:
yarn fs2cs --yarnsiteconfig /home/hadoop/yarn-site.xml --fsconfig /home/hadoop/fair-scheduler.xml --output-directory /tmp
Important: always use the absolute path for -f / --fsconfig.
For a list of all command-line switches with explanations, run yarn fs2cs --help. The CLI options are also listed in this document.
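For instance, if you only want to inspect the generated properties without writing any files, you can replace the output directory with the --print switch mentioned above (an illustrative invocation, reusing the paths from the example):
yarn fs2cs --yarnsiteconfig /home/hadoop/yarn-site.xml --fsconfig /home/hadoop/fair-scheduler.xml --print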
Let’s see a short demonstration of the tool.
Suppose we have the following simple fair-scheduler.xml:
<allocations>
  <queue name="root">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
    <queue name="default">
      <weight>1.0</weight>
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
    <queue name="users" type="parent">
      <maxChildResources>memory-mb=8192, vcores=1</maxChildResources>
      <weight>1.0</weight>
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
  </queue>
  <queuePlacementPolicy>
    <rule name="specified" create="true"/>
    <rule name="nestedUserQueue" create="true">
      <rule name="default" create="true" queue="users"/>
    </rule>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
We also have the following entries in yarn-site.xml (listing only those which are related to Fair Scheduler):
yarn.scheduler.fair.allow-undeclared-pools = true
yarn.scheduler.fair.user-as-default-queue = true
yarn.scheduler.fair.preemption = false
yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.8
yarn.scheduler.fair.sizebasedweight = false
yarn.scheduler.fair.assignmultiple = true
yarn.scheduler.fair.dynamicmaxassign = true
yarn.scheduler.fair.maxassign = -1
yarn.scheduler.fair.continuous-scheduling-enabled = false
yarn.scheduler.fair.locality-delay-node-ms = 2000
Let’s run the converter for these files:
~$ yarn fs2cs -y /home/examples/yarn-site.xml -f /home/examples/fair-scheduler.xml -o /tmp
2020-05-05 14:22:41,384 INFO [main] converter.FSConfigToCSConfigConverter (FSConfigToCSConfigConverter.java:prepareOutputFiles(138)) - Output directory for yarn-site.xml and capacity-scheduler.xml is: /tmp
2020-05-05 14:22:41,388 INFO [main] converter.FSConfigToCSConfigConverter (FSConfigToCSConfigConverter.java:loadConversionRules(177)) - Conversion rules file is not defined, using default conversion config!
[...] output trimmed for brevity
2020-05-05 14:22:42,572 ERROR [main] converter.FSConfigToCSConfigConverterMain (MarkerIgnoringBase.java:error(159)) - Error while starting FS configuration conversion!
[...] output trimmed for brevity
Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Rules after rule 2 in queue placement policy can never be reached
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.updateRuleSet(QueuePlacementPolicy.java:115)
[...]
This is a very typical error. If we look at the placement rules in fair-scheduler.xml, we can see that a default rule comes after nestedUserQueue. We need to use the --no-terminal-rule-check switch to skip the terminal rule check inside Fair Scheduler. Why? See the section below.
By default, Fair Scheduler performs a strict check of whether a placement rule is terminal or not. For example, a <reject> rule followed by a <specified> rule is not allowed, because the latter can never be reached. However, before YARN-8967 (Change FairScheduler to use PlacementRule interface), Fair Scheduler was more lenient and accepted certain sequences of rules that are no longer valid. Since the tool instantiates a Fair Scheduler internally to read and parse the allocation file, the -t or --no-terminal-rule-check argument must be supplied to suppress the exception thrown by Fair Scheduler for such configurations. In CDH 5.x, these kinds of placement configurations are common, so it’s recommended to always use -t.
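To illustrate, a placement policy like the following (a made-up snippet, not taken from our example) trips the strict check, because the <specified> rule after <reject> can never be reached:
<queuePlacementPolicy>
  <rule name="reject"/>
  <rule name="specified"/>
</queuePlacementPolicy>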
~$ yarn fs2cs -y /home/examples/yarn-site.xml -f /home/examples/fair-scheduler.xml -o /tmp --no-terminal-rule-check
2020-05-05 14:41:39,189 INFO [main] capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(384)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<>, maximumAllocation=<>, asynchronousScheduling=false, asyncScheduleInterval=5ms, multiNodePlacementEnabled=false
2020-05-05 14:41:39,190 INFO [main] converter.ConvertedConfigValidator (ConvertedConfigValidator.java:validateConvertedConfig(72)) - Capacity scheduler was successfully started

This time, the conversion succeeded!
There are a couple of things worth mentioning from the log:
2020-05-05 14:41:38,908 WARN [main] converter.FSConfigToCSConfigRuleHandler (ConversionOptions.java:handleWarning(48)) - Setting <...> is not supported, ignoring conversion
2020-05-05 14:41:38,945 WARN [main] converter.FSConfigToCSConfigRuleHandler (ConversionOptions.java:handleWarning(48)) - Setting <...> is not supported, ignoring conversion
If we look at /tmp/yarn-site.xml, we can see that it is really short:
yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled = true
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
Well, there is not much here. That is because a lot of scheduling-related settings are disabled: there is no preemption, no continuous scheduling, and no rack or node locality threshold set.
Let’s take a look at the new capacity-scheduler.xml (again, this is formatted here and the unnecessary XML tags were removed):
yarn.scheduler.capacity.root.users.maximum-capacity = 100
yarn.scheduler.capacity.root.default.capacity = 50.000
yarn.scheduler.capacity.root.default.ordering-policy = fair
yarn.scheduler.capacity.root.users.capacity = 50.000
yarn.scheduler.capacity.root.default.maximum-capacity = 100
yarn.scheduler.capacity.root.queues = default,users
yarn.scheduler.capacity.root.maximum-capacity = 100
yarn.scheduler.capacity.maximum-am-resource-percent = 0.5
Notice the property yarn.scheduler.capacity.maximum-am-resource-percent, which is set to 0.5. This setting is missing from fair-scheduler.xml, so why is it here? The tool has to set it because the default in Capacity Scheduler is 10%, while in Fair Scheduler it is 50%.
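If 50% is too permissive for your cluster, you can of course lower this value in the generated file afterwards, for example back to the Capacity Scheduler default:
yarn.scheduler.capacity.maximum-am-resource-percent = 0.1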
Let’s modify the following properties:
yarn.scheduler.fair.preemption = true
yarn.scheduler.fair.sizebasedweight = true
yarn.scheduler.fair.continuous-scheduling-enabled = true
After running the conversion again, these settings are now reflected in the new yarn-site.xml:
yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.schedule-asynchronously.scheduling-interval-ms = 5
yarn.scheduler.capacity.schedule-asynchronously.enable = true
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval = 10000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill = 15000
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled = true
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.resourcemanager.scheduler.monitor.enable = true
The size-based weight setting has also affected capacity-scheduler.xml:
yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight = true
yarn.scheduler.capacity.root.users.ordering-policy.fair.enable-size-based-weight = true
yarn.scheduler.capacity.root.users.capacity = 50.000
yarn.scheduler.capacity.root.queues = default,users
yarn.scheduler.capacity.root.users.maximum-capacity = 100
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight = true
[...] the rest is omitted because it’s the same as before
A crucial question is how to convert weights. The weight determines the “fair share” of a queue in the long run. The fair share is the amount of resources a queue can get, which in turn determines how many resources the applications submitted to that queue can use.
For example, if “root.a” and “root.b” have weights of 3 and 1, respectively, then “root.a” will get 75% of the total cluster resources and “root.b” will get 25%.
But what if we submit applications to “root.b” only? As long as “root.a” is empty, applications in “root.b” can freely occupy the whole cluster (let’s ignore <maxResources> for now).
It turns out that Capacity Scheduler’s “capacity” is very close to the concept of weight, except that it is expressed as a percentage, not as an integer. By default, however, capacity is capped, meaning that “root.b” with a capacity of 25.000 will always use only 25% of the cluster. This is where the concept of elasticity comes in. Elasticity means that free resources in the cluster can be allocated to a queue beyond its configured capacity, up to its maximum capacity, which is also expressed as a percentage. Therefore, we have to enable full elasticity for all queues.
In summary, we can achieve Fair Scheduler-like behavior with the following properties:
Weights in Fair Scheduler:
root.a = 3
root.b = 1
The respective settings for Capacity Scheduler:
yarn.scheduler.capacity.root.a.capacity = 75.000
yarn.scheduler.capacity.root.a.maximum-capacity = 100.000
yarn.scheduler.capacity.root.b.capacity = 25.000
yarn.scheduler.capacity.root.b.maximum-capacity = 100.000
Let’s imagine the following simple queue hierarchy with weights in Fair Scheduler:
root = 1
root.users = 20
root.default = 10
root.users.alice = 3
root.users.bob = 1
This results in the following capacity values after the conversion:
yarn.scheduler.capacity.root.capacity = 100.000
yarn.scheduler.capacity.root.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.capacity = 66.667
yarn.scheduler.capacity.root.users.maximum-capacity = 100.000
yarn.scheduler.capacity.root.default.capacity = 33.333
yarn.scheduler.capacity.root.default.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.alice.capacity = 75.000
yarn.scheduler.capacity.root.users.alice.maximum-capacity = 100.000
yarn.scheduler.capacity.root.users.bob.capacity = 25.000
yarn.scheduler.capacity.root.users.bob.maximum-capacity = 100.000
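In other words, the capacity of each queue is its own weight divided by the sum of the weights of its sibling queues (itself included), expressed as a percentage:
root.users: 20 / (20 + 10) * 100 = 66.667
root.default: 10 / (20 + 10) * 100 = 33.333
root.users.alice: 3 / (3 + 1) * 100 = 75.000
root.users.bob: 1 / (3 + 1) * 100 = 25.000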
After performing some basic validation steps (for example, checking whether the output directory and the input files exist), the tool loads yarn-site.xml and converts scheduling-related properties such as preemption, continuous scheduling, and rack/node locality settings.
The tool uses a Fair Scheduler instance to load and parse the allocation file. It also detects unsupported properties and prints a warning message for each, indicating that the particular setting will not be converted. Unsupported settings and known limitations are explained later in this post.
Once the conversion is done and the output files are generated, the last step is validation. By default, fs2cs tries to start Capacity Scheduler internally using the converted configuration.
This step ensures that the Resource Manager is able to start up properly using the new configuration.
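If needed, this validation step can be disabled with the --skip-verification switch listed earlier, for example:
yarn fs2cs --yarnsiteconfig /home/hadoop/yarn-site.xml --fsconfig /home/hadoop/fair-scheduler.xml --output-directory /tmp --skip-verification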
At the moment there are some feature gaps between Fair Scheduler and Capacity Scheduler. That is, a full conversion is possible only if your Fair Scheduler configuration does not rely on settings that are not yet implemented in Capacity Scheduler.
Conversion of the reservation system settings is completely skipped, and this will probably not change in the foreseeable future: it is not a frequently used feature, and it works completely differently in the two schedulers.
A placement rule in Fair Scheduler defines which queue a submitted YARN application should be placed into. Placement rules follow a “fall through” logic: if the first rule does not apply (the queue returned by the rule does not exist), then the next one is attempted and so on. If the last rule fails to return a valid queue, then the application submission is rejected.
Capacity Scheduler employs a conceptually similar approach called mapping rules. The implementation is different though: converting placement rules to mapping rules cannot be done properly at the moment. There are multiple reasons for this:
These differences make it difficult, and sometimes impossible, to convert placement rules to mapping rules. In such circumstances, cluster operators have to be creative and deviate from their original placement algorithm.
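For reference, mapping rules in Capacity Scheduler are configured through the yarn.scheduler.capacity.queue-mappings property. A hypothetical value might look like the following (the queue names are made up and this is not something fs2cs would generate); the entries are evaluated in order and the first matching one is used, so there is no real fall-through:
yarn.scheduler.capacity.queue-mappings = u:alice:analytics,g:developers:default,u:%user:%user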
The following properties will not be converted by the tool:
There is still active development going on to provide a better user experience. The most important tasks are:
The fs2cs tool has become an integral part of the CDH to CDP upgrade path, which helps customers transform their Fair Scheduler-based configuration to Capacity Scheduler. We have also learned why switching to Capacity Scheduler has clear benefits.
We have seen that not everything is perfect at the moment. There are certain features in Fair Scheduler that are either missing or just partly supported in Capacity Scheduler. When such a setting is encountered during the conversion, the tool displays a warning message.
Some aspects of the conversion are quite challenging, especially the placement rules. Even though conceptually similar, the two schedulers follow slightly different queue placement philosophies and this requires extra effort on our part to make Capacity Scheduler mapping rules work the same way.
Nonetheless, we are committed to implementing all necessary changes to increase customer satisfaction and improve user experience.
Thanks to Wilfred Spiegelenburg and Prabhu Josephraj for creating the table about the differences between the two schedulers and for providing useful insights and ideas.
Szilard Németh contributed to the tool itself by handling and processing command-line arguments, and he also reviewed this blog post.
I’d also like to thank those who provided useful feedback: Ádám Antal, Rudolf Réti, Benjámin Teke, András Győri, Wangda Tan.