YARN (MR2 Included) Health Tests

YARN (MR2 Included) JobHistory Server Health

This YARN (MR2 Included) service-level health test checks for the presence of a running, healthy JobHistory Server. The test returns "Bad" health if the service is running and the JobHistory Server is not running. In all other cases it returns the health of the JobHistory Server. A failure of this health test indicates a stopped or unhealthy JobHistory Server. Check the status of the JobHistory Server for more information. This test can be enabled or disabled using the JobHistory Server Role Health Test YARN (MR2 Included) service-wide monitoring setting.

Short Name: JobHistory Server Health

Property Name | Description | Template Name | Default Value | Unit
JobHistory Server Role Health Test | When computing the overall YARN health, consider JobHistory Server's health | yarn_jobhistoryserver_health_enabled | true | no unit
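
The rule described above (return "Bad" when the service is running but the role is not, otherwise pass the role's own health through) can be sketched as follows. This is only an illustration of the documented behavior, not Cloudera Manager's implementation. The function name, its parameters, and the choice to return None when the test is disabled are hypothetical.

```python
from typing import Optional


def role_presence_health(
    service_running: bool,
    role_running: bool,
    role_health: str,
    test_enabled: bool = True,
) -> Optional[str]:
    """Sketch of a single-role, service-level health test.

    role_health is the role's own health: "Good", "Concerning" or "Bad".
    """
    if not test_enabled:
        # Assumption: a disabled test contributes no result at all.
        return None
    if service_running and not role_running:
        # Documented rule: service is up but the role is not running.
        return "Bad"
    # Documented rule: in all other cases, return the role's own health.
    return role_health
```

For example, `role_presence_health(True, False, "Good")` evaluates to "Bad", because the role's last known health is ignored once the role is not running.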

YARN (MR2 Included) NodeManager Health

This YARN (MR2 Included) service-level health test checks that enough of the NodeManagers in the cluster are healthy. The test returns "Concerning" health if the number of healthy NodeManagers falls below a warning threshold, expressed as a percentage of the total number of NodeManagers. The test returns "Bad" health if the number of healthy and "Concerning" NodeManagers falls below a critical threshold, expressed as a percentage of the total number of NodeManagers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 NodeManagers, this test would return "Good" health if 95 or more NodeManagers have "Good" health. It would return "Concerning" health if fewer than 95 NodeManagers have "Good" health but at least 90 NodeManagers have either "Good" or "Concerning" health. If more than 10 NodeManagers have "Bad" health, this test would return "Bad" health. A failure of this health test indicates unhealthy NodeManagers. Check the status of the individual NodeManagers for more information. This test can be configured using the Healthy NodeManager Monitoring Thresholds YARN (MR2 Included) service-wide monitoring setting.

Short Name: NodeManager Health

Property Name | Description | Template Name | Default Value | Unit
Healthy NodeManager Monitoring Thresholds | The health test thresholds of the overall NodeManager health. The check returns "Concerning" health if the percentage of "Healthy" NodeManagers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" NodeManagers falls below the critical threshold. | yarn_nodemanagers_healthy_thresholds | critical:90.0, warning:95.0 | PERCENT
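
The threshold arithmetic for this test can be illustrated with a short sketch. It follows the description and the default values in the table above (warning 95.0, critical 90.0), but the function name, its parameters, and the handling of an empty cluster are assumptions made for the example. It is not the code Cloudera Manager runs.

```python
def nodemanager_health(
    good: int,
    concerning: int,
    bad: int,
    warning_pct: float = 95.0,
    critical_pct: float = 90.0,
) -> str:
    """Classify overall NodeManager health from per-role health counts."""
    total = good + concerning + bad
    if total == 0:
        # Assumption: the source does not say what happens with no roles.
        raise ValueError("no NodeManagers to evaluate")

    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total

    # Critical check first: too few "Good" plus "Concerning" roles.
    if good_or_concerning_pct < critical_pct:
        return "Bad"
    # Warning check: too few "Good" roles.
    if good_pct < warning_pct:
        return "Concerning"
    return "Good"
```

With 100 NodeManagers and the default thresholds, `nodemanager_health(95, 0, 5)` gives "Good", `nodemanager_health(92, 0, 8)` gives "Concerning", and `nodemanager_health(85, 4, 11)` gives "Bad", matching the worked example in the description.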

YARN (MR2 Included) ResourceManager Health

This YARN (MR2 Included) service-level health test checks for the presence of a running, healthy ResourceManager. The test returns "Bad" health if the service is running and the ResourceManager is not running. In all other cases it returns the health of the ResourceManager. A failure of this health test indicates a stopped or unhealthy ResourceManager. Check the status of the ResourceManager for more information. This test can be enabled or disabled using the ResourceManager Role Health Test YARN (MR2 Included) service-wide monitoring setting.

Short Name: ResourceManager Health

Property Name | Description | Template Name | Default Value | Unit
ResourceManager Role Health Test | When computing the overall YARN health, consider ResourceManager's health | yarn_resourcemanager_health_enabled | true | no unit

YARN (MR2 Included) ResourceManager Health (HA)

This YARN service-level health test checks for an active, healthy ResourceManager. The test returns "Bad" health if the service is running and an active ResourceManager cannot be found. If an active ResourceManager is found, the test checks the health of that ResourceManager as well as the health of any configured standby ResourceManager. A "Good" health result is returned only if both the active and standby ResourceManagers are healthy. A failure of this health test may indicate stopped or unhealthy ResourceManager roles, or a problem with communication between the Cloudera Manager Service Monitor and the ResourceManagers. When this test fails, check the status of the YARN service's ResourceManager roles and look in the Cloudera Manager Service Monitor's log files for more information. This test can be enabled or disabled using the Active ResourceManager Role Health Check YARN service-wide monitoring setting. The check for a healthy standby ResourceManager can be enabled or disabled with the Standby ResourceManager Health Check setting. In addition, the Active ResourceManager Detection Window setting adjusts the amount of time that the Cloudera Manager Service Monitor has to detect the active ResourceManager before this health test fails, and the ResourceManager Activation Startup Tolerance setting adjusts the amount of time around ResourceManager startup that the test allows for a ResourceManager to be made active.

Short Name: ResourceManager Health

Property Name | Description | Template Name | Default Value | Unit
Active ResourceManager Detection Window | The tolerance window used in YARN service tests that depend on detection of the active ResourceManager. | yarn_active_resourcemanager_detecton_window | CDH=[[CDH 5.0.0‥CDH 6.0.0)=3] | MINUTES
Active ResourceManager Role Health Check | When computing the overall YARN service health, whether to consider the active ResourceManager's health. | yarn_resourcemanagers_health_enabled | CDH=[[CDH 5.0.0‥CDH 6.0.0)=true] | no unit
ResourceManager Activation Startup Tolerance | The amount of time after ResourceManager(s) start that the lack of an active ResourceManager will be tolerated. This is an advanced option that does not often need to be changed. | yarn_resourcemanager_activation_startup_tolerance | CDH=[[CDH 5.0.0‥CDH 6.0.0)=180] | SECONDS
Standby ResourceManager Health Check | When computing the overall YARN service health, whether to consider the health of the standby ResourceManager. | yarn_standby_resourcemanager_health_enabled | CDH=[[CDH 5.0.0‥CDH 6.0.0)=true] | no unit
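
The way the four HA settings interact can be sketched roughly as below. The source does not spell out several details, so the following are assumptions made for illustration: the function and parameter names, returning None while a grace period is still in effect, treating the two time windows as independent grace periods, and combining active and standby health by taking the worse of the two. The last assumption is consistent with "Good" requiring both roles to be healthy, but is not confirmed by the source. The defaults mirror the table (3 minutes and 180 seconds).

```python
from typing import Optional

# Higher number means worse health.
SEVERITY = {"Good": 0, "Concerning": 1, "Bad": 2}


def ha_resourcemanager_health(
    active_health: Optional[str],
    standby_health: Optional[str],
    seconds_since_rm_start: float,
    seconds_since_active_seen: float,
    standby_check_enabled: bool = True,
    detection_window_s: float = 3 * 60,
    startup_tolerance_s: float = 180,
) -> Optional[str]:
    """Sketch of the HA test for a running YARN service.

    active_health is None when no active ResourceManager was detected;
    standby_health is None when no standby is configured.
    """
    if active_health is None:
        in_startup_grace = seconds_since_rm_start <= startup_tolerance_s
        in_detection_window = seconds_since_active_seen <= detection_window_s
        if in_startup_grace or in_detection_window:
            # Assumption: no verdict while a grace period applies.
            return None
        # Documented rule: no active ResourceManager found.
        return "Bad"

    results = [active_health]
    if standby_check_enabled and standby_health is not None:
        results.append(standby_health)
    # Assumption: the overall result is the worst individual health.
    return max(results, key=SEVERITY.__getitem__)
```

Under these assumptions, a "Bad" standby only degrades the result while the Standby ResourceManager Health Check is enabled. With `standby_check_enabled=False`, the active ResourceManager's health is returned unchanged.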