MapReduce Health Tests

MapReduce Failover Controllers Health

This is a MapReduce service-level health test that checks that all the Failover Controllers associated with this service are healthy and running. The test returns "Bad" health if any of the Failover Controllers that the service depends on is unhealthy or not running. Check the Failover Controllers' logs for more details. This test can be enabled or disabled using the Failover Controllers Healthy service-wide monitoring setting.

Short Name: Failover Controllers Health

Property Name | Description | Template Name | Default Value | Unit
Failover Controllers Healthy | Enables the health check that verifies that the failover controllers associated with this service are healthy and running. | failover_controllers_healthy_enabled | true | no unit
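
The rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Cloudera Manager code: the function name, the (healthy, running) tuple representation, and the "Disabled" result returned when the setting is off are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def failover_controllers_health(
    controllers: Iterable[Tuple[bool, bool]],
    enabled: bool = True,  # mirrors failover_controllers_healthy_enabled (default true)
) -> str:
    """Hypothetical sketch: each controller is a (healthy, running) pair."""
    if not enabled:
        # Assumption: the source does not say what a disabled test reports.
        return "Disabled"
    for healthy, running in controllers:
        # Documented rule: any unhealthy or stopped controller yields "Bad".
        if not (healthy and running):
            return "Bad"
    # Assumption: with every controller healthy and running, report "Good".
    return "Good"
```

For instance, a service with one healthy controller and one stopped controller would be reported as "Bad" by this sketch.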

MapReduce JobTracker Health

This is a MapReduce service-level health test that checks for an active, healthy JobTracker. The test returns "Bad" health if the service is running and an active JobTracker cannot be found. If an active JobTracker is found, the test checks the health of that JobTracker as well as the health of any standby JobTracker that is configured. A "Good" health result is returned only if both the active and standby JobTrackers are healthy. A failure of this health test may indicate stopped or unhealthy JobTracker roles, or it may indicate a problem with communication between the Cloudera Manager Service Monitor and the JobTrackers. When this test fails, check the status of the MapReduce service's JobTracker roles and look in the Cloudera Manager Service Monitor's log files for more information. This test can be enabled or disabled using the JobTracker Role Health Test MapReduce service-wide monitoring setting. The check for a healthy standby JobTracker can be enabled or disabled with the Standby JobTracker Health Test setting. In addition, the Active JobTracker Detection Window can be used to adjust the amount of time that the Cloudera Manager Service Monitor has to detect the active JobTracker before this health test fails, and the JobTracker Activation Startup Tolerance can be used to adjust the amount of time around JobTracker startup that the test allows for a JobTracker to be made active.

Short Name: JobTracker Health

Property Name | Description | Template Name | Default Value | Unit
Active JobTracker Detection Window | The tolerance window that will be used in MapReduce service tests that depend on detection of the active JobTracker. | mapreduce_active_jobtracker_detection_window | 3 | MINUTES
JobTracker Activation Startup Tolerance | The amount of time after JobTracker(s) start that the lack of an active JobTracker will be tolerated. This is intended to allow either the auto-failover daemon to make a JobTracker active, or a specifically issued failover command to take effect. This is an advanced option that does not often need to be changed. | mapreduce_jobtracker_activation_startup_tolerance | 180 | SECONDS
JobTracker Role Health Test | When computing the overall MapReduce cluster health, consider the JobTracker's health. | mapreduce_jobtracker_health_enabled | true | no unit
Standby JobTracker Health Test | When computing the overall cluster health, consider the health of the standby JobTracker. | mapreduce_standby_jobtrackers_health_enabled | true | no unit
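
The decision flow for this test can be approximated as follows. This is a rough sketch under stated assumptions, not the Service Monitor's actual implementation: the function and parameter names are hypothetical, the "Tolerated" label for a missing active JobTracker inside either time allowance is invented for the example, and "Not Good" is a placeholder because the description above does not say which result is returned when an active JobTracker exists but one of the JobTrackers is unhealthy. The sketch also assumes the service is running.

```python
from typing import Optional

DETECTION_WINDOW_S = 3 * 60  # Active JobTracker Detection Window default (3 MINUTES)
STARTUP_TOLERANCE_S = 180    # JobTracker Activation Startup Tolerance default (180 SECONDS)


def jobtracker_health(
    active_healthy: Optional[bool],    # None means no active JobTracker was found
    standby_healthy: Optional[bool],   # None means no standby is configured
    seconds_without_active: float = float("inf"),
    seconds_since_startup: float = float("inf"),
    check_standby: bool = True,        # mirrors the Standby JobTracker Health Test setting
) -> str:
    if active_healthy is None:
        # Assumption: a missing active JobTracker is tolerated while either
        # the startup tolerance or the detection window has not yet elapsed.
        if (seconds_since_startup <= STARTUP_TOLERANCE_S
                or seconds_without_active <= DETECTION_WINDOW_S):
            return "Tolerated"  # hypothetical label, not a documented result
        return "Bad"            # documented: no active JobTracker found
    # The standby counts only if one is configured and its check is enabled.
    standby_ok = standby_healthy is None or not check_standby or standby_healthy
    if active_healthy and standby_ok:
        return "Good"           # documented: both JobTrackers healthy
    return "Not Good"           # placeholder: severity is not stated in the text
```

With the defaults, a healthy active JobTracker and an unhealthy standby give "Not Good", while the same inputs with check_standby=False give "Good".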

MapReduce TaskTracker Health

This is a MapReduce service-level health test that checks that enough of the TaskTrackers in the cluster are healthy. The test returns "Concerning" health if the number of healthy TaskTrackers falls below a warning threshold, expressed as a percentage of the total number of TaskTrackers. The test returns "Bad" health if the combined number of healthy and "Concerning" TaskTrackers falls below a critical threshold, expressed as a percentage of the total number of TaskTrackers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 TaskTrackers, this test would return "Good" health if 95 or more TaskTrackers have good health. This test would return "Concerning" health if fewer than 95 TaskTrackers have good health but at least 90 TaskTrackers have either "Good" or "Concerning" health. If more than 10 TaskTrackers have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy TaskTrackers. Check the status of the individual TaskTrackers for more information. This test can be configured using the Healthy TaskTracker Monitoring Thresholds MapReduce service-wide monitoring setting.

Short Name: TaskTracker Health

Property Name | Description | Template Name | Default Value | Unit
Healthy TaskTracker Monitoring Thresholds | The health test thresholds of the overall TaskTracker health. The check returns "Concerning" health if the percentage of "Healthy" TaskTrackers falls below the warning threshold. The check returns "Bad" health if the total percentage of "Healthy" and "Concerning" TaskTrackers falls below the critical threshold. | mapreduce_tasktrackers_healthy_thresholds | critical:90.0, warning:95.0 | PERCENT
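
The threshold arithmetic for this test can be checked with a small Python sketch. The function name and its arguments are hypothetical; only the two default thresholds and the comparison rules are taken from the description above, and the sketch assumes the critical check takes precedence over the warning check.

```python
def tasktracker_health(
    good: int,
    concerning: int,
    bad: int,
    warning_pct: float = 95.0,   # default warning threshold
    critical_pct: float = 90.0,  # default critical threshold
) -> str:
    total = good + concerning + bad
    if total == 0:
        raise ValueError("at least one TaskTracker is required")
    good_pct = 100.0 * good / total
    good_or_concerning_pct = 100.0 * (good + concerning) / total
    # "Bad": healthy plus "Concerning" TaskTrackers fall below the critical threshold.
    if good_or_concerning_pct < critical_pct:
        return "Bad"
    # "Concerning": healthy TaskTrackers alone fall below the warning threshold.
    if good_pct < warning_pct:
        return "Concerning"
    return "Good"
```

Applied to the 100-TaskTracker example: 95 good and 5 bad gives "Good", 80 good with 15 concerning and 5 bad gives "Concerning", and 89 good with 11 bad gives "Bad". Exactly 10 bad TaskTrackers leaves 90% good or concerning, which is not below the critical threshold, so the result is "Concerning" rather than "Bad".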