ZooKeeper Health Tests

ZooKeeper Canary

This is a ZooKeeper service-level health test that checks that basic client operations are working and are completing in a reasonable amount of time. This test reports the results of a periodic "canary" test that performs the following sequence of operations. First, it connects to and establishes a session (the root session) with the ZooKeeper service and creates a permanent znode to serve as the root of all canary operations. The canary test then connects to and establishes sessions (the child sessions) with each ZooKeeper server of the service. Each child session is used to create an ephemeral child znode under the canary root. After the child znodes have been created, watches that await znode deletion events are registered with each of the child znodes for each of the child sessions. The canary test then deletes each of the child znodes and then verifies that each child session has received deletion notifications for each of the child znodes. Finally the canary test closes all the child sessions, deletes the root znode and closes the root session. The test returns "Bad" health if the establishment of the root session to the ZooKeeper service fails, the creation of znodes (permanent or ephemeral) fails, the deletion of znodes fails or the retrieval of child znodes of the root znode fails. The test returns "Concerning" health when the canary test succeeds but has one or more servers that could not participate in the canary test operations or if the canary test runs too slowly. A failure of this health test may indicate that ZooKeeper is failing to satisfy client requests correctly or in a timely fashion. Check the status of the ZooKeeper servers, and look in the ZooKeeper server logs for more details. This test can be enabled or disabled using the ZooKeeper Canary Health Check ZooKeeper service monitoring setting. The ZooKeeper Canary Root Znode Path, ZooKeeper Canary Connection Timeout, ZooKeeper Canary Session Timeout, ZooKeeper Canary Operation Timeout settings control the operation of the canary.

Short Name: ZooKeeper Canary

Property Name Description Template Name Default Value Unit
ZooKeeper Canary Connection Timeout Configures the timeout used by the canary for connection establishment with ZooKeeper servers zookeeper_canary_connection_timeout 10000 MILLISECONDS
ZooKeeper Canary Health Check Enables the health check that a client can connect to ZooKeeper and perform basic operations zookeeper_canary_health_enabled true no unit
ZooKeeper Canary Operation Timeout Configures the timeout used by the canary for ZooKeeper operations zookeeper_canary_operation_timeout 30000 MILLISECONDS
ZooKeeper Canary Root Znode Path Configures the path of the root znode under which all canary updates are performed zookeeper_canary_root_path /cloudera_manager_zookeeper_canary no unit
ZooKeeper Canary Session Timeout Configures the timeout used by the canary sessions with ZooKeeper servers zookeeper_canary_session_timeout 30000 MILLISECONDS

ZooKeeper Server Health

This is a ZooKeeper service-level health test that checks that enough of the Servers in the cluster are healthy. The test returns "Concerning" health if the number of healthy Servers falls below a warning threshold, expressed as a percentage of the total number of Servers. The test returns "Bad" health if the number of healthy and "Concerning" Servers falls below a critical threshold, expressed as a percentage of the total number of Servers. For example, if this test is configured with a warning threshold of 95% and a critical threshold of 90% for a cluster of 100 Servers, this test would return "Good" health if 95 or more Servers have good health. This test would return "Concerning" health if at least 90 Servers have either "Good" or "Concerning" health. If more than 10 Servers have bad health, this test would return "Bad" health. A failure of this health test indicates unhealthy Servers. Check the status of the individual Servers for more information. This test can be configured using the ZooKeeper ZooKeeper service-wide monitoring setting.

Short Name: Server Health

Property Name Description Template Name Default Value Unit
Healthy Server Monitoring Thresholds The health test thresholds of the overall Server health. The check returns "Concerning" health if the percentage of "Healthy" Servers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" Servers falls below the critical threshold. zookeeper_servers_healthy_thresholds critical:51.0, warning:99.0 PERCENT