Configuring Monitoring Settings

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

There are several types of monitoring settings you can configure in Cloudera Manager:

Health tests - For a service or role for which monitoring is provided, you can enable and disable selected health tests and events, configure how those health tests factor into the overall health of the service, and modify thresholds for the status of certain health tests. For hosts you can disable or enable selected health tests, modify thresholds, and enable or disable health alerts.
Free space - For hosts, you can set threshold-based monitoring of free space in the various directories on the hosts Cloudera Manager monitors.
Activities - For MapReduce, YARN, and Impala services, you can configure aspects of how Cloudera Manager monitors activities, applications, and queries.
Alerts - For all roles you can configure health alerts and configuration change alerts. You can also configure some service specific alerts and how alerts are delivered.
Log events - For all roles you can configure logging thresholds, log directories, log event capture, when log messages become events, and when to generate log alerts.
Monitoring roles - For the Cloudera Management Service you can configure monitoring settings for the monitoring roles themselves: you can enable and disable health tests on the monitoring processes and configure some general settings related to events and alerts (specifically with the Event Server and Alert Publisher). Each of the Cloudera Management Service roles has its own parameters that you can modify to specify how much data is retained by that service. For some monitoring functions, the amount of retained data can grow very large, so it may become necessary to adjust the limits.

For general information about modifying configuration settings, see Modifying Configuration Properties Using Cloudera Manager.

This section covers the following topics:

Configuring Health Monitoring
- Configuring Service Monitoring
- Configuring Host Monitoring
Configuring Directory Monitoring
Configuring Activity Monitoring
- Activity Duration Rules
Configuring YARN Application Monitoring
- Configuring Application Visibility
Configuring Impala Query Monitoring
- Configuring Query Visibility
- Configuring Impala Query Data Store Maximum Size
Configuring Alerts
Configuring Log Events

Configuring Health Monitoring

The initial health monitoring configuration is handled during the installation and configuration of your cluster, and most monitoring parameters have default settings. However, you can set or modify these at any time.

Depending on the service or role you select, and the configuration category, you can enable or disable health tests, determine when health tests cause alerts, or determine whether specific health tests are used in computing the overall health of a role or service. In most cases you can disable these "roll-up" health tests separately from the individual health tests.

As a rule, a health test whose result is considered "Concerning" or "Bad" is forwarded as an event to the Event Server. That includes health tests whose results are based on configured Warning or Critical thresholds, as well as pass-fail type health tests. An event is also published when the health test result returns to normal.

You can control when an individual health test is forwarded as an event or as an alert by modifying the threshold values for the relevant health test.
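The forwarding rule described above can be pictured with a small sketch. This is an illustration only, not Cloudera Manager's implementation: the function names are hypothetical, "Good" is assumed as the name of the normal state, and the example assumes a metric where higher values are worse.

```python
def health_status(value, warning, critical):
    """Classify a metric against Warning and Critical thresholds.

    Assumption: higher values are worse, and reaching a threshold counts
    as crossing it.
    """
    if value >= critical:
        return "Bad"
    if value >= warning:
        return "Concerning"
    return "Good"  # assumed name for the normal state


def should_publish_event(previous_status, current_status):
    """Decide whether a health test result is forwarded as an event.

    Concerning and Bad results are forwarded; a return to normal after a
    Concerning or Bad result is also published.
    """
    alerting = ("Concerning", "Bad")
    if current_status in alerting:
        return True
    return previous_status in alerting and current_status == "Good"
```

Under these assumptions, raising a Warning threshold makes a given value less likely to be classified as Concerning, and therefore less likely to be forwarded.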

Configuring Service Monitoring

Select Clusters > cluster_name > service_name.
Click the Configuration tab.
Select Scope > service_name (Service-Wide).
Select Category > Monitoring.
Locate the property to change or search for it by typing its name in the Search box.
Configure the property.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Host Monitoring

Click the Hosts tab.
Select a host.
Click the Configuration tab.
Select Scope > All.
Click the Monitoring category.
Configure the property.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Directory Monitoring

Cloudera Manager can perform threshold-based monitoring of free space in the various directories on the hosts it monitors—such as log directories or checkpoint directories (for the Secondary NameNode).

These thresholds can be set in one of two ways—as absolute thresholds (in terms of MiB and GiB, and so on) or as percentages of space. As with other threshold properties, you can set values that trigger events at both the Warning and Critical levels.

If you set both an absolute threshold and a percentage threshold, the Absolute Threshold setting is used.
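The precedence between the two threshold types can be sketched as follows. This is a minimal illustration, not the product's code: the function and parameter names are hypothetical, and it assumes that a threshold triggers when free space falls strictly below it.

```python
def free_space_status(free_bytes, total_bytes,
                      abs_warning=None, abs_critical=None,
                      pct_warning=None, pct_critical=None):
    """Return 'Critical', 'Warning', or 'OK' for a monitored directory.

    If any absolute threshold (in bytes) is set, the absolute thresholds
    are used and the percentage thresholds are ignored.
    """
    if abs_warning is not None or abs_critical is not None:
        value, warning, critical = free_bytes, abs_warning, abs_critical
    else:
        # Percentage thresholds compare against the share of free space.
        value = 100.0 * free_bytes / total_bytes
        warning, critical = pct_warning, pct_critical

    if critical is not None and value < critical:
        return "Critical"
    if warning is not None and value < warning:
        return "Warning"
    return "OK"
```

For example, with 5 GiB free on a 100 GiB volume, an absolute Warning threshold of 10 GiB yields "Warning" even if a percentage Warning threshold of 1% is also set, because the absolute setting takes precedence.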

Configuring Activity Monitoring

The Activity Monitor monitors the MapReduce MRv1 jobs running on your cluster. This also includes the higher-level activities, such as Pig, Hive, and Oozie workflows that run as MapReduce tasks.

You can monitor for slow-running jobs or jobs that fail, and alert on these events. To detect jobs that are running too slowly, you must configure a set of activity duration rules that specify what jobs to monitor, and what the limits on duration are for those jobs. A "slow activity" event occurs when a job exceeds the duration limit configured for it in an activity duration rule. Activity duration rules are not defined by default; you must configure these rules if you want to see events for jobs that exceed the duration defined by these rules.

To configure Activity Monitor settings:

Go to the MapReduce service.
Click the Configuration tab.
Select Scope > MapReduce service_name (Service-Wide).
Click the Monitoring category.
Specify one or more activity duration rules.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Activity Duration Rules

An activity duration rule is a regular expression, used to match an activity name (that is, a Job ID), combined with a run time limit that the job should not exceed. You can add as many rules as you like, one per line, in the Activity Duration Rules property.

The format of each rule is regex=number, where regex is a regular expression to match against the activity name and number is the job duration limit, in minutes. When a new activity starts, each regular expression is tested against the name of the activity for a match.

The list of rules is tested in order, and the first match found is used. For example, if the rule set is:

foo=10
bar=20

any activity named "foo" would be marked slow if it ran for more than 10 minutes. Any activity named "bar" would be marked slow if it ran for more than 20 minutes.

Since Java regular expressions can be used, if the rule set is:

foo.*=10
bar=20

any activity with a name that starts with "foo" (for example, fool, food, foot) matches the first rule.

If there is no match for an activity, then that activity is not monitored for job duration. However, you can add a "catch-all" as the last rule that always matches any name:

foo.*=10
bar=20
baz=30
.*=60

In this case, any job that matches none of the first three rules is marked slow and generates an event if it runs longer than 60 minutes.
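The first-match behavior of the rule list can be sketched in a few lines. This is an illustration under stated assumptions rather than the Activity Monitor's code: the function names are hypothetical, Python's re module stands in for Java regular expressions (the two dialects differ in some details), and a rule is assumed to match only when it matches the whole activity name, which is consistent with the examples above.

```python
import re


def parse_rules(text):
    """Parse 'regex=number' lines into (compiled_pattern, minutes) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so the pattern itself may contain '='.
        pattern, _, minutes = line.rpartition("=")
        rules.append((re.compile(pattern), int(minutes)))
    return rules


def duration_limit(rules, activity_name):
    """Return the limit in minutes from the first matching rule, or None."""
    for pattern, minutes in rules:
        if pattern.fullmatch(activity_name):
            return minutes
    return None  # no rule matched: the activity is not monitored


rules = parse_rules("foo.*=10\nbar=20\nbaz=30\n.*=60")
```

With this rule set, duration_limit(rules, "food") returns 10, and a name that matches none of the first three rules falls through to the catch-all limit of 60.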

Configuring YARN Application Monitoring

You can configure the visibility of the YARN application monitoring results.

Configuring Application Visibility

To configure whether admin and non-admin users can view all applications, only that user's applications, or no applications:

Go to the YARN service.
Click the Configuration tab.
Select Scope > YARN service_name (Service-Wide).
Click the Monitoring category.
Set the Applications List Visibility Settings properties for admin and non-admin users.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Impala Query Monitoring

You can configure the visibility of the Impala query results and the size of the storage allocated to Impala query results.

Configuring Query Visibility

To configure whether admin and non-admin users can view all queries, only that user's queries, or no queries:

Go to the Impala service.
Click the Configuration tab.
Select Scope > Impala service_name (Service-Wide).
Click the Monitoring category.
Set the Visibility Settings properties for admin and non-admin users.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Impala Query Data Store Maximum Size

The query store stores enough information to make the query searchable through the filter language.

Do one of the following:
- Select Clusters > Cloudera Management Service.
- On the Home > Status tab, in the Cloudera Management Service table, click the Cloudera Management Service link.
Click the Configuration tab.
Select Scope > Service Monitor.
Click the Main category.
In the Impala Storage section, set the firehose_impala_storage_bytes property. The default is 1 GiB.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

The firehose_impala_storage_bytes property determines the approximate amount of disk space dedicated to storing Impala query data. Once the store reaches its maximum size, older data is deleted to make room for newer queries. The disk usage is approximate because data deletion begins only when the limit has been reached.
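The reason the disk usage is only approximate can be seen in a toy model of a size-bounded store. The class below is hypothetical and is not how the Service Monitor stores data; it only assumes that entries are kept in arrival order and that deletion starts after an addition pushes usage past the limit.

```python
from collections import deque


class BoundedQueryStore:
    """Toy store that evicts the oldest entries once the limit is exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = deque()  # (query_id, size_bytes), oldest first

    def add(self, query_id, size_bytes):
        # The new entry is written first, so usage can briefly exceed
        # the limit before older data is deleted.
        self._entries.append((query_id, size_bytes))
        self.used_bytes += size_bytes
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            _, old_size = self._entries.popleft()
            self.used_bytes -= old_size

    def query_ids(self):
        return [query_id for query_id, _ in self._entries]
```

In this model, adding three 40-byte entries to a store limited to 100 bytes leaves only the two newest entries, which mirrors how older queries make room for newer ones.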

Configuring Alerts

The following topics describe how to configure when alerts are raised and how they are delivered:

Enabling Activity Monitor Alerts
Enabling Configuration Change Alerts
Enabling HBase Alerts
Configuring Health Alerts
Configuring Log Alerts
Configuring Alert Delivery

Enabling Activity Monitor Alerts

You can enable alerts when an activity runs too slowly or fails.

Go to the MapReduce service.
Click the Configuration tab.
Select Scope > MapReduce service_name (Service-Wide).
Click the Monitoring category.
Check the Alert on Slow Activities or Alert on Activity Failure checkboxes.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Enabling Configuration Change Alerts

Configuration change alerts can be set service wide, or on specific roles for the service.

Click a service, role, or host.
Click the Configuration tab.
Select Scope > All.
Click the Monitoring category.
Check the Enable Configuration Change Alerts checkbox.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Enabling HBase Alerts

Go to the HBase service.
Click the Configuration tab.
Select Scope > HBase service_name (Service-Wide).
Click the Monitoring category.
Set one of the region or Hbck alerts:
- Hbck Region Error Count
- Hbck Error Count
- Hbck Alert Error Codes
- Hbck Slow Run
- Region Health Canary Slow Run
- Canary Unhealthy Region Count
- Canary Unhealthy Region Percentage
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Health Alerts

Enabling Health Alerts

You can enable alerts when the health of a role or service crosses a threshold.

Select Clusters > cluster_name > service_name or open the page for a role.
Click the Configuration tab.
Select Scope > role_name or service_name (Service-Wide).
Click the Monitoring category.
Check the Enable Health Alerts for this Role or Enable Service Level Health Alerts checkbox, depending on whether you are configuring a role or a service.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Modifying the Health Threshold

You can configure the threshold at which a health alert is raised.

Select Administration > Alerts.
Click to the right of Health Alert Threshold.
Select Scope > Event Server.
Click the Main category.
Select the Bad or Concerning option.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Alerts Transitioning Out of Alerting Health Threshold

You can configure an alert when a service or role instance transitions from an alerting to a non-alerting health threshold.

Select Administration > Alerts.
Click to the right of Alert on Transitions out of Alerting Health.
Select Scope > role_name or service_name (Service-Wide).
In the category Event Server Default Group, check the Alert on Transitions out of Alerting Health checkbox.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Log Alerts

You can configure an alert when a daemon emits a log message that matches a specified regular expression. See Configuring Log Alerts.

Configuring Alert Delivery

You can configure alerts to be delivered by email or sent as SNMP traps. If you choose email delivery, you can add to or modify the list of alert recipient email addresses. You can also send a test alert email. See Managing Alerts.

Configuring Log Events

You can enable or disable the forwarding of selected log events to the Event Server. This is enabled by default, and is a service-wide setting (Enable Log Event Capture) for each service for which monitoring is provided. You can enable and disable event capture for CDH services or for the Cloudera Management Service.

Configuring Logs

Go to a service.
Click the Configuration tab.
Select role_name (Service-Wide) > Logs.
Edit a log property.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Logging Thresholds

A logging threshold determines what level of log message is reported. The available levels are:

TRACE - Informational events finer-grained than DEBUG.
DEBUG - Informational events useful to debug an application.
INFO - Informational events that highlight progress at coarse-grained level.
WARN - Events that indicate a potential problem which is handled by the application.
ERROR - Error events that allow the application to continue running.
FATAL - Very severe error events that typically lead the application to abort.

TRACE is the least severe level and produces the largest number of messages. The default setting is INFO.
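The effect of a threshold can be illustrated with a short sketch that uses the level ordering listed above. The helper name is hypothetical; the sketch assumes that a message is reported when its level is at or above the configured threshold.

```python
# Levels in increasing order of severity, as listed above.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


def is_reported(message_level, threshold="INFO"):
    """Return True if a message at message_level passes the threshold."""
    return LEVELS.index(message_level) >= LEVELS.index(threshold)
```

With the default threshold of INFO, WARN messages are reported and DEBUG messages are not; lowering the threshold to TRACE reports every level.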

Go to a service.
Click the Configuration tab.
Enter Logging Threshold in the Search text field.
For the desired role group, select a logging threshold level.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Configuring Log Directories

Do one of the following:
- Cluster:
  1. On the Home > Status tab, click a cluster name.
  2. Select Configuration > Log Directories.
  3. Edit a role_name Log Directory property.
- Service:
  1. Go to a service.
  2. Click the Configuration tab.
  3. Select role_name (Service-Wide) > Logs.
  4. Edit the Log Directory property.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Enabling and Disabling Log Event Capture

Select Clusters > cluster_name > service_name.
Click the Configuration tab.
Select Scope > service_name (Service-Wide).
Click the Monitoring category.
Modify the Enable Log Event Capture setting.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

You can also modify the rules that determine how log messages are turned into events. Editing these rules is not recommended.

For each role, there are rules that govern how its log messages are turned into events by the custom log4j appender for the role. These are defined in the Rules to Extract Events from Log Files property.

Configuring Which Log Messages Become Events

Select Clusters > cluster_name > service_name.
Click the Configuration tab.
Enter Rules to Extract Events from Log Files in the Search text field.
Click the Monitoring category.
Select the role group for the role for which you want to configure log events, or search for "Rules to Extract Events from Log Files". Note that for some roles there may be more than one role group, and you may need to modify all of them. The easiest way to ensure that you have found all occurrences of the property you need to modify is to search for the property by name. Cloudera Manager shows all copies of the property that match the search filter.
In the Content field, edit the rules as needed. Rules can be written as regular expressions.
Enter a Reason for change, and then click Save Changes to commit the changes.
Return to the Home page by clicking the Cloudera Manager logo.
Click the icon that is next to any stale services to invoke the cluster restart wizard.

Cloudera defines a number of rules by default. For example:

The line {"rate": 10, "threshold": "FATAL"} means that log entries with severity FATAL should be forwarded as events, up to 10 a minute.
The line {"rate": 0, "exceptiontype": "java.io.EOFException"} means that log entries with the exception java.io.EOFException should always be forwarded as an event.

The syntax for these rules is defined in the Description field for this property: the syntax lets you create rules that identify log messages based on log4j severity, message content matching, or the exception type. These rules must result in valid JSON.
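To make the two example rules concrete, the following sketch evaluates them against a log entry. It is an approximation for illustration only: the helper names are hypothetical, the rules are wrapped in a bare JSON list because the surrounding structure of the property is not shown here, and the sketch assumes that rules are tried in order, that "threshold" matches entries at or above the given severity, and that a rate of 0 means no limit.

```python
import json

LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]

# The two example rules from the text, in an assumed bare-list wrapper.
RULES_JSON = """
[
  {"rate": 10, "threshold": "FATAL"},
  {"rate": 0, "exceptiontype": "java.io.EOFException"}
]
"""


def rule_matches(rule, severity, exception_type=None):
    """Check the severity and exception-type conditions of one rule."""
    if "threshold" in rule:
        if LEVELS.index(severity) < LEVELS.index(rule["threshold"]):
            return False
    if "exceptiontype" in rule:
        if exception_type != rule["exceptiontype"]:
            return False
    return True


def first_matching_rule(rules, severity, exception_type=None):
    """Return the first rule that matches the log entry, or None."""
    for rule in rules:
        if rule_matches(rule, severity, exception_type):
            return rule
    return None


def within_rate(rule, forwarded_this_minute):
    """Apply the per-minute cap; a rate of 0 is treated as unlimited."""
    rate = rule.get("rate", 0)
    return rate == 0 or forwarded_this_minute < rate


rules = json.loads(RULES_JSON)  # fails loudly if the rules are not valid JSON
```

Parsing the edited rules with a JSON parser before saving them is a simple way to confirm that they are still valid JSON.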

Configuring Log Alerts

You can specify that a log event should generate an alert by setting "alert":true in the rule. If you specify a content match, the entire content must match; to match on a partial string, you must provide wildcards as appropriate to allow matching the entire string.
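The whole-string requirement is easy to get wrong, so a small sketch may help. It is not the product's matcher: the function name is hypothetical, the "content" key is an assumed field name for the content match, Python's re module stands in for the actual regular expression engine, and all other rule fields are ignored.

```python
import re


def should_alert(rule, message):
    """Return True if the rule has "alert": true and its content matches.

    The content pattern must match the entire message, not a substring.
    """
    if not rule.get("alert", False):
        return False
    content = rule.get("content")  # assumed field name
    if content is None:
        return True
    return re.fullmatch(content, message) is not None
```

For the message "Disk is full on /data", a content pattern of "Disk is full" does not match, while ".*Disk is full.*" does, because the wildcards let the pattern cover the entire string.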
