Metrics

This section lists examples of the most commonly used metrics, their significance, and configuration changes to consider in response to metric variations. For the complete list, see Kafka Metrics.

Kafka Cluster Metrics

Metric Description Significance Action
Active Controllers

Shows a line for each broker that acted as an active controller during the charted time period.

A non-zero value indicates that the broker was the active controller during that time. When zoomed out to non-raw data, fractional values can occur during transitions between active controllers.

Some issues, such as failure of the Create Topic command, require that you check controller logs. Check the Active Controllers metric to see which broker was the controller when the issue occurred.
Total Messages Received Across Kafka Brokers Number of messages received from producers. This is an indicator of overall workload, based on the quantity of messages. Consider adding resources when workload approaches maximum capacity.
Total Bytes Received Across Kafka Brokers Amount of data the brokers received from producers.

This is an indicator of overall workload, based on the size of messages.

Consider adding resources when workload approaches maximum capacity.
Total Bytes Fetched Across Kafka Brokers Amount of data consumers read from the brokers. This is an indicator of overall workload, based on consumer demand. Consider adding resources when workload approaches maximum capacity.
Total Partitions Across Kafka Brokers Number of partitions (leader or follower replicas) on the broker. Cloudera recommends no more than 2000 partitions per broker. Consider adding brokers and rebalancing partitions.
Total Leader Replicas Across Kafka Brokers Number of leader replicas on the broker. The number of leader replicas should be roughly the same on all brokers. If one broker has significantly more leader replicas, it might be overloaded (check network, CPU, and disk metrics to see if this is the case). Set Enable automatic rebalancing of leadership to preferred replicas to true.
Total Offline Partitions Across Kafka Brokers The number of unavailable partitions. Offline partitions are not available for reading or writing. This can happen for several reasons (for example, when all brokers that host replicas of a partition are down). Restart the brokers, if needed, and check the logs for errors.
Total Under Replicated Partitions Across Kafka Brokers The number of partitions with unavailable replicas. An under-replicated partition has one or more replicas that are not available. This is usually because a broker is down. Restart the broker, and check for errors in the logs.
Informational Events The number of informational events. An event is a record that something of interest has occurred – a service's health has changed state, a log message (of the appropriate severity) has been logged, and so on. Many events are enabled and configured by default. See Events. See Configuring Monitoring Settings.
Important Events and Alerts The number of recent alerts and important or critical events. An alert is an event that is considered especially noteworthy and is triggered by a selected event. Alerts are shown with a badge when they appear in a list of events. You can configure the Alert Publisher to send alert notifications by email or via SNMP trap to a trap receiver. See Alerts. See Managing Alerts.
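
The partition-related rows in the table above can be illustrated with a minimal Python sketch. It assumes you already have a snapshot of partition assignments from some source; the PartitionState type, the summarize function, and the leader_skew ratio are hypothetical names introduced here for illustration and are not part of Cloudera Manager or Kafka. The sketch counts replicas and leaders per broker, lists offline and under-replicated partitions, and flags brokers above the 2000-partition guideline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

MAX_PARTITIONS_PER_BROKER = 2000  # guideline quoted in the table above


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replica assignment."""
    topic: str
    partition: int
    leader: Optional[int]  # broker id, or None when no leader is available
    replicas: tuple        # broker ids assigned to the partition
    isr: tuple             # broker ids currently in sync


def summarize(partitions):
    """Derive cluster-level counts similar to those described in the table."""
    replicas_per_broker = Counter()
    leaders_per_broker = Counter()
    offline, under_replicated = [], []
    for p in partitions:
        replicas_per_broker.update(p.replicas)
        if p.leader is None:
            # No leader: the partition cannot be read from or written to.
            offline.append((p.topic, p.partition))
            continue
        leaders_per_broker[p.leader] += 1
        if len(p.isr) < len(p.replicas):
            # Assumption of this sketch: only partitions that still have
            # a leader are reported as under-replicated.
            under_replicated.append((p.topic, p.partition))

    total_leaders = sum(leaders_per_broker.values())
    # Ratio of the busiest broker's leader count to the per-broker average;
    # a value near 1.0 means leadership is evenly spread.
    leader_skew = (
        max(leaders_per_broker.values()) * len(replicas_per_broker) / total_leaders
        if total_leaders
        else 0.0
    )
    return {
        "replicas_per_broker": dict(replicas_per_broker),
        "leaders_per_broker": dict(leaders_per_broker),
        "offline": offline,
        "under_replicated": under_replicated,
        "over_limit": sorted(
            b for b, n in replicas_per_broker.items()
            if n > MAX_PARTITIONS_PER_BROKER
        ),
        "leader_skew": leader_skew,
    }
```

A leader_skew well above 1.0 suggests that one broker leads a disproportionate share of partitions, which is the situation the Total Leader Replicas row describes. What counts as "significantly more" is a judgment call that this sketch does not decide.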

Kafka Broker Metrics in Cloudera Manager

These metrics are tracked by default. You can add some or all of these metrics to the standard dashboard, or create a custom dashboard with only those items of particular interest. All of the metrics you can see at cluster level can also be shown at broker level.

Metric Description Significance Action
Health

The percentage of time this entity has spent in various health states. This chart can be used to see times in the past when the entity was healthy or unhealthy and to get a visual representation of the amount of time it was healthy or unhealthy.

Checks the amount of swap memory in use by the role. A failure of this health test might indicate that your machine is overloaded. Adjust Process Swap Memory Thresholds monitoring settings for this role instance.
Host Memory Usage Host memory usage, broken into various usage categories, including swap. The host's memory capacity is shown as a horizontal line near the top of the chart. An overcommitted host's usage extends past this line. Adjust Process Swap Memory Thresholds for this role instance.
Host Swap Rate

Host memory/disk swap rate.

In general, any swap is undesirable. Non-trivial swapping can lead to performance issues.

Adjust Process Swap Memory Thresholds for this role instance.

Host CPU Usage

Host CPU usage, broken into user and system usage.

  Adjust Cgroup CPU Shares for this Host instance.
Role CPU Usage

Role CPU usage, broken into user and system usage.

  Adjust Cgroup CPU Shares for this role instance.
Resident Memory

Resident memory in use.

  Set Cgroup Memory Soft Limit and Cgroup Memory Hard Limit to -1 to specify there is no limit. Consider adding resources to the cluster.
Host Network Throughput The total network read and write I/O, across all of the host's network interfaces.   Consider adding resources to the host, or moving partitions to a different broker.
Disk Latency

Latency statistics across each of the host's interfaces.

  Consider adding resources to the host, or moving partitions to a different broker.
Aggregate Disk Throughput Total disk read and write I/O, across all of the host's disks.   Consider adding resources to the host, or moving partitions to a different broker.
Aggregate Disk IOPS

Total disk read and write IOPS, across all of the host's disks.

  Consider adding resources to the host, or moving partitions to a different broker.
ISR Expansions

Number of times In-Sync Replicas for a partition expanded.

If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISRs expand once the replicas are fully caught up. Otherwise, the expected value for the ISR expansion rate is 0.

If ISR is expanding and shrinking frequently, adjust Allowed replica lag.

ISR Shrinks Number of times In-Sync Replicas for a partition shrank. If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISRs expand once the replicas are fully caught up. Otherwise, the expected value for the ISR shrink rate is 0.

If ISR is expanding and shrinking frequently, adjust Allowed replica lag.
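
One way to tell a single broker restart apart from ISRs that expand and shrink frequently is to count how many separate sampling intervals saw shrinks and expansions. The following Python sketch assumes you have exported per-interval counts of the ISR Shrinks and ISR Expansions metrics for the same time window; the function name isr_flapping and the min_intervals threshold are hypothetical choices for illustration, not Cloudera Manager settings.

```python
def isr_flapping(shrinks, expansions, min_intervals=3):
    """Return True when ISR shrinks and expansions recur across the window.

    shrinks, expansions: per-interval event counts covering the same window.
    min_intervals: hypothetical threshold for how many separate intervals
    must show activity before the pattern is treated as frequent.
    """
    if len(shrinks) != len(expansions):
        raise ValueError("both series must cover the same intervals")
    # Count intervals with activity rather than summing events, because a
    # single broker restart can shrink many partitions' ISRs at once.
    shrink_intervals = sum(1 for n in shrinks if n > 0)
    expansion_intervals = sum(1 for n in expansions if n > 0)
    return shrink_intervals >= min_intervals and expansion_intervals >= min_intervals


# One restart: a burst of shrinks, followed later by a burst of expansions.
single_restart = isr_flapping([0, 5, 0, 0, 0, 0], [0, 0, 0, 5, 0, 0])

# Repeated alternation between shrinks and expansions across the window.
repeated = isr_flapping([2, 0, 1, 0, 3, 0], [0, 2, 0, 1, 0, 3])
```

Here single_restart is False and repeated is True. Only the second pattern matches the case where the table suggests adjusting Allowed replica lag.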