This is the documentation for Cloudera Manager 4.8.5.
Documentation for other versions is available at Cloudera Documentation.

Metric Aggregation

It is often useful to see an aggregated view of the activity on a cluster. For example, you might want to see the average number of bytes read per DataNode, or the maximum number of bytes read by any DataNode. To make this easy, Cloudera Manager pre-aggregates many of these metrics and allows you to access them through charts.

What Metrics Are Aggregated

Cloudera Manager aggregates metrics based on the category of the entity that generated them. The categories map to components in the system such as hosts, disks, RegionServers, and HDFS services. Metrics are aggregated from their generating entity to larger entities of which they are a part. For example, metrics that are generated by disks, network interfaces, and file systems are aggregated to their respective hosts and clusters. Generally, this hierarchy is defined as follows:
  • Disk, network interface, file system - host, cluster
  • Host - cluster
  • Role - service, cluster
  • HTables - HBase service, cluster
  • Agents - Flume service, cluster
  • FlumeChannel, FlumeSource, FlumeSink - Flume service, cluster
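The hierarchy above can be read as a lookup from the category of the generating entity to the larger entities its metrics roll up to. The following is a minimal sketch of that reading; the dictionary layout, the key spellings, and the function name rollup_targets are illustrative assumptions, not Cloudera Manager's internal representation.

```python
# Roll-up targets per entity category, transcribed from the list above.
# Key spellings are illustrative; they are not Cloudera Manager identifiers.
ROLLUP_TARGETS = {
    "disk": ["host", "cluster"],
    "network interface": ["host", "cluster"],
    "file system": ["host", "cluster"],
    "host": ["cluster"],
    "role": ["service", "cluster"],
    "htable": ["HBase service", "cluster"],
    "agent": ["Flume service", "cluster"],
    "flumechannel": ["Flume service", "cluster"],
    "flumesource": ["Flume service", "cluster"],
    "flumesink": ["Flume service", "cluster"],
}


def rollup_targets(category):
    """Return the larger entities that metrics from `category` are aggregated to.

    Unknown categories return an empty list.
    """
    return ROLLUP_TARGETS.get(category.lower(), [])


disk_targets = rollup_targets("Disk")
host_targets = rollup_targets("host")
```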

Aggregate Types

Cloudera Manager supports five types of aggregates:
  • Maximum - the largest value for any entity
  • Minimum - the smallest value for any entity
  • Average - the average value for all entities
  • Standard deviation - the standard deviation of the values for all entities
  • Sum - the total of the values for all entities
Each aggregate is calculated every minute and takes into account all the metrics logged over the previous minute. For example, the metric cpu_percent_host_max takes into account all cpu_percent metrics logged by all hosts in a cluster in the previous minute.
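As an illustration, the five aggregates for one minute of data could be computed as in the sketch below. The host names and readings are made up, and the use of the population standard deviation is an assumption; this page does not say which form Cloudera Manager uses.

```python
import statistics


def aggregate(values):
    """Compute the five aggregate types over the values logged in one minute."""
    if not values:
        raise ValueError("at least one value is required")
    return {
        "max": max(values),
        "min": min(values),
        "avg": statistics.mean(values),
        # Assumption: population standard deviation.
        "std_dev": statistics.pstdev(values),
        "sum": sum(values),
    }


# Hypothetical cpu_percent readings logged by four hosts in the previous minute.
cpu_percent_by_host = {"host1": 12.0, "host2": 48.0, "host3": 30.0, "host4": 30.0}

result = aggregate(list(cpu_percent_by_host.values()))

# Attach the aggregate names used for host metrics, such as cpu_percent_host_max.
cluster_aggregates = {f"cpu_percent_host_{name}": value for name, value in result.items()}
```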

Example Use Cases

Use Case 1: Compare the maximum, minimum, and average CPU usage across a cluster

  1. Select Charts > Search.
  2. Enter the tsquery statement:
    SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg
  3. Click Search. You should see three charts, each with CPU data.
  4. Click Facets > All Combined in the left column. Now you should see all the data on one chart.

Use Case 2: Compare the CPU usage of a single host to the max, min, and average for the cluster

  1. Follow the instructions from Use Case 1, except in step 2 enter the following statement instead:
    SELECT cpu_percent_host_max, cpu_percent_host_min, cpu_percent_host_avg, cpu_percent where category=cluster or hostname='MYHOST.COM'
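If you assemble statements like these in a script before pasting them into the search box, a small helper can keep the metric list and the filter separate. The sketch below is only a string builder; build_tsquery is a hypothetical name, it does not validate tsquery syntax, and it does not contact Cloudera Manager.

```python
def build_tsquery(metrics, predicate=None):
    """Join metric names into a SELECT statement with an optional filter.

    The predicate is appended verbatim after "where"; no escaping or
    validation is performed.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    query = "SELECT " + ", ".join(metrics)
    if predicate:
        query += " where " + predicate
    return query


cluster_metrics = ["cpu_percent_host_max", "cpu_percent_host_min", "cpu_percent_host_avg"]

# Use Case 1: the three cluster-wide aggregates.
use_case_1 = build_tsquery(cluster_metrics)

# Use Case 2: add the raw metric for one host (MYHOST.COM is a placeholder).
use_case_2 = build_tsquery(
    cluster_metrics + ["cpu_percent"],
    "category=cluster or hostname='MYHOST.COM'",
)
```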

Aggregate Metric Names

To access aggregated metrics, it helps to know how they are named. The name has three components:
  • The metric being aggregated - for example, cpu_percent or jvm_gc_count
  • The category of the entity generating the metric - for example, "host" or "RegionServer"
  • The aggregate type - for example, "max" or "avg"

These parts are combined to form an aggregate name such as "cpu_percent_host_max".
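The composition can be sketched as a simple join. The function name aggregate_metric_name is hypothetical, and lowercasing the category is inferred from names that appear on this page, such as jvm_gc_count_regionserver_sum.

```python
def aggregate_metric_name(metric, category, aggregate_type):
    """Join the three name components with underscores.

    The category is lowercased, so "RegionServer" becomes "regionserver".
    """
    return "_".join([metric, category.lower(), aggregate_type])


host_max = aggregate_metric_name("cpu_percent", "host", "max")
gc_sum = aggregate_metric_name("jvm_gc_count", "RegionServer", "sum")
```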

The final component, the aggregate type, varies by the type of the metric. Cloudera Manager supports three types of metrics: gauges, weighted gauges, and counters.

Gauges

A gauge is a metric that can go up and down, such as cpu_percent. Gauges have a straightforward naming convention:
  • maximum - "max"
  • minimum - "min"
  • average - "avg"
  • standard deviation - "std_dev"
  • sum - "sum"

Weighted Gauges

A weighted gauge weights a gauge by the number of counts of that gauge. Consider the HBase RegionServer metric put_avg_time, which tracks the average put time for each RegionServer. Suppose you have two RegionServers: one performed 10,000 puts with an average time of one millisecond per put, and the other performed 10 puts with an average time of one second per put. Simply averaging the two averages gives about half a second for the whole service, which does not reflect reality, because nearly all of the puts took one millisecond.

Instead, if you weight each RegionServer's average by the number of puts it performed, you get a more accurate number:

Total puts = 10,000 + 10 = 10,010 puts

Total time = (10,000 * 1 ms) + (10 * 1,000 ms) = 20,000 ms

Average time = (20,000 ms) / (10,010 puts) = ~2 ms

Weighted gauges perform this calculation. Their aggregates are named as follows:
  • maximum - "max"
  • minimum - "min"
  • average - "weighted_avg"
  • standard deviation - "weighted_std_dev"
  • sum - "sum". For weighted gauges, sum aggregations represent the weighted total and are not an average. In the example above, the value would be 20,000 ms and the name would be put_time_regionserver_sum.
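The put-time example can be checked with a short sketch. The function name weighted_aggregates and the (count, average) input layout are assumptions made for illustration; only the arithmetic comes from this page.

```python
def weighted_aggregates(per_entity):
    """Compute the weighted total and weighted average.

    per_entity is a list of (count, average) pairs, one per RegionServer.
    """
    total_count = sum(count for count, _ in per_entity)
    if total_count == 0:
        raise ValueError("total count must be greater than zero")
    # Weighted total: each average multiplied by the number of events behind it.
    weighted_total = sum(count * average for count, average in per_entity)
    return {
        "sum": weighted_total,
        "weighted_avg": weighted_total / total_count,
    }


# (number of puts, average put time in ms) for the two RegionServers.
region_servers = [(10_000, 1.0), (10, 1000.0)]

# Averaging the averages ignores how many puts each RegionServer performed.
naive_avg = sum(average for _, average in region_servers) / len(region_servers)

result = weighted_aggregates(region_servers)
```

Here naive_avg is 500.5 ms, while the weighted average is just under 2 ms and the weighted total is 20,000 ms.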

Counters

A counter is a metric that tracks the total count since a process or host started. An example of a counter is jvm_gc_count, which tracks the number of Java garbage collections since a Java process started. Because users are usually more interested in the rate of change of a counter (for example, how many garbage collections occurred per second over the last five minutes) than in its raw value, Cloudera Manager calculates counter aggregates in terms of rate. They are named as follows:
  • maximum - "max_rate"
  • minimum - "min_rate"
  • average - "avg_rate"
  • standard deviation - "std_dev_rate"
  • sum - For counters, sum aggregations represent the total number of times an event occurred and are not a rate. In this case the word "sum" is appended to the end of the name. For example: jvm_gc_count_regionserver_sum.
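To make the difference between a counter's raw value and its rate concrete, the sketch below derives a per-second rate for each entity from two counter samples and then aggregates the rates. Everything here is an assumption for illustration: the function names, the sample values, the population standard deviation, and the reading of "sum" as the total increase across entities over the window. This page does not describe how Cloudera Manager performs the calculation.

```python
import statistics


def counter_rate(first_value, last_value, elapsed_seconds):
    """Rate of change of a counter between two samples, in events per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (last_value - first_value) / elapsed_seconds


def counter_aggregates(samples, elapsed_seconds):
    """Aggregate counter samples taken at the start and end of one window.

    samples maps an entity name to a (first_value, last_value) pair.
    """
    rates = [
        counter_rate(first, last, elapsed_seconds)
        for first, last in samples.values()
    ]
    return {
        "max_rate": max(rates),
        "min_rate": min(rates),
        "avg_rate": statistics.mean(rates),
        # Assumption: population standard deviation.
        "std_dev_rate": statistics.pstdev(rates),
        # Assumption: total number of events in the window, not a rate.
        "sum": sum(last - first for first, last in samples.values()),
    }


# Hypothetical jvm_gc_count samples for two RegionServers, 60 seconds apart.
gc_counts = {"rs1": (100, 130), "rs2": (400, 460)}
result = counter_aggregates(gc_counts, 60)
```

This sketch does not handle a counter that resets because its process restarted during the window.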