Analyzing an Altus Job with Workload Analytics

The information that Workload Analytics collects and parses can be viewed on a set of pages within the Altus console. The pages show high-level information regarding health checks, execution details, and baselines. In addition, you can find more specific information, such as details for specific stages, error messages, logs, and configurations.

Use Workload Analytics to identify issues with a job, such as an execution failure due to bad SQL. After an issue is identified, you can clone the job from the Actions menu, take corrective action, and run the job again.

Health Checks

Health checks are a series of tests that Workload Analytics performs when a job ends. They provide insight into the performance of a job, such as how much data the job processed and how long it took.

You can view the full list of health checks that run and their status on the Health Checks page.

The Health Checks page for Workload Analytics uses three panes to display information for a job. The left pane lists the health checks that Workload Analytics performed. The middle pane shows the job stages that a specific health check inspected. By default, healthy stages are hidden. The right pane lists detailed information for the selected stage.

Execution Completion
Determines whether a job succeeded or failed. This health check is displayed only when a job fails.

Baseline

The baseline health checks use information from previous runs of the same job to measure the performance of the current run. If there is insufficient baseline data for a health check, the health check reports a healthy status. For more information about baselines, see Baseline. A sketch of the comparison rule that these checks share appears below, after the Output Size check.

Duration
Compares the completion time of the job to a baseline based on previous runs of the same job. A healthy status indicates that the difference in duration between the current job and the baseline median is less than 25% of the median and less than five minutes.
Input Size
Compares the input for the current run of a job to a baseline for the job. A healthy status indicates that the difference in input data between the current job and the baseline median is less than 25% of the median and less than 100 MB.
To calculate input size, Workload Analytics uses the following metrics:
  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_READ
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_READ
  • SPARK:INPUT_BYTES
Output Size
Compares the output for the current run of a job to the baseline for the job. A healthy status indicates that the difference in output data between the current job and the baseline median is less than 25% of the median and less than 100 MB.
To calculate output size, Workload Analytics uses the following metrics:
  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_WRITTEN
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_WRITTEN
  • SPARK:OUTPUT_BYTES
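
The Duration, Input Size, and Output Size checks all apply the same rule: the current run is compared to the baseline median, and the status stays healthy while the difference remains under both a relative limit (25% of the median) and an absolute limit (five minutes or 100 MB). The following Python sketch illustrates that rule as stated above; the function name and sample values are hypothetical and are not part of Workload Analytics.

  def baseline_check_healthy(current, baseline_median, relative_limit, absolute_limit):
      """Healthy while the difference from the baseline median stays under
      both the relative limit and the absolute limit."""
      difference = abs(current - baseline_median)
      return (difference < relative_limit * baseline_median
              and difference < absolute_limit)

  # Duration check: 25% of the median and five minutes (values in seconds).
  duration_healthy = baseline_check_healthy(
      current=1_380, baseline_median=1_200,
      relative_limit=0.25, absolute_limit=5 * 60)

  # Input Size and Output Size checks: 25% of the median and 100 MB (values in bytes).
  input_healthy = baseline_check_healthy(
      current=2_100_000_000, baseline_median=2_000_000_000,
      relative_limit=0.25, absolute_limit=100 * 1024 ** 2)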

Skew

The skew health checks compare the performance of tasks to other tasks within the same job. For optimal performance, tasks within the same job should perform roughly the same amount of processing. A sketch of the comparison that these checks share appears below, after the Read Speed check.

Task Duration
Compares the amount of time tasks take to finish their processing. A healthy status indicates that each successful task's duration is within two standard deviations and within five minutes of the average duration for all tasks. If the status is not healthy, try configuring the job so that processing is distributed evenly across tasks as a starting point.
Input Data
Compares the amount of input data that each task processed. A healthy status indicates that each task's input size is within two standard deviations and within 100 MB of the average input size. If the status is not healthy, try partitioning data so that each task processes a similar amount of input as a starting point.
Output Data
Compares the amount of output data that each task generated. A healthy status indicates that each task's output size is within two standard deviations and within 100 MB of the average output size. If the status is not healthy, try partitioning data so that each task generates a similar amount of output as a starting point.
Shuffle Input
Compares the input size during the shuffle phase for tasks. A healthy status indicates that each task's shuffle input size is within two standard deviations and within 100 MB of the average shuffle input size. If the status is not healthy, try distributing input data so that tasks process similar amounts of data during the shuffle phase as a starting point.
Read Speed
Compares the data processing speed for each task. A healthy status indicates that each task's read speed is within two standard deviations and within 1 MB/s of the average for all tasks.
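
Each skew check follows the same pattern: a per-task metric is compared to the average across all tasks in the job, and a task is flagged only when it deviates from that average by more than two standard deviations and by more than an absolute floor (five minutes, 100 MB, or 1 MB/s, depending on the check). The sketch below illustrates that pattern; the function name and sample values are hypothetical and do not reflect the actual Workload Analytics implementation.

  from statistics import mean, stdev

  def skewed_values(values, absolute_floor):
      """Return the values that deviate from the average by more than two
      standard deviations and by more than the absolute floor."""
      if len(values) < 2:
          return []
      avg, sd = mean(values), stdev(values)
      return [v for v in values
              if abs(v - avg) > 2 * sd and abs(v - avg) > absolute_floor]

  # Task Duration skew: per-task durations in seconds, five-minute floor.
  slow_tasks = skewed_values([190] * 19 + [1_900], absolute_floor=5 * 60)

  # Input Data skew: per-task input sizes in bytes, 100 MB floor.
  uneven_inputs = skewed_values(
      [2_000_000_000] * 19 + [9_500_000_000], absolute_floor=100 * 1024 ** 2)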

Resources

The resource health checks determine whether task performance was affected by insufficient resources. A sketch of two of these checks appears below, after the Task Retries check.

Task Wait Time
Determines whether some tasks took too long to start a successful attempt. A healthy status indicates that successful tasks waited less than 15 minutes and less than 40% of the total task duration before starting. Sufficient resources cut the run time of the job by lowering the maximum wait duration. If the status is not healthy, try giving the job more resources by running it in resource pools with less contention or by adding more nodes to the cluster as a starting point.
Disk Spillage
Determines whether tasks spilled too much data to disk and ran slowly as a result of the extra disk I/O. A healthy status indicates that the total number of spilled records is less than 1000 and that the number of spilled records divided by the number of output records is less than three. If the status is not healthy, try increasing the available memory as a starting point.
Task Garbage Collection (GC) Duration
Determines whether tasks spent more than 10 minutes performing garbage collection. Long garbage collection times contribute to task duration and slow down the application. If the status is not healthy, try giving more memory to tasks or tuning the garbage collection configuration for the application as a starting point.
Task Retries
Determines whether the number of failed task attempts exceeds 10% of the total number of tasks. Failed attempts need to be repeated, leading to poor performance and resource waste.
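
The resource checks reduce to threshold tests on task-level counters. The sketch below shows the Disk Spillage and Task Retries rules as stated above; the function names and example inputs are hypothetical and are not Workload Analytics metric names.

  def disk_spillage_healthy(spilled_records, output_records):
      """Healthy when fewer than 1000 records were spilled and the ratio of
      spilled records to output records is less than three."""
      if output_records == 0:
          return spilled_records < 1000   # avoid dividing by zero
      return spilled_records < 1000 and spilled_records / output_records < 3

  def task_retries_healthy(failed_attempts, total_tasks):
      """Healthy when failed task attempts do not exceed 10% of the total
      number of tasks."""
      return failed_attempts <= 0.10 * total_tasks

  # 250 spilled records against 1,000,000 output records stays healthy;
  # 12 failed attempts out of 100 tasks exceeds 10% and is flagged.
  print(disk_spillage_healthy(spilled_records=250, output_records=1_000_000))  # True
  print(task_retries_healthy(failed_attempts=12, total_tasks=100))             # False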

Execution Details

The Execution Details page displays a chronological list of all the stages for a job. In the left pane, you can view the time when a stage ran and its duration. In the right pane, you can view more detailed information, such as logs and configurations.

Baseline

Baselines provide a way to measure the current performance of a job against the average performance of previous runs. Baselines use performance data from the 30 most recent runs of a job and require a minimum of three runs, so baseline comparisons start with the fourth run of a job. When a baseline is first created, comparisons against it can show drastic differences. As a baseline matures and more runs of the job are added to it, you can see a more established trend of what is normal for the job.
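
As a rough model of that windowing rule, the sketch below keeps only the 30 most recent runs and returns a baseline median once at least three runs are available, which is why comparisons begin with the fourth run. The function and sample values are illustrative assumptions, not the Workload Analytics implementation.

  from statistics import median

  def baseline_median(previous_runs):
      """Return the median over the 30 most recent previous runs, or None
      when fewer than three previous runs exist (no baseline yet)."""
      window = previous_runs[-30:]   # only the 30 most recent runs count
      if len(window) < 3:            # a minimum of three runs is required
          return None
      return median(window)

  # Durations (in seconds) of the first three runs form the initial baseline,
  # so the fourth run is the first one that can be compared against it.
  history = [620, 640, 610]
  print(baseline_median(history))    # 620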

On the Baseline page you can compare the performance of the current job and its stages to the baseline for previous runs. Click the drop-down menu with the job name to select a specific stage.

Trend

The Trend page shows the following historical trends for a recurring job:
  • Duration
  • Data Processed
  • Data Generated
Additionally, you can see an overview of the previous runs of the job and basic information, such as their status and health issues.