Storage Space Planning for Cloudera Manager

Minimum Required Role: Full Administrator

Cloudera Manager tracks metrics of services, jobs, and applications in many background processes. All of these metrics require storage. Depending on the size of your organization, this storage can be local or remote, disk-based or in a database, managed by you or by another team in another location.

Most system administrators are aware of common locations like /var/log/ and the need for these locations to have adequate space. This topic helps you plan for the storage needs and data storage locations used by the Cloudera Manager Server and the Cloudera Management Service to store metrics and data.

Failing to plan for the storage needs of all components of the Cloudera Manager Server and the Cloudera Management Service can negatively impact your cluster in the following ways:

  • The cluster might not be able to retain historical operational data to meet internal requirements.
  • The cluster might miss critical audit information that was not gathered or retained for the required length of time.
  • Administrators might be unable to research past events or health status.
  • Administrators might not have historical MR1, YARN, or Impala usage data when they need to reference or report on them later.
  • There might be gaps in metrics collection and charts.
  • The cluster might experience data loss due to filling storage locations to 100% of capacity. The effects of such an event can impact many other components.

The main theme here is that you must architect your data storage needs well in advance. You must inform your operations staff about your critical data storage locations for each host so that they can provision your infrastructure adequately and back it up appropriately. Make sure to document the discovered requirements in your internal build documentation and run books.

This topic describes both local disk storage and RDBMS storage. This distinction is made both for storage planning and also to inform migration of roles from one host to another, preparing backups, and other lifecycle management events.

The following tables provide details about each individual Cloudera Management service to enable Cloudera Manager administrators to make appropriate storage and lifecycle planning decisions.

Cloudera Manager Server

Cloudera Manager Server
Configuration Topic Cloudera Manager Server Configuration
Default Storage Location RDBMS:

Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases.

Disk:

Cloudera Manager Server Local Data Storage Directory (command_storage_path) on the host where the Cloudera Manager Server is configured to run. This local path is used by Cloudera Manager for storing data, including command result files. Critical configurations are not stored in this location.

Default setting: /var/lib/cloudera-scm-server/

Storage Configuration Defaults, Minimum, or Maximum There are no direct storage defaults relevant to this entity.
Where to Control Data Retention or Size The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select Administration > Settings and edit the following property:
Command Eviction Age
Length of time after which inactive commands are evicted from the database.

Default is two years.

Sizing, Planning & Best Practices The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts.

Make sure that you perform regular, verified, remotely-stored backups of the Cloudera Manager Server database.

Cloudera Management Service

Cloudera Management Service - Activity Monitor Configuration
Configuration Topic Activity Monitor
Default Storage Location Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases.
Storage Configuration Defaults / Minimum / Maximum Default: 14 days' worth of MapReduce (MRv1) jobs/tasks
Where to Control Data Retention or Size

You control Activity Monitor storage usage by configuring the number of days or hours of data to retain. Older data is purged.

To configure data retention in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Activity Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate the following properties or search for them by typing the property name in the Search box:
    Purge Activities Data at This Age
    In Activity Monitor, purge data about MapReduce jobs and aggregate activities when the data reaches this age in hours. By default, Activity Monitor keeps data about activities for 336 hours (14 days).
    Purge Attempts Data at This Age
    In the Activity Monitor, purge data about MapReduce attempts when the data reaches this age in hours. Because attempt data can consume large amounts of database space, you might want to purge it more frequently than activity data. By default, Activity Monitor keeps data about attempts for 336 hours (14 days).
    Purge MapReduce Service Data at This Age
    The number of hours of past service-level data to keep in the Activity Monitor database, such as total slots running. The default is to keep data for 336 hours (14 days).
  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices

The Activity Monitor only monitors MapReduce jobs, and does not monitor YARN applications. If you no longer use MapReduce (MRv1) in your cluster, the Activity Monitor is not required for Cloudera Manager 5 (or higher) or CDH 5 (or higher).

The amount of storage space needed for 14 days' worth of MapReduce activities can vary greatly and depends directly on the size of your cluster and the level of activity that uses MapReduce. You might need to adjust the amount of storage more than once as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster.

For example, consider the following test cluster and usage:

  • A simulated 1000-host cluster, each host with 32 slots
  • MapReduce jobs with 200 attempts (tasks) per activity (job)

Sizing observations for this cluster:

  • Each attempt takes 10 minutes to complete.
  • This usage results in roughly 20 thousand jobs a day with approximately 5 million total attempts.
  • For a retention period of 7 days, this Activity Monitor database required 200 GB.
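The figures in this example can be reproduced with simple arithmetic. The following Python sketch is illustrative only: it assumes every slot is busy around the clock, and the linear extrapolation to a different retention period is an assumption, not a measured result. The function names are hypothetical.

```python
# Rough reproduction of the test-cluster numbers above.
HOSTS = 1000
SLOTS_PER_HOST = 32
ATTEMPT_MINUTES = 10
ATTEMPTS_PER_JOB = 200


def daily_mapreduce_load(hosts, slots_per_host, attempt_minutes, attempts_per_job):
    """Return (attempts_per_day, jobs_per_day), assuming every slot is always busy."""
    attempts_per_slot_per_day = (24 * 60) // attempt_minutes
    attempts_per_day = hosts * slots_per_host * attempts_per_slot_per_day
    jobs_per_day = attempts_per_day // attempts_per_job
    return attempts_per_day, jobs_per_day


def scale_storage_gb(observed_gb, observed_days, target_days):
    """Linearly extrapolate database size to another retention period (assumption)."""
    return observed_gb * target_days / observed_days


attempts, jobs = daily_mapreduce_load(
    HOSTS, SLOTS_PER_HOST, ATTEMPT_MINUTES, ATTEMPTS_PER_JOB
)
print(attempts, jobs)                # 4608000 23040
print(scale_storage_gb(200, 7, 14))  # 400.0
```

The computed values (about 4.6 million attempts and 23 thousand jobs a day) agree with the rounded figures in the observations. If growth is roughly linear, the default 14-day retention would need about twice the 200 GB observed for 7 days.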
Cloudera Management Service - Service Monitor Configuration
Configuration Topic Service Monitor Configuration
Default Storage Location /var/lib/cloudera-service-monitor/ on the host where the Service Monitor role is configured to run.
Storage Configuration Defaults / Minimum / Maximum
  • 10 GiB Services Time Series Storage
  • 1 GiB Impala Query Storage
  • 1 GiB YARN Application Storage

Total: ~12 GiB Minimum (No Maximum)

Where to Control Data Retention or Size

Service Monitor data growth is controlled by configuring the maximum amount of storage space it can use.

To configure data retention in Cloudera Manager Administration Console:

  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Service Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate each of the following properties or search for it by typing its name in the Search box:
    Time-Series Storage

    The approximate amount of disk space dedicated to storing time series and health data. When the store has reached its maximum size, it deletes older data to make room for newer data. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    Note that Cloudera Manager stores time-series data at a number of different data granularities, and these granularities have different effective retention periods. The Service Monitor stores metric data not only as raw data points but also as ten-minute, hourly, six-hourly, daily, and weekly summary data points. Raw data consumes the bulk of the allocated storage space and weekly summaries consume the least. Raw data is retained for the shortest amount of time while weekly summary points are unlikely to ever be deleted.

    Select Cloudera Management Service > Charts Library tab in Cloudera Manager for information about how space is consumed within the Service Monitor. These pre-built charts also show information about the amount of data retained and time window covered by each data granularity.

    Impala Storage

    The approximate amount of disk space dedicated to storing Impala query data. When the store reaches its maximum size, it deletes older data to make room for newer queries. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    YARN Storage

    The approximate amount of disk space dedicated to storing YARN application data. When the store reaches its maximum size, it deletes older data to make room for newer applications. The disk usage is approximate because Cloudera Manager only begins deleting data when it reaches the limit.

  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices The Service Monitor gathers metrics about configured roles and services in your cluster and also runs active health tests. These health tests run regardless of idle and use periods, because they are always relevant. The Service Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow, even in an idle cluster.
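To see why summary granularities consume far less space than raw data, consider the following Python sketch. It is not the Service Monitor's implementation. It only illustrates the general idea of rolling raw (timestamp, value) points up into fixed windows; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def roll_up(points, window_seconds):
    """Summarize (timestamp, value) pairs into fixed-size time windows.

    Returns {window_start: (min, max, mean, count)}.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {
        start: (min(vals), max(vals), mean(vals), len(vals))
        for start, vals in sorted(buckets.items())
    }


# Hypothetical raw samples: one point per minute for 20 minutes.
raw = [(60 * i, float(i)) for i in range(20)]
ten_minute = roll_up(raw, 600)
print(len(raw), "raw points ->", len(ten_minute), "ten-minute summaries")
```

Twenty raw points collapse into two ten-minute summaries here. Coarser windows (hourly, daily, weekly) shrink the data further, which is consistent with weekly summaries consuming the least space and being retained the longest.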
Cloudera Management Service - Host Monitor
Configuration Topic Host Monitor Configuration
Default Storage Location /var/lib/cloudera-host-monitor/ on the host where the Host Monitor role is configured to run.
Storage Configuration Defaults / Minimum / Maximum Default (and minimum): 10 GiB Host Time Series Storage
Where to Control Data Retention or Size Host Monitor data growth is controlled by configuring the maximum amount of storage space it can use.

See Data Storage for Monitoring Data.

To configure these data retention configuration properties in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Host Monitor or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate each property or search for it by typing its name in the Search box.
    Time-Series Storage

    The approximate amount of disk space dedicated to storing time series and health data. When the store reaches its maximum size, it deletes older data to make room for newer data. The disk usage is approximate because the store only begins deleting data when it reaches the limit.

    Note that Cloudera Manager stores time-series data at a number of different data granularities, and these granularities have different effective retention periods. Host Monitor stores metric data not only as raw data points but also as summaries of ten minute, one hour, six hour, one day, and one week increments. Raw data consumes the bulk of the allocated storage space and weekly summaries consume the least. Raw data is retained for the shortest amount of time, while weekly summary points are unlikely to ever be deleted.

    See the Cloudera Management Service > Charts Library tab in Cloudera Manager for information on how space is consumed within the Host Monitor. These pre-built charts also show information about the amount of data retained and the time window covered by each data granularity.

  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices The Host Monitor gathers metrics about host-level items of interest (for example, disk space usage, RAM, CPU usage, and swapping) and also informs host health tests. The Host Monitor gathers metrics and health test results regardless of the level of activity in the cluster. This data continues to grow fairly linearly, even in an idle cluster.
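Before assigning the Service Monitor or Host Monitor role to a host, you might want to confirm that the filesystem holding its default directory has at least the minimum space listed in the tables above. The following Python sketch shows one way to do that with the standard library. The helper names are hypothetical, and comparing the minimum against free space is a simplifying assumption suited to a new host where the store is still empty.

```python
import shutil

GIB = 1024 ** 3

# Default locations and minimum sizes taken from the tables above.
MINIMUMS = {
    "/var/lib/cloudera-service-monitor": 12 * GIB,
    "/var/lib/cloudera-host-monitor": 10 * GIB,
}


def shortfall_bytes(path, required_bytes, usage_fn=shutil.disk_usage):
    """Return how many bytes short the filesystem at `path` is (0 if enough)."""
    free = usage_fn(path).free
    return max(0, required_bytes - free)


def report(minimums, usage_fn=shutil.disk_usage):
    """Return {path: shortfall} for existing paths that lack the required space."""
    problems = {}
    for path, required in minimums.items():
        try:
            missing = shortfall_bytes(path, required, usage_fn)
        except FileNotFoundError:
            continue  # Directory is absent, so the role is not on this host.
        if missing:
            problems[path] = missing
    return problems
```

Calling report(MINIMUMS) on a host returns an empty dictionary when every existing directory has enough free space. The usage_fn parameter exists only so that the check can be exercised without touching a real filesystem.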
Cloudera Management Service - Event Server
Configuration Topic Event Server Configuration
Default Storage Location /var/lib/cloudera-scm-eventserver/ on the host where the Event Server role is configured to run.
Storage Configuration Defaults 5,000,000 events retained
Where to Control Data Retention or Minimum / Maximum

The amount of storage space the Event Server uses is influenced by configuring how many discrete events it can retain.

To configure data retention in the Cloudera Manager Administration Console:
  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Event Server or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Edit the following property:
    Maximum Number of Events in the Event Server Store

    The maximum size of the Event Server store, in events. When this size is exceeded, events are deleted, starting with the oldest, until the size of the store is below this threshold.

  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices

The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as results of health tests and log events that are created when a log entry matches a set of rules for identifying messages of interest. The Event Server makes these events available for searching, filtering, and additional action. You can view and filter events on the Diagnostics > Events tab of the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API.
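The count-based retention described for the Maximum Number of Events in the Event Server Store property can be modeled with a bounded queue. The following Python sketch is only a conceptual model of "delete the oldest events first". The real Event Server is a Lucene index, and the class name here is hypothetical.

```python
from collections import deque


class BoundedEventStore:
    """Keeps at most `max_events` events, discarding the oldest first."""

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)

    def add(self, event):
        # A full deque with maxlen drops its leftmost (oldest) item on append.
        self._events.append(event)

    def events(self):
        return list(self._events)


store = BoundedEventStore(max_events=3)
for n in range(5):
    store.add(f"event-{n}")
print(store.events())  # ['event-2', 'event-3', 'event-4']
```

Because retention is bounded by event count rather than by bytes, the disk space consumed still depends on the size of the individual events.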

Cloudera Management Service - Reports Manager
Configuration Topic Reports Manager Configuration
Default Storage Location RDBMS:

Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases.

Disk:

/var/lib/cloudera-scm-headlamp/ on the host where the Reports Manager role is configured to run.

Storage Configuration Defaults

RDBMS:

There are no configurable parameters to directly control the size of this data set.

Disk:

There are no configurable parameters to directly control the size of this data set. The storage utilization depends not only on the size of the HDFS fsimage, but also on the HDFS file path complexity. Longer file paths contribute to more space utilization.

Where to Control Data Retention or Minimum / Maximum

The Reports Manager uses space in two main locations: on the Reports Manager host and on its supporting database. Cloudera recommends that the database be on a separate host from the Reports Manager host for process isolation and performance.

Sizing, Planning, and Best Practices Reports Manager downloads the fsimage from the NameNode (every 60 minutes by default) and stores it locally to perform operations against, including indexing the HDFS filesystem structure. More files and directories result in a larger fsimage, which consumes more disk space.

Reports Manager has no control over the size of the fsimage. If your total HDFS usage trends upward notably or you add excessively long paths in HDFS, it might be necessary to revisit and adjust the amount of local storage allocated to the Reports Manager. Periodically monitor, review, and adjust the local storage allocation.

Cloudera Navigator

Cloudera Navigator - Navigator Audit Server
Configuration Topic Navigator Audit Server Configuration
Default Storage Location Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases.
Storage Configuration Defaults Default: 90 Days retention
Where to Control Data Retention or Min/Max Navigator Audit Server storage usage is controlled by configuring how many days of data it can retain. Any older data is purged.

To configure data retention in the Cloudera Manager Administration Console:

  1. Go to the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Navigator Audit Server or Cloudera Management Service (Service-Wide).
  4. Select Category > Main.
  5. Locate the Navigator Audit Server Data Expiration Period property or search for it by typing its name in the Search box.
    Navigator Audit Server Data Expiration Period
    In Navigator Audit Server, purge audit data of various auditable services when the data reaches this age in days. By default, Navigator Audit Server keeps data about audits for 90 days.
  6. Click Save Changes to commit the changes.
Sizing, Planning, and Best Practices The size of the Navigator Audit Server database directly depends on the number of audit events the cluster’s audited services generate. Normally the volume of HDFS audits exceeds the volume of other audits (all other components like MRv1, Hive and Impala read from HDFS, which generates additional audit events).

The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster of 50 hosts with ~100K audit events generated per hour, the Navigator Audit Server database would consume ~2.5 GB per day. To retain 90 days of audits at that level, plan for a database size of around 250 GB. If other configured cluster services generate roughly the same amount of data as the HDFS audits, plan for the Navigator Audit Server database to require around 500 GB of storage for 90 days of data.
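The estimate above can be recomputed for your own event rate. The following Python sketch uses decimal units (1 GB = 1,000,000 KB). The function name is hypothetical, and the ~1 KB average event size comes from the paragraph above.

```python
def audit_db_gb(events_per_hour, avg_event_kb, retention_days):
    """Estimate audit database size in GB (1 GB = 1,000,000 KB)."""
    kb_per_day = events_per_hour * 24 * avg_event_kb
    return kb_per_day * retention_days / 1_000_000


per_day = audit_db_gb(100_000, 1, 1)
ninety_days = audit_db_gb(100_000, 1, 90)
print(per_day, ninety_days)  # 2.4 216.0
```

The raw arithmetic yields 2.4 GB per day and 216 GB for 90 days. The planning figures of ~2.5 GB and around 250 GB round these values up, which leaves some headroom.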

Notes:

  • Individual Hive and Impala queries themselves can be very large. Since the query itself is part of an audit event, such audit events consume space in proportion to the length of the query.
  • The amount of space required increases as activity on the cluster increases. In some cases, Navigator Audit Server databases can exceed 1 TB for 90 days of audit events. Benchmark your cluster periodically and adjust accordingly.

To map Cloudera Navigator versions to Cloudera Manager versions, see Product Compatibility Matrix for Cloudera Navigator.

Cloudera Navigator - Navigator Metadata Server
Configuration Topic Navigator Metadata Server Configuration
Default Storage Location

RDBMS:

Any Supported RDBMS. For more information, see CDH and Cloudera Manager Supported Databases.

Disk:

/var/lib/cloudera-scm-navigator/ on the host where the Navigator Metadata Server role is configured to run.

Storage Configuration Defaults

RDBMS:

There are no exposed defaults or configurations to directly cull or purge the size of this data set.

Disk:

There are no configuration defaults to influence the size of this location. You can change the location itself with the Navigator Metadata Server Storage Dir property. The size of the data in this location depends on the amount of metadata in the system (HDFS fsimage size, Hive Metastore size) and activity on the system (for example, the number of MapReduce jobs run and Hive queries executed).

Where to Control Data Retention or Min/Max

RDBMS:

The Navigator Metadata Server database should be carefully tuned to support large volumes of metadata.

Disk:

The Navigator Metadata Server index (an embedded Solr instance) can consume lots of disk space at the location specified for the Navigator Metadata Server Storage Dir property. Ongoing maintenance tasks include purging metadata from the system.

Sizing, Planning, and Best Practices

Memory:

See Navigator Metadata Server Tuning.

RDBMS:

The database is used to store policies and authorization data. The dataset is small, but this database is also used during a Solr schema upgrade, where Solr documents are extracted and then inserted again into Solr. This has the same space requirements as the above use case, but the space is used only temporarily during product upgrades.

Use the Product Compatibility Matrix for Cloudera Navigator to map Cloudera Navigator and Cloudera Manager versions.

Disk:

This filesystem location contains all the metadata that is extracted from managed clusters. The data is stored in Solr, so this is the location where Solr stores its index and documents. Depending on the size of the cluster, this data can occupy tens of gigabytes. A guideline is to look at the size of the HDFS fsimage and allocate two to three times that size as the initial size. The data here is incremental and continues to grow as activity is performed on the cluster. The rate of growth can be on the order of tens of megabytes per day.
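The guideline above can be turned into a rough planning number. The following Python sketch combines the initial allocation (a multiple of the fsimage size) with linear daily growth. The function name and the input values are hypothetical; substitute measurements from your own cluster.

```python
def metadata_disk_estimate_gb(fsimage_gb, daily_growth_mb, days, multiplier=3):
    """Initial allocation (multiplier x fsimage) plus linear growth over `days`."""
    initial = fsimage_gb * multiplier
    growth = daily_growth_mb * days / 1000
    return initial + growth


# Hypothetical inputs: 5 GB fsimage, 50 MB/day growth, one-year horizon.
print(metadata_disk_estimate_gb(5, 50, 365))  # 33.25
```

In this hypothetical case, a year of growth at 50 MB per day adds more space than the initial allocation itself, which is why periodic review of this location matters.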

General Performance Notes

When possible:

  • For entities that use an RDBMS, install the database on a separate host from the service, and consolidate roles that use databases on as few servers as possible.

  • Provide a dedicated spindle to the RDBMS or datastore data directory to avoid disk contention with other read/write activity.

Cluster Lifecycle Management with Cloudera Manager

Cloudera Manager clusters that use parcels to provide CDH and other components require adequate disk space in the following locations:

Parcel Lifecycle Management
Parcel Lifecycle Path (default) Notes
Local Parcel Repository Path (/opt/cloudera/parcel-repo)

This path exists only on the host where Cloudera Manager Server (cloudera-scm-server) runs. The Cloudera Manager Server stages all new parcels in this location as it fetches them from any external repositories. Cloudera Manager Agents are then instructed to fetch the parcels from this location when the administrator distributes the parcel using the Cloudera Manager Administration Console or the Cloudera Manager API.

Sizing and Planning

The default location is /opt/cloudera/parcel-repo but you can configure another local filesystem location on the host where Cloudera Manager Server runs. See Parcel Configuration Settings.

Provide sufficient space to hold all the parcels you download from all configured Remote Parcel Repository URLs (See Parcel Configuration Settings). Cloudera Manager deployments that manage multiple clusters store all applicable parcels for all clusters.

Parcels are provided for each operating system, so be aware that heterogeneous clusters (distinct operating systems represented in the cluster) require more space than clusters with homogeneous operating systems.

For example, a cluster with both RHEL6.x and 7.x hosts must hold -el6 and -el7 parcels in the Local Parcel Repository Path, which requires twice the amount of space.
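As a rough planning aid, the space needed in the Local Parcel Repository Path scales with the number of distinct operating system families. The following Python sketch makes that explicit. The function name is hypothetical, and the sizes assume approximately 2 GB per CDH parcel, the typical figure given elsewhere in this topic.

```python
def parcel_repo_gb(parcel_sizes_gb, os_families):
    """Space for one copy of every parcel per distinct OS family."""
    return sum(parcel_sizes_gb) * len(set(os_families))


# Hypothetical: two staged CDH versions at ~2 GB each, on el6 and el7 hosts.
print(parcel_repo_gb([2, 2], ["el6", "el7", "el7"]))  # 8
```

The same two parcels on a homogeneous cluster would need about 4 GB, so the mixed el6 and el7 cluster needs twice the space.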

Lifecycle Management and Best Practices

Delete any parcels that are no longer in use from the Cloudera Manager Administration Console (never delete them manually from the command line). Doing so recovers disk space in the Local Parcel Repository Path and, simultaneously, on all managed cluster hosts that hold the parcel.

Backup Considerations

Perform regular backups of this path, and consider it a non-optional accessory to backing up Cloudera Manager Server. If you migrate Cloudera Manager Server to a new host or restore it from a backup (for example, after a hardware failure), recover the full content of this path to the /opt/cloudera/parcel-repo directory on the new host before starting any cloudera-scm-agent or cloudera-scm-server processes.

Parcel Cache (/opt/cloudera/parcel-cache)

Managed Hosts running a Cloudera Manager Agent stage distributed parcels into this path (as .parcel files, unextracted). Do not manually manipulate this directory or its files.

Sizing and Planning

Provide sufficient space per-host to hold all the parcels you distribute to each host.

You can configure Cloudera Manager to remove these cached .parcel files after they are extracted and placed in /opt/cloudera/parcels/. It is not mandatory to keep these temporary files but keeping them avoids the need to transfer the .parcel file from the Cloudera Manager Server repository should you need to extract the parcel again for any reason.

To configure this behavior in the Cloudera Manager Administration Console, select Administration > Settings > Parcels > Retain Downloaded Parcel Files

Host Parcel Directory (/opt/cloudera/parcels)

Managed cluster hosts running a Cloudera Manager Agent extract parcels from the /opt/cloudera/parcel-cache directory into this path upon parcel activation. Many critical system symlinks point to files in this path and you should never manually manipulate its contents.

Sizing and Planning

Provide sufficient space on each host to hold all the parcels you distribute to each host. Be aware that the typical CDH parcel size is approximately 2 GB per parcel, and some third party parcels can exceed 3 GB. If you maintain various versions of parcels staged before and after upgrading, be aware of the disk space implications.

You can configure Cloudera Manager to automatically remove older parcels when they are no longer in use. As an administrator you can always manually delete parcel versions not in use, but configuring these settings can handle the deletion automatically, in case you forget.

To configure this behavior in the Cloudera Manager Administration Console, select Administration > Settings > Parcels and configure the following property:

Automatically Remove Old Parcels

This parameter controls whether parcels for old versions of an activated product should be removed from a cluster when they are no longer in use.

The default value is Disabled.

Number of Old Parcel Versions to Retain

If you enable Automatically Remove Old Parcels, this setting specifies the number of old parcels to keep. Any old parcels beyond this value are removed. If this property is set to zero, no old parcels are retained.

The default value is 3.
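The interaction between these two properties can be illustrated with a short Python sketch. This is not Cloudera Manager's implementation. It assumes a hypothetical list of parcel versions ordered oldest to newest and treats every version other than the active one as an unused old parcel.

```python
def parcels_to_remove(versions, active, keep_old):
    """Return old versions to delete, oldest first.

    `versions` is ordered oldest to newest; `active` is never removed.
    """
    old = [v for v in versions if v != active]
    if keep_old <= 0:
        return old
    return old[:-keep_old] if len(old) > keep_old else []


versions = ["5.10.0", "5.11.0", "5.12.0", "5.13.0", "5.14.0"]
print(parcels_to_remove(versions, active="5.14.0", keep_old=3))  # ['5.10.0']
print(parcels_to_remove(versions, active="5.14.0", keep_old=0))
```

With the default of 3, only the oldest of the four old versions is removed. With a value of zero, all four old versions are removed and only the active parcel remains.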

Management Service Lifecycle - Space Reclamation Tasks
Task Description
Activity Monitor (One-time)

The Activity Monitor only works against a MapReduce (MR1) service, not YARN. If your deployment has fully migrated to YARN and no longer uses a MapReduce (MR1) service, your Activity Monitor database is no longer growing. If more time has passed than the default Activity Monitor retention period (14 days), the Activity Monitor has already purged the old data and the database is mostly empty. If your deployment meets these conditions, consider cleaning up by dropping the Activity Monitor database and removing the Activity Monitor role, but only when you are satisfied that you no longer need the data or have confirmed that it is no longer in use.

Service Monitor and Host Monitor (One-time)

For those who used Cloudera Manager version 4.x and have now upgraded to version 5.x: The Service Monitor and Host Monitor were migrated from their previously-configured RDBMS into a dedicated time series store used solely by each of these roles respectively. After this happens, there is still legacy database connection information in the configuration for these roles. This was used to allow for the initial migration but is no longer being used for any active work.

After the above migration has taken place, the RDBMS databases previously used by the Service Monitor and Host Monitor are no longer used. Space occupied by these databases is now recoverable. If appropriate in your environment (and you are satisfied that you have long-term backups or do not need the data on disk any longer), you can drop those databases.

Ongoing Space Reclamation

Cloudera Management Services automatically roll up, purge, or otherwise consolidate aged data in the background. Configure retention and purging limits per role to control how and when this occurs. These configurations are discussed per entity above. Adjust the default configurations to meet your space limitations or retention needs.

Log Files

All CDH cluster hosts write out separate log files for each role instance assigned to the host. Cluster administrators can monitor and manage the disk space used by these roles and configure log rotation to prevent log files from consuming too much disk space.

For more information, see Managing Disk Space for Log Files.

Conclusion

Keep this information in mind for planning and architecting the deployment of a cluster managed by Cloudera Manager. If you already have a live cluster, this lifecycle and backup information can help you keep critical monitoring, auditing, and metadata sources safe and properly backed up.