Managing Metadata Storage with Purge

The volume of metadata maintained by Navigator Metadata Server can grow quickly and exceed the capacity of the Solr instance that processes the index, which can slow down search results and the display of data lineage. In addition, stale metadata may show relationships that no longer exist, and lineage may take longer to display than necessary as the system processes extraneous details.

Cloudera Navigator's purge function removes metadata for files that have been deleted or for operations that are older than the specified timeframe. The result is faster search and more precise (up-to-date) lineage diagrams.

In addition, purging before upgrading Cloudera Navigator to a new release can speed up the upgrade process and reduce the chance of out-of-memory errors.

The purge function can be used in different ways:

Scheduling the Purge Process

Use the Cloudera Navigator console to configure a schedule for a regular weekly purge of deleted and stale metadata from the Navigator Metadata Server and its associated database.

To configure the automated purge schedule:
  1. Log in to the Cloudera Navigator console using an account with Full Administrator privileges.
  2. Go to the Administration > Purge Settings tab.

    The current Metadata and Lineage purge schedule displays, along with a list of up to five upcoming scheduled purges and a list of up to five most recently completed purges.



To change the existing schedule:
  1. Click Edit.
  2. Set the purge process options.
    Option | Default | Range of selectable values and usage note
    How often | Weekly | Not configurable. The purge runs weekly on the Day and Time you specify. It is enabled by default.
    Day | Saturday | Select a day for the purge that will have minimal impact on your user community.
    Time | 12 Midnight | Hourly increments, from 12 Midnight through 11 PM. Select a time that will have minimal impact on production.
    Maximum purge duration | 12 hours | Set the amount of time to allow the purge process to run. If the purge is not complete after this duration, the HDFS purge process stops adding new items to purge. Entities purged to that point remain purged. All non-HDFS purge processes run without limit. If set to 0, the purge is disabled. No other Cloudera Navigator operations, including those through the console, can occur during the purge process.
    Purge HDFS entities deleted more than* | 60 days | The number of days that must elapse after an entity is deleted before the purge process removes its metadata. For example, a setting of 1 day purges entities deleted before two days ago but retains entities deleted yesterday.
    Purge SELECT operations* | Enabled | When enabled, Hive and Impala SELECT operations older than the number of days specified in Purge operations older than are purged.
    Purge operations older than* | 60 days | YARN, Sqoop, and Pig operations older than the specified number of days are purged. If Purge SELECT operations is enabled, Hive and Impala SELECT operations older than the specified number of days are also purged.
  3. Click Save when finished.
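To make the weekly schedule concrete, the following is a minimal Python sketch that computes the next five purge start times from the Day and Time options. It is an illustration only, not Navigator's implementation: the function name `upcoming_purges`, its parameters, and the reading of "12 Midnight" as 00:00 at the start of the selected day are all assumptions, and time zones are ignored.

```python
from datetime import datetime, timedelta

# Weekday names in the order used by datetime.weekday() (Monday == 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def upcoming_purges(now, day="Saturday", hour=0, count=5):
    """Return the next `count` weekly purge start times strictly after `now`.

    Hypothetical helper: `day` and `hour` mirror the Day and Time options
    (defaults: Saturday, 12 Midnight, assumed here to mean 00:00).
    """
    target = WEEKDAYS.index(day)
    # Days until the configured weekday (0 if it is today).
    days_ahead = (target - now.weekday()) % 7
    first = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0)
    # If this week's slot has already passed, move to next week.
    if first <= now:
        first += timedelta(weeks=1)
    return [first + timedelta(weeks=i) for i in range(count)]


if __name__ == "__main__":
    # Wednesday, 3 January 2024, 09:30 as an example "current time".
    for start in upcoming_purges(datetime(2024, 1, 3, 9, 30)):
        print(start.isoformat())
```

With the default Maximum purge duration of 12 hours, each run in this sketch would be allowed to continue until 12 hours after its start time.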

Here is an example of a revised schedule:



What Metadata is Purged?

Purge processes look for metadata that is associated with deleted files and tables and with operation executions that are older than the configured threshold date.

Hive Metadata
  • Hive operations
    • That don't produce output
    • That have all operation executions that were executed earlier than the threshold date
  • Hive operation executions
    • Associated with Hive operations that don't produce output
    • That were executed earlier than the threshold date
  • Hive sub-operations
    • Associated with Hive operations that were purged
  • All relations associated with the purged entities
Impala Metadata
  • Impala operations
    • That don't produce output
    • That have all operation executions that were executed earlier than the threshold date
  • Impala operation executions
    • Associated with Impala operations that don't produce output
    • That were executed earlier than the threshold date
  • Impala sub-operations
    • Associated with Impala operations that were purged
  • All relations associated with the purged entities
Sqoop Metadata
  • Sqoop import and export operations
    • That have all operation executions that were executed earlier than the threshold date and no existing downstream entities
  • Sqoop operation executions
    • That were executed earlier than the threshold date
  • All relations associated with the purged entities
YARN Metadata
  • YARN operations
    • That have all operation executions that were executed earlier than the threshold date
  • YARN operation executions
    • That were executed earlier than the threshold date
  • All relations associated with the purged entities
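The YARN rules above can be sketched as two small predicates. This is a hedged illustration, not Navigator code: the `Operation` class and the function names are hypothetical, the threshold date is assumed to be the current time minus the configured number of days, and an operation with no recorded executions is assumed to be retained because the text does not cover that case.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Operation:
    """Hypothetical stand-in for a YARN operation and its executions."""
    name: str
    executions: List[datetime] = field(default_factory=list)  # start times


def purge_threshold(now, days=60):
    # Assumption: the threshold date is `days` before the current time.
    return now - timedelta(days=days)


def purgeable_executions(op, threshold):
    """Executions that were executed earlier than the threshold date."""
    return [t for t in op.executions if t < threshold]


def operation_is_purgeable(op, threshold):
    """True when every execution of the operation predates the threshold.

    Assumption: an operation with no executions is kept.
    """
    return bool(op.executions) and all(t < threshold for t in op.executions)


if __name__ == "__main__":
    threshold = purge_threshold(datetime(2024, 3, 1))
    mixed = Operation("mixed", [datetime(2023, 12, 1), datetime(2024, 2, 1)])
    print(operation_is_purgeable(mixed, threshold))
    print(purgeable_executions(mixed, threshold))
```

In this sketch an operation with one recent execution is retained, while its older executions are still eligible for purging individually.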
Pig Metadata
  • Pig operations
    • That have all operation executions that were executed earlier than the threshold date and don't apply to tables connected to existing HDFS files
  • Pig operation executions
    • That were executed earlier than the threshold date
  • Pig tables
    • That were created by an operation execution executed earlier than the threshold date and that are not connected to an existing HDFS file
  • Pig fields
    • Fields in purged tables
  • All relations associated with the purged entities
HDFS Metadata
  • HDFS directories
    • That have been deleted longer than the configured threshold
    • AND don't have a logical-physical relation with another entity (such as a Hive table)
    • AND don't have children (sub directories or files) that aren't ready to be purged
  • HDFS files
    • Metadata for deleted files is purged only when the containing directory is purged
  • All relations that have both endpoints associated with purged entities
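The three HDFS directory conditions combine into a recursive check, which the following minimal Python sketch illustrates. It is not Navigator's implementation: the `HdfsEntity` class, its field names, and `ready_to_purge` are hypothetical, and two points are assumptions because the text does not state them: a file counts as ready once it has been deleted longer than the threshold, and "longer than" is measured as elapsed time rather than calendar days.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HdfsEntity:
    """Hypothetical model of an HDFS file or directory in the metadata store."""
    path: str
    is_dir: bool
    deleted_at: Optional[datetime] = None        # None: entity still exists
    has_logical_physical_relation: bool = False  # e.g. backs a Hive table
    children: List["HdfsEntity"] = field(default_factory=list)


def deleted_long_enough(entity, now, days):
    # Assumption: elapsed time since deletion must exceed the threshold.
    return (entity.deleted_at is not None
            and now - entity.deleted_at > timedelta(days=days))


def ready_to_purge(entity, now, days=60):
    """Apply the directory rules; files only report their own readiness."""
    if not deleted_long_enough(entity, now, days):
        return False
    if not entity.is_dir:
        # A ready file is still removed only together with its directory.
        return True
    if entity.has_logical_physical_relation:
        return False
    # Every child (subdirectory or file) must itself be ready.
    return all(ready_to_purge(child, now, days) for child in entity.children)


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    recent = HdfsEntity("/data/a/new.csv", False, datetime(2024, 5, 20))
    blocked = HdfsEntity("/data/a", True, datetime(2024, 1, 1), children=[recent])
    print(ready_to_purge(blocked, now))
```

Because file metadata is purged only with its containing directory, a caller would invoke `ready_to_purge` on directories and remove a directory's files along with it.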