Managing Metadata

This topic describes tasks for enabling and disabling metadata extraction and purging obsolete metadata.

Continue reading:

- Enabling and Disabling Metadata Extraction
  - Enabling Hive Metadata Extraction in a Secure Cluster
  - Enabling Spark Metadata Extraction
- Managing Metadata Capacity
- Configuring Display of Inputs and Outputs

Enabling and Disabling Metadata Extraction

Minimum Required Role: Navigator Administrator (also provided by Full Administrator)

Enabling Hive Metadata Extraction in a Secure Cluster

The Navigator Metadata Server uses the hue user to connect to the Hive Metastore, and by default this connection is allowed. However, if either the Hive service property Hive Metastore Access Control and Proxy User Groups Override or the HDFS service property Hive Proxy User Groups has been changed from its default value to a setting that prevents the hue user from connecting to the Hive Metastore, Navigator Metadata Server cannot extract metadata from Apache Hive. In that case, modify the affected property as follows:

1. Go to the Hive or HDFS service.
2. Click the Configuration tab.
3. In the Search box, type proxy.
4. In the Hive service Hive Metastore Access Control and Proxy User Groups Override or the HDFS service Hive Proxy User Groups property, click to add a new row.
   If more than one role group applies to this configuration, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
5. Type hue.
6. Click Save Changes to commit the changes.
7. Restart the service.

Enabling Spark Metadata Extraction

Lineage diagrams for Spark are supported in CDH 5.11 and higher, with some restrictions as listed in Restrictions on Lineage for Spark. By default, Spark metadata extraction is enabled. To control the status of Spark metadata extraction:

1. Search for the configuration setting config.navigator.lineage_enabled.
2. Check or uncheck the checkbox as appropriate.
3. In Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties, remove any setting for the property:
   ```
   nav.spark.extraction.enable
   ```
   This prior method of enabling metadata extraction for Spark is now deprecated.
4. Click Save Changes to commit the changes.
5. Restart the role.

Managing Metadata Capacity

Minimum Required Role: Full Administrator

The metadata maintained by Navigator Metadata Server can grow rapidly and exceed the capacity of the Solr instance that stores it. The purge feature of Navigator Metadata Server lets you delete unwanted metadata, which improves performance and reduces noise in search results and lineage diagrams. Currently, purge is available only through the Metadata Server API.

Purging Metadata for HDFS Entities, Hive and Impala Select Queries, and YARN, Sqoop, and Pig Operations

You can delete metadata for HDFS entities, Hive and Impala select queries, and YARN, Sqoop, and Pig operations by using the purge method. (Metadata for Hive tables is not deleted.) Purge is a long-running task that requires exclusive access to the Solr instance and does not allow any concurrent activities, including extraction.

To purge metadata, do the following:

Back up the Navigator Metadata Server storage directory.

Invoke the http://Navigator_Metadata_Server_host:port/api/v10/maintenance/purge endpoint with the following parameters:

Purge Parameters

| Metadata | Parameter | Description |
| --- | --- | --- |
| HDFS | `deleteTimeThresholdMinutes` | After an HDFS entity is deleted, the number of minutes that must pass before that entity can be purged. Default: 86400 minutes (60 days). |
| HDFS | `runtimeCapMinutes` | Number of minutes that the HDFS purge can run. When this limit is reached, the purge state is saved and the purge task terminates. You must run the purge again to purge any remaining entities. If you set the value to 0, no HDFS files or directories are purged. Default: 720 minutes (12 hours). |
| Hive and Impala Select Queries; YARN, Sqoop, Pig Operations | `deleteSelectOperations` | Boolean. If set to true, the purge deletes all Hive and Impala select queries, and YARN, Sqoop, and Pig operations, that are older than the number of days defined by the `staleQueryThresholdDays` value. Default: false. |
| Hive and Impala Select Queries; YARN, Sqoop, Pig Operations | `staleQueryThresholdDays` | For Hive and Impala select queries, and YARN, Sqoop, and Pig operations, the number of days they must be older than to be purged. To disable purge for Hive and Impala select queries, and for YARN, Sqoop, and Pig operations, set the threshold to a very large value, for example, 36500. Default: 60 days. |

For example, the following call purges the metadata of all deleted HDFS entities, regardless of how long ago they were deleted, because `deleteTimeThresholdMinutes` is set to 0:

curl -X POST -u admin:admin "http://Navigator_Metadata_Server_host:port/api/v10/maintenance/purge?deleteTimeThresholdMinutes=0"
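
When several purge parameters are combined, it can help to build the query string programmatically and to convert days and hours into the minute-based units the endpoint expects. The following Python sketch only constructs the URL and does not send a request. The function name `build_purge_url`, its argument names, and the host and port in `BASE` are illustrative assumptions; only the endpoint path and the four parameter names come from the table above.

```python
from urllib.parse import urlencode

MINUTES_PER_DAY = 24 * 60


def build_purge_url(base_url, delete_after_days=None, runtime_cap_hours=None,
                    delete_select_operations=None, stale_query_days=None):
    """Return the purge URL; arguments left as None are omitted so the
    server-side defaults described in the parameter table apply."""
    params = {}
    if delete_after_days is not None:
        # deleteTimeThresholdMinutes is expressed in minutes (60 days = 86400).
        params["deleteTimeThresholdMinutes"] = int(delete_after_days * MINUTES_PER_DAY)
    if runtime_cap_hours is not None:
        # runtimeCapMinutes is expressed in minutes (12 hours = 720).
        params["runtimeCapMinutes"] = int(runtime_cap_hours * 60)
    if delete_select_operations is not None:
        params["deleteSelectOperations"] = "true" if delete_select_operations else "false"
    if stale_query_days is not None:
        params["staleQueryThresholdDays"] = int(stale_query_days)
    url = base_url.rstrip("/") + "/api/v10/maintenance/purge"
    return url + ("?" + urlencode(params) if params else "")


# Hypothetical host and port; substitute your Navigator Metadata Server address.
BASE = "http://nms.example.com:7187"

# Spell out the documented HDFS defaults explicitly.
print(build_purge_url(BASE, delete_after_days=60, runtime_cap_hours=12))

# Purge all deleted HDFS entities and select operations older than 90 days.
print(build_purge_url(BASE, delete_after_days=0,
                      delete_select_operations=True, stale_query_days=90))
```

The resulting URL still has to be sent as an authenticated POST request, as in the curl command above.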

Purge tasks do not start until all currently running extraction tasks finish.

When all tasks have completed, click Continue to return to the Cloudera Navigator UI.

Retrieving Purge Status

To view the status of the purge process, invoke the http://Navigator_Metadata_Server_host:port/api/v10/maintenance/running endpoint. For example:

curl -X GET -u admin:admin "http://Navigator_Metadata_Server_host:port/api/v10/maintenance/running"

A result would look similar to:

[{
  "id" : 5,
  "type" : "PURGE",
  "startTime" : "2016-03-10T23:17:41.884Z",
  "endTime" : "1970-01-01T00:00:00.000Z",
  "status" : "IN_PROGRESS",
  "message" : "Purged 2661984 out of 4864714 directories. Averaging 1709 directories per minute.",
  "username" : "admin",
  "stage" : "HDFS_DIRECTORIES",
  "stagePercent" : 54
}]
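
The status response can also be used to estimate how long the current stage still needs. The Python sketch below parses the sample response shown above and divides the remaining item count by the reported per-minute rate. It assumes that the `message` field keeps the wording of the sample ("Purged X out of Y ... Averaging Z ... per minute"), which this document does not guarantee; the function name `estimate_remaining_minutes` is illustrative.

```python
import json
import re

# Sample response copied from the example above.
SAMPLE = """[{
  "id" : 5,
  "type" : "PURGE",
  "startTime" : "2016-03-10T23:17:41.884Z",
  "endTime" : "1970-01-01T00:00:00.000Z",
  "status" : "IN_PROGRESS",
  "message" : "Purged 2661984 out of 4864714 directories. Averaging 1709 directories per minute.",
  "username" : "admin",
  "stage" : "HDFS_DIRECTORIES",
  "stagePercent" : 54
}]"""

# Assumption: the message text follows the pattern seen in the sample output.
PROGRESS = re.compile(r"Purged (\d+) out of (\d+) \w+\. Averaging (\d+) \w+ per minute")


def estimate_remaining_minutes(task):
    """Return the estimated minutes left in the current stage, or None when
    the task is not in progress or the message cannot be parsed."""
    if task.get("status") != "IN_PROGRESS":
        return None
    match = PROGRESS.search(task.get("message") or "")
    if match is None:
        return None
    done, total, rate = (int(group) for group in match.groups())
    if rate == 0:
        return None
    return (total - done) / rate


for task in json.loads(SAMPLE):
    remaining = estimate_remaining_minutes(task)
    if remaining is not None:
        print(f"{task['stage']}: {task['stagePercent']}% done, "
              f"about {remaining:.0f} minutes left in this stage")
```

For the sample response, 2,202,730 directories remain at 1,709 directories per minute, which works out to roughly 1,289 minutes (about 21.5 hours) for this stage alone.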

Retrieving Purge History

To view the purge history, invoke the http://Navigator_Metadata_Server_host:port/api/v10/maintenance/history endpoint with the following parameters:

History Parameters

| Parameter | Description |
| --- | --- |
| `offset` | First purge history entry to retrieve. Default: 0. |
| `limit` | Number of history entries to retrieve from the offset. Default: 100. |

For example:

curl -X GET -u admin:admin "http://Navigator_Metadata_Server_host:port/api/v10/maintenance/history?offset=0&limit=100"

A result would look similar to:

[
  {
    "id": 1,
    "type": "PURGE",
    "startTime": "2016-03-09T18:57:43.196Z",
    "endTime": "2016-03-09T18:58:33.337Z",
    "status": "SUCCESS",
    "username": "admin",
    "stagePercent": 0
  },
  {
    "id": 2,
    "type": "PURGE",
    "startTime": "2016-03-09T19:47:39.401Z",
    "endTime": "2016-03-09T19:47:40.841Z",
    "status": "SUCCESS",
    "username": "admin",
    "stagePercent": 0
  },
  {
    "id": 3,
    "type": "PURGE",
    "startTime": "2016-03-10T01:27:39.632Z",
    "endTime": "2016-03-10T01:27:46.809Z",
    "status": "SUCCESS",
    "username": "admin",
    "stagePercent": 0
  },
  {
    "id": 4,
    "type": "PURGE",
    "startTime": "2016-03-10T01:57:40.461Z",
    "endTime": "2016-03-10T01:57:41.174Z",
    "status": "SUCCESS",
    "username": "admin",
    "stagePercent": 0
  },
  {
    "id": 5,
    "type": "PURGE",
    "startTime": "2016-03-10T23:17:41.884Z",
    "endTime": "2016-03-10T23:18:06.802Z",
    "status": "SUCCESS",
    "username": "admin",
    "stagePercent": 0
  }
]
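
The history entries carry start and end timestamps, so the run time of each purge can be derived from them. The following Python sketch computes durations for two entries copied from the sample result above. It assumes that timestamps always use the UTC format with milliseconds shown in the sample; the function name `purge_durations` is illustrative.

```python
from datetime import datetime

# Assumption: timestamps always look like 2016-03-09T18:57:43.196Z.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def purge_durations(history):
    """Map each history entry id to its run time in seconds."""
    durations = {}
    for entry in history:
        start = datetime.strptime(entry["startTime"], TIMESTAMP_FORMAT)
        end = datetime.strptime(entry["endTime"], TIMESTAMP_FORMAT)
        durations[entry["id"]] = (end - start).total_seconds()
    return durations


# Two entries copied from the sample history above (unused fields omitted).
history = [
    {"id": 1, "startTime": "2016-03-09T18:57:43.196Z",
     "endTime": "2016-03-09T18:58:33.337Z", "status": "SUCCESS"},
    {"id": 5, "startTime": "2016-03-10T23:17:41.884Z",
     "endTime": "2016-03-10T23:18:06.802Z", "status": "SUCCESS"},
]

for entry_id, seconds in purge_durations(history).items():
    print(f"purge {entry_id}: {seconds:.3f} s")
```

Comparing these durations with the `runtimeCapMinutes` value used for a run is one way to judge whether a purge probably finished or stopped at its time limit and needs to be run again.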

Configuring Display of Inputs and Outputs

The entity Details page displays entity type-specific information, including table inputs and operation inputs and outputs. See Displaying Entity Details. In some cases, displaying inputs and outputs can delay rendering of the Details page. By default, displaying inputs and outputs is disabled. You can configure the Navigator Metadata Server to display inputs and outputs by setting the nav.ui.details_io_enabled property to true as follows:

1. Do one of the following:
   - Select Clusters > Cloudera Management Service.
   - On the Home > Status tab, in the Cloudera Management Service table, click the Cloudera Management Service link.
2. Click the Configuration tab.
3. Select Scope > Navigator Metadata Server.
4. In Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties, set the property:
   ```
   nav.ui.details_io_enabled=true
   ```
   If more than one role group applies to this configuration, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
5. Click Save Changes to commit the changes.
6. Restart the role.
