Applying Metadata to HDFS and Hive Entities using the API

Using the Navigator API and JSON-formatted metadata definition files, you can assign properties to entities in bulk, even before the entities are extracted.

Metadata Definition Files

You can add tags and properties to HDFS entities using metadata files. With metadata files, you can assign metadata to entities in bulk and create metadata before it is extracted. A metadata file is a JSON file with the following structure:

{
  "name" : "name",
  "description" : "description",
  "properties" : {
    "key_name_1" : "value_1",
    "key_name_2" : "value_2"
  },
  "tags" : [ "tag_1" ]
}
To apply metadata to files and directories, create a metadata file with the extension .navigator and name it as follows:
  • File - The path of the metadata file must be .filename.navigator. For example, to apply properties to the file /user/test/file1.txt, the metadata file path is /user/test/.file1.txt.navigator.
  • Directory - The path of the metadata file must be dirpath/.navigator. For example, to apply properties to the directory /user, the metadata path must be /user/.navigator.
The metadata file is applied to the entity metadata when the extractor runs.
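
The naming rules above can be sketched in a few lines of Python. This is an illustrative helper, not part of Navigator: the function names navigator_metadata_path and metadata_document are hypothetical, and the script only computes the path and the JSON text. Uploading the result to HDFS is left to whatever tooling you already use.

```python
import json
import posixpath


def navigator_metadata_path(entity_path, is_directory):
    """Return the HDFS path of the metadata file for a file or directory."""
    if is_directory:
        # Directory rule: dirpath/.navigator
        return posixpath.join(entity_path, ".navigator")
    # File rule: .filename.navigator next to the file itself
    parent, filename = posixpath.split(entity_path)
    return posixpath.join(parent, "." + filename + ".navigator")


def metadata_document(name, description, properties, tags):
    """Serialize metadata using the JSON structure of a metadata file."""
    return json.dumps(
        {
            "name": name,
            "description": description,
            "properties": properties,
            "tags": tags,
        },
        indent=2,
    )


print(navigator_metadata_path("/user/test/file1.txt", is_directory=False))
print(navigator_metadata_path("/user", is_directory=True))
print(metadata_document("name", "description", {"key_name_1": "value_1"}, ["tag_1"]))
```

The first two calls reproduce the file and directory examples given in the list above.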

Applying HDFS and Hive Metadata

The Navigator APIs can be used to modify the metadata of HDFS or Hive entities, such as databases, tables, and operations, before or after entity extraction.

  • If an entity has been extracted when the API is called, the metadata is applied immediately.
  • If the entity has not been extracted, you can preregister metadata, which is then applied once the entity is extracted.

The Navigator SDK includes examples of pre-registering entities. One of these examples creates a Hive operation, which lets you see lineage for pre-registered entities.

Metadata is saved regardless of whether or not a matching entity is extracted. Cloudera Navigator does not perform any cleanup of unused metadata.

If you call the API before the entity is extracted, the custom metadata is stored with the entity's:
  • Identity
  • Source ID
  • Metadata fields (name, description, tags, properties)
  • Fields relevant to the identifier
Other fields (attributes) for the entity, such as Type, are not present.
To view all stored metadata, use the API to search for entities without an internal type, specifically the entities endpoint with the GET method:
http://fqdn-n.example.com:port/api/APIversion/entities/ 

where fqdn-n.example.com is the host running the Navigator Metadata Server role instance listening for HTTP connections at the specified port number (7187 is the default port number). APIversion is the running version of the API as indicated in the footer of the API documentation (available from the Help menu in the Navigator console) or by calling http://fqdn-n.example.com:port/api/version.

For example:

curl "http://node1.example.com:7187/api/v13/entities/?query=-internalType:*" \
-u username:password -X GET
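
As a rough sketch, the URL pattern described above could be assembled as follows. The helper name entities_url and its parameters are made up for illustration. The default port 7187 comes from the text, and percent-encoding the query is an assumption that keeps characters such as : and * from being misread by a shell or HTTP client.

```python
from urllib.parse import quote


def entities_url(host, api_version, query=None, port=7187, scheme="http"):
    """Build the entities endpoint URL, optionally with an encoded query."""
    base = f"{scheme}://{host}:{port}/api/{api_version}/entities/"
    if query is None:
        return base
    # safe="" percent-encodes every reserved character in the query text
    return base + "?query=" + quote(query, safe="")


# Search for stored metadata that has no internal type (hypothetical host).
print(entities_url("node1.example.com", "v13", "-internalType:*"))
```

The API version string (v13 here) should be replaced with the value reported by your own Navigator Metadata Server.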

User-defined metadata provided through the API overwrites existing metadata. For example, passing empty name and description fields with an empty array for tags and empty property dictionary with the API call removes the existing metadata. If you omit the tags or properties fields, the existing values remain unchanged. If you want to add a tag to a list of existing tags, you must include the existing tags in your update.
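
Because the API overwrites rather than appends, adding a tag means resending the tags the entity already has. The following sketch shows one way to prepare such a request body from an entity previously fetched with GET. The function name payload_adding_tag is hypothetical, and the body deliberately contains only the tags field. The text above states that omitted tags or properties are left unchanged; it does not say the same for name and description, so verify that behavior before relying on it.

```python
import json


def payload_adding_tag(existing_entity, new_tag):
    """Return a JSON body that keeps the current tags and adds new_tag."""
    # "tags" can be null in API responses, so fall back to an empty list.
    merged = list(existing_entity.get("tags") or [])
    if new_tag not in merged:
        merged.append(new_tag)
    # Assumption: only "tags" is sent, so other metadata is not touched.
    return json.dumps({"tags": merged})


entity = {"identity": "14", "tags": ["tag1", "tag2"]}
print(payload_adding_tag(entity, "ToBeReviewed"))
```

The resulting string can be passed as the -d argument of a PUT request for that entity.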

Modifying custom metadata using metadata files and the metadata API at the same time is not supported. You must use one or the other, because the two methods work differently.

Metadata specified in JSON files is merged with existing metadata, whereas the API overwrites metadata. Also, the updates provided by metadata files wait in a queue before being merged, but API changes are committed immediately. Some inconsistency can occur if a metadata file is merged when the API is in use.

Metadata is modified using either the PUT or POST method. Use the PUT method if the entity has been extracted, and the POST method to preregister metadata. Use the following syntax:
  • PUT
    curl http://fqdn-n.example.com:port/api/APIversion/entities/identity \
    -u username:password \
    -X PUT \
    -H "Content-Type: application/json" \
    -d '{properties}'
    where identity is an entity ID and properties are:
    • name - Name metadata.
    • description - Description metadata.
    • tags - Tag metadata.
    • properties - Custom metadata properties. The format is {key: value}.
    • customProperties - Managed metadata properties. The format is {namespace: {key: value}}. If a property is assigned a value that does not conform to type constraints, an error is returned.
    All existing naming rules apply, and if any value is invalid, the entire request is denied.
  • POST
    curl http://fqdn-n.example.com:port/api/APIversion/entities/ \
    -u username:password \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{properties}'
    where properties are:
    • sourceId (required) - An existing source ID. After the first extraction, you can retrieve source IDs using the call:
      curl "http://fqdn-n.example.com:port/api/APIversion/entities/?query=type:SOURCE" \
      -u username:password -X GET
      For example:
      [ ...
      {
      "identity": "61cfefd303d4284b7f5014b701f2c76d",
      "originalName": "source.listing",
      "originalDescription": null,
      "sourceId": "012437f9eeb3c23dc69e679ac94a7fa2",
      "firstClassParentId": null,
      "parentPath": "/user/hdfs/.cm/distcp/2016-02-03_487",
      ...
      "properties": {
      "__cloudera_internal__hueLink":
      "http://fqdn-2.example.com:8888/filebrowser/#/user/hdfs/.cm/distcp/2016-02-03_487/source.listing"
      },
      "technicalProperties": null,
      "fileSystemPath": "/user/hdfs/.cm/distcp/2016-02-03_487/source.listing",
      "type": "FILE",
      "size": 92682,
      "created": "2016-02-03T21:12:16.587Z",
      "lastModified": "2016-02-03T21:12:16.587Z",
      "lastAccessed": "2016-02-03T21:12:16.587Z",
      "permissions": "rw-r--r--",
      "owner": "hdfs",
      "group": "supergroup",
      "blockSize": 134217728,
      "mimeType": "application/octet-stream",
      "replication": 3,
      "userEntity": false,
      "deleted": false,
      "sourceType": "HDFS",
      "metaClassName": "fselement",
      "packageName": "nav",
      "internalType": "fselement"
      }, ...
      If you have multiple services of a given type, you must specify the source ID that contains the entity you expect it to match.
    • parentPath - The path of the parent entity, defined as:
      • HDFS file or directory - fileSystemPath of the parent directory. (Do not provide this field if the entity affected is the root directory.) Example parentPath for /user/admin/input_dir: /user/admin. If you add metadata to a directory, the metadata does not propagate to any files or folders in that directory.
      • Hive database - If you are updating database metadata, do not specify this field.
      • Hive table or view - The name of the database containing the table or view. Example for a table in the default database: default.
      • Hive column - The database name, a slash, and the name of the table or view that contains the column. Example for a column in the sample_07 table: default/sample_07.
    • originalName (required) - The name as defined by the source system.
      • HDFS file or directory - Name of the file or directory (ROOT if the entity is the root directory). Example originalName for /user/admin/input_dir: input_dir.
      • Hive database, table, view, or column - The name of the database, table, view, or column.
        • Example for default database: default
        • Example for sample_07 table: sample_07
    • identity
    • Metadata fields (name, description, tags, properties)
    • Fields relevant to the identifier
All existing naming rules apply, and if any value is invalid, the entire request is denied.
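
The parentPath and originalName rules above can be summarized in a short sketch. The helper names hdfs_identifier and hive_identifier are hypothetical and only derive the two identifying fields; the sourceId still has to be looked up from your own deployment and added to the body before sending a POST.

```python
import posixpath


def hdfs_identifier(fs_path):
    """Derive parentPath and originalName for an HDFS file or directory."""
    if fs_path == "/":
        # Root directory: no parentPath, and the name is the literal ROOT.
        return {"originalName": "ROOT"}
    parent, name = posixpath.split(fs_path.rstrip("/"))
    return {"parentPath": parent, "originalName": name}


def hive_identifier(database, table=None, column=None):
    """Derive parentPath and originalName for a Hive database, table, or column."""
    if table is None:
        # Database: parentPath is not specified.
        return {"originalName": database}
    if column is None:
        # Table or view: parentPath is the database name.
        return {"parentPath": database, "originalName": table}
    # Column: parentPath is database/table (or database/view).
    return {"parentPath": f"{database}/{table}", "originalName": column}


print(hdfs_identifier("/user/admin/input_dir"))
print(hive_identifier("default", "sample_07", "total_emp"))
```

The two calls correspond to the /user/admin/input_dir directory and the total_emp column in the sample_07 table of the default database.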

API Usage Examples

HDFS PUT Custom Metadata Example for /user/admin/input_dir Directory

curl http://node1.example.com:7187/api/v13/entities/e461de8de38511a3ac6740dd7d51b8d0 \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{"name":"my_name",
   "description":"My description",
   "tags":["tag1","tag2"],
   "properties":{"property1":"value1","property2":"value2"}}'

HDFS POST Custom Metadata Example for /user/admin/input_dir Directory

curl http://node1.example.com:7187/api/v13/entities/ \
-u username:password \
-X POST \
-H "Content-Type: application/json" \
-d '{"sourceId":"a09b0233cc58ff7d601eaa68673a20c6",
     "parentPath":"/user/admin",
     "originalName":"input_dir",
     "name":"my_name",
     "description":"My description",
     "tags":["tag1","tag2"],
     "properties":{"property1":"value1","property2":"value2"}}'

Hive POST Custom Metadata Example for total_emp Column

curl http://node1.example.com:7187/api/v13/entities/ \
-u username:password \
-X POST \
-H "Content-Type: application/json" \
-d '{"sourceId":"4fbdadc6899638782fc8cb626176dc7b",
     "parentPath":"default/sample_07",
     "originalName":"total_emp",
     "name":"my_name",
     "description":"My description",
     "tags":["tag1","tag2"],
     "properties":{"property1":"value1","property2":"value2"}}'

Hive PUT Managed Metadata Example

This example adds a property and a tag to the entity identified as "14", which happens to be the Hive email_preferences column in the customers sample table. The Approved property is a Boolean to indicate whether or not the metadata for this column was reviewed and approved. The ToBeReviewed tag marks the column temporarily so a data steward can easily find this column for review:
curl http://node1.example.com:7187/api/v13/entities/14 \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{"tags": ["ToBeReviewed"],
     "customProperties": {"Operations":
      {"Approved": false}}}'
The server responds:
{
  "originalName" : "email_preferences",
  "originalDescription" : null,
  "sourceId" : "7",
  "firstClassParentId" : "13",
  "parentPath" : "/default/customers",
  "deleteTime" : null,
  "extractorRunId" : "7##1",
  "customProperties" : {
    "Operations" : {
      "Approved" : false
    }
  },
  "name" : null,
  "description" : null,
  "tags" : [ "ToBeReviewed" ],
  "properties" : {
    "__cloudera_internal__hueLink" : "https://node1.example.com:8889/hue/metastore/table/default/customers"
  },
  "technicalProperties" : null,
  "dataType" : "struct<email_format:string,frequency:string,categories:struct<promos:boolean,surveys:boolean>>",
  "type" : "FIELD",
  "sourceType" : "HIVE",
  "userEntity" : false,
  "metaClassName" : "hv_column",
  "deleted" : false,
  "packageName" : "nav",
  "identity" : "14",
  "internalType" : "hv_column"
}

GET Existing Metadata Examples

This section shows some examples of useful GET /entities calls:

To retrieve a specific entity:

Use the entity identifier that shows in the URL in the Navigator console. This example gets the metadata for the entity with identifier 21302.
curl https://node1.example.com:7187/api/v14/entities/21302 \
-X GET -b NavCookie
To retrieve entities that describe the sources Navigator extracts data from:
curl "https://node1.example.com:7187/api/v14/entities/?query=type%3ASOURCE" \
-b NavCookie -X GET
To use the source in a query, find its identity: In the output, find the name of the source, then collect the identity for that entity. In this example, the HIVE-1 source has the identity "6".
{
  "originalName" : "HIVE-1",
  "originalDescription" : null,
  "sourceId" : null,
  "firstClassParentId" : null,
  "parentPath" : null,
  "deleteTime" : null,
  "extractorRunId" : null,
  "customProperties" : null,
  "name" : "HIVE-1",
  "description" : null,
  "tags" : null,
  "properties" : null,
  "technicalProperties" : null,
  "clusterName" : "Cluster 1",
  "sourceUrl" : "thrift://node2.example.com:9083",
  "sourceType" : "HIVE",
  "sourceExtractIteration" : 15,
  "sourceTemplate" : true,
  "hmsDbHost" : "node1.example.com",
  "hmsDbName" : "hive1",
  "hmsDbPort" : "3306",
  "hmsDbUser" : "hive1",
  "type" : "SOURCE",
  "userEntity" : false,
  "deleted" : null,
  "metaClassName" : "source",
  "packageName" : "nav",
  "identity" : "6",
  "internalType" : "source"
}
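
Picking the identity out of the output can also be scripted. The sketch below assumes the response body is a JSON array of entity objects, as the excerpts in this document suggest; the function name source_identity is hypothetical.

```python
import json


def source_identity(response_text, source_name):
    """Return the identity of the named source, or None if it is not listed."""
    # Assumption: the entities endpoint returns a JSON array of objects.
    for entity in json.loads(response_text):
        if entity.get("type") == "SOURCE" and entity.get("name") == source_name:
            return entity["identity"]
    return None


# Trimmed-down stand-in for a real response, using values from the example.
sample = '[{"name": "HIVE-1", "type": "SOURCE", "identity": "6"}]'
print(source_identity(sample, "HIVE-1"))
```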

To retrieve all entities from a single source:

Run the previous call to determine the identity of the source, then use that value as the sourceId in the query. This example uses sourceId of 6.

curl "https://node1.example.com:7187/api/v14/entities/?query=sourceId%3A6" \
-b NavCookie -X GET

To retrieve all entities marked with a specific tag:

Use quotes around the tag name. This example retrieves metadata for entities tagged with "sensitive".

curl "https://node1.example.com:7187/api/v14/entities/?query=tags%3A%22sensitive%22" \
-b NavCookie -X GET

To retrieve all entities from a single source and marked with a tag:

This example includes a plus sign (+) before each of the query components to ensure they are combined as an AND relation. Note that in this case the query returns the correct results only when the special characters are percent-encoded (for example, %2b for the plus sign and %20 for the space).

curl "https://node1.example.com:7187/api/v14/entities/?query=%2btags%3A%22sensitive%22%20%2bsourceId%3A6" \
-b NavCookie -X GET
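
Encoding such a query by hand is error-prone, so it may help to generate it. This is a minimal sketch using Python's standard urllib.parse.quote; the helper name and_query is hypothetical. Note that quote emits uppercase hex digits (%2B rather than %2b), which is equivalent in percent-encoding.

```python
from urllib.parse import quote


def and_query(*clauses):
    """Prefix each clause with + and percent-encode the combined query."""
    raw = " ".join("+" + clause for clause in clauses)
    # safe="" encodes +, :, quotes, and spaces alike
    return quote(raw, safe="")


print(and_query('tags:"sensitive"', "sourceId:6"))
# prints %2Btags%3A%22sensitive%22%20%2BsourceId%3A6
```

The printed value is what follows ?query= in the request URL.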

To search for pre-registered entities:

This example shows how to get metadata for pre-registered entities, which do not include an internal type:

curl "http://node1.example.com:7187/api/v13/entities/?query=-internalType:*" \
-b NavCookie -X GET

Using the Cloudera Navigator SDK for Metadata Management

To facilitate working with metadata using the Cloudera Navigator APIs, Cloudera provides the Cloudera Navigator SDK. Cloudera Navigator SDK is a client library that provides functionality for extracting and enriching metadata with custom models, entities, and relationships. See GitHub cloudera/navigator-sdk for details.