Metadata Extraction and Indexing

The Navigator Metadata Server extracts metadata for the resource types listed in the following table.
Metadata Extraction by Resource Type (Service, Role)
Resource Type  Metadata Extracted
HDFS           HDFS metadata at the next scheduled extraction run after an HDFS checkpoint. If high availability is enabled, metadata is extracted as soon as it is written to the JournalNodes.
Hive           Database, table, and query metadata from Hive lineage logs. See Managing Hive and Impala Lineage Properties. Hive entities include tables that result from Impala queries and Sqoop jobs.
Impala         Database, table, and query metadata from the Impala Daemon lineage logs. See Managing Hive and Impala Lineage Properties.
MapReduce      Job metadata from the JobTracker. The default setting in Cloudera Manager retains a maximum of five jobs; if you run more than five jobs between Navigator extractions, the Navigator Metadata Server extracts the five most recent jobs.
Oozie          Oozie workflows from the Oozie Server.
Pig            Pig script runs from the JobTracker or Job History Server.
S3             Bucket and object metadata.
Spark          Spark job metadata from YARN logs. (Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.)
Sqoop 1        Database and table metadata from Hive lineage logs; job runs from the JobTracker or Job History Server.
YARN           Job metadata from the ResourceManager.

An entity created at system time t0 is extracted and linked by Cloudera Navigator after the 10-minute (default) extraction poll period and the appropriate service-specific interval, as follows:

  • HDFS: t0 + (extraction poll period) + (HDFS checkpoint interval (1 hour by default))
  • HDFS + HA: t0 + (extraction poll period)
  • Hive: t0 + (extraction poll period) + (Hive maximum wait time (60 minutes by default))
  • Impala: t0 + (extraction poll period)
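
The formulas above can be evaluated with a few lines of Python. This is a minimal sketch that uses only the default values quoted in this section; the function name, the service keys, and the sample timestamp are hypothetical and chosen for illustration, not taken from any Cloudera Navigator API.

```python
from datetime import datetime, timedelta

# Default values as stated in this section; adjust if your cluster differs.
EXTRACTION_POLL_PERIOD = timedelta(minutes=10)
HDFS_CHECKPOINT_INTERVAL = timedelta(hours=1)
HIVE_MAX_WAIT_TIME = timedelta(minutes=60)

# Service-specific interval added on top of the poll period.
# The keys are illustrative labels, not Navigator identifiers.
SERVICE_INTERVAL = {
    "hdfs": HDFS_CHECKPOINT_INTERVAL,
    "hdfs_ha": timedelta(0),
    "hive": HIVE_MAX_WAIT_TIME,
    "impala": timedelta(0),
}


def expected_extraction_time(t0, service, poll_period=EXTRACTION_POLL_PERIOD):
    """Return t0 + poll period + the service-specific interval."""
    return t0 + poll_period + SERVICE_INTERVAL[service]


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0)  # hypothetical creation time t0
    for service in SERVICE_INTERVAL:
        print(service, expected_extraction_time(created, service))
```

With the defaults, an entity created at 12:00 is expected at 13:10 for HDFS and Hive, and at 12:10 for HDFS with high availability and for Impala.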

Metadata Indexing

After metadata is extracted, it is indexed and made available for searching by the embedded Solr engine. The Solr instance indexes entity properties and the relationships between entities.

Use the Cloudera Navigator console to search entity metadata. Relationship metadata is implicitly visible in lineage diagrams and explicitly available by downloading the lineage using the Cloudera Navigator APIs.
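
Entity metadata can also be queried over HTTP. The following sketch only builds a search URL and makes no network call. It assumes the Navigator Metadata Server listens on port 7187 and exposes an entities search endpoint that takes a query parameter. The host name, the API version (v9), the limit parameter value, and the helper name are assumptions for illustration; check the API documentation for your release before relying on them.

```python
from urllib.parse import urlencode


def build_entity_search_url(host, query, port=7187, api_version="v9", limit=10):
    """Build a URL for an entity search request (hypothetical helper).

    The endpoint path and parameter names are assumptions about the
    Navigator REST API, not values confirmed by this document.
    """
    params = urlencode({"query": query, "limit": limit})
    return f"http://{host}:{port}/api/{api_version}/entities/?{params}"


if __name__ == "__main__":
    # navigator.example.com is a placeholder host name.
    print(build_entity_search_url("navigator.example.com", "type:table"))
```

The resulting URL can be passed to any HTTP client together with Navigator credentials.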