Hive and Impala Lineage Configuration

Cloudera Manager Required Role: Configurator (or Cluster Administrator, or Full Administrator)

Unlike other services running in the cluster (such as Pig), Hive and Impala do not have their query lineage extracted by Navigator Metadata Server. Instead, these two services write query data to log files in a specific directory on the cluster node. The Cloudera Manager Agent process running on that node monitors the directory and routinely sends the log files to the Navigator Metadata Server, where the query data is coalesced with other metadata collected by the system.

Lineage collection from Hive and Impala log files is enabled by default. Each of these services has its own Enable Lineage Collection property and some related configuration properties, which can be disabled or reconfigured as detailed below.

Modifying Lineage Collection Settings for Hive

Property | Default | Description
Enable Lineage Collection | Enabled for Hive, Service-Wide | Enable collection of lineage from the service's roles.
Hive Lineage Log Directory (lineage_event_log_dir) | /var/log/hive/lineage | Directory in which Hive lineage log files are written.
Hive Maximum Lineage Log File Size (max_lineage_log_file_size) | 100 MiB | Maximum size (MiB, GiB) of a Hive lineage log file before a new file is created.
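
As a quick check that lineage files are actually being written to the configured directory, a short script can list what the directory contains. The following Python sketch is illustrative only: the function name list_lineage_files is hypothetical, and it makes no assumption about how lineage log files are named, so it reports every regular file in the directory along with its size.

```python
from pathlib import Path


def list_lineage_files(log_dir):
    """Return (name, size_in_bytes) for each regular file in log_dir, oldest first."""
    entries = []
    for path in Path(log_dir).iterdir():
        if path.is_file():  # skip subdirectories and other non-regular entries
            stat = path.stat()
            entries.append((stat.st_mtime, path.name, stat.st_size))
    # Sort by modification time, then by name for files with identical timestamps.
    entries.sort()
    return [(name, size) for _, name, size in entries]
```

On the HiveServer2 host, this could be called as list_lineage_files("/var/log/hive/lineage"), assuming the default directory is in use and the calling user can read it.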

To disable Hive lineage collection:

  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Hive.
  3. Click the Configuration tab.
  4. Type lineage in the Search box.
  5. Click the Enable Lineage Collection check-box to deselect it and disable lineage collection.
  6. Click Save Changes.
  7. Restart the Hive service.

Modifying Lineage Collection Settings for Impala

Property | Default | Description
Enable Impala Lineage Generation (enable_lineage_log) | Enabled for the Impala daemon default group | When enabled, the Impala daemon process creates a log file containing lineage data and stores it in the directory specified by the Impala Daemon Lineage Log Directory property.
Enable Lineage Collection | Enabled for Impala Service-Wide | Enable collection of lineage from the service's roles.
Impala Daemon Lineage Log Directory (lineage_event_log_dir) | /var/log/impalad/lineage | Directory in which Impala daemon lineage log files are written. When the Enable Impala Lineage Generation property is enabled, Impala generates its lineage logs in this directory.
Impala Daemon Maximum Lineage Log File Size (max_lineage_log_file_size) | 5000 | Maximum number of Impala daemon lineage log file entries (queries) written to a file before a new file is created.

The Enable Lineage Collection property determines whether the Cloudera Manager Agent collects the lineage logs. The Enable Impala Lineage Generation property determines whether the Impala Daemon role writes to the lineage log.

To disable lineage collection for Impala queries:

  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Impala.
  3. Click the Configuration tab.
  4. Type lineage in the Search box.
  5. Click the Enable Lineage Collection check-box to deselect it and disable lineage collection.
  6. Click the Enable Impala Lineage Generation check-box to deselect it.
  7. Click Save Changes.
  8. Restart the Impala service.

Deselecting either Enable Lineage Collection or Enable Impala Lineage Generation disables lineage collection for Impala.
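
The same settings could, in principle, be changed through the Cloudera Manager REST API instead of the Admin Console. The Python sketch below only builds the URL and JSON body for such a configuration update and sends nothing. Several details are assumptions rather than facts from this page: the API version string, the host name, the cluster, service, and role config group names, and in particular the API name navigator_lineage_enabled for the Enable Lineage Collection property. Verify each of these against your own deployment before relying on them.

```python
import json
from urllib.parse import quote


def build_config_update(base_url, api_version, cluster, service,
                        settings, role_config_group=None):
    """Return (url, json_body) for a config update intended to be sent with PUT.

    settings maps configuration API names to string values. When
    role_config_group is given, the URL targets that group's configuration
    instead of the service-wide configuration.
    """
    path = "/api/{}/clusters/{}/services/{}".format(
        api_version, quote(cluster, safe=""), quote(service, safe=""))
    if role_config_group is not None:
        path += "/roleConfigGroups/" + quote(role_config_group, safe="")
    path += "/config"
    body = {"items": [{"name": name, "value": value}
                      for name, value in settings.items()]}
    return base_url.rstrip("/") + path, json.dumps(body)


def example_requests():
    """Build two example requests; every name below is hypothetical."""
    # Service-wide: stop the Cloudera Manager Agent from collecting lineage.
    service_wide = build_config_update(
        "https://cm.example.com:7183", "v19", "Cluster 1", "impala",
        {"navigator_lineage_enabled": "false"})  # assumed API name
    # Role level: stop the Impala Daemon from writing lineage logs.
    role_level = build_config_update(
        "https://cm.example.com:7183", "v19", "Cluster 1", "impala",
        {"enable_lineage_log": "false"},
        role_config_group="impala-IMPALAD-BASE")  # assumed group name
    return service_wide, role_level
```

A real request would also need authentication, and the Impala service still has to be restarted afterward, as in the steps above.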

Configuring Hive on Spark and Impala Daemon Lineage Logs

If the value of a log directory property is changed and the service is restarted, the Cloudera Manager Agent starts monitoring the new log directory. In that case, some events in the old directory might never be published. To avoid losing lineage information when changing this property, perform the following steps:
  1. Stop the affected service.
  2. Copy the lineage log files and (for Impala only) the impalad_lineage_wal file from the old log directory to the new log directory. Do this on the HiveServer2 host and on every host where an Impala Daemon role is running.
  3. Start the service.
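
The copy in step 2 can be scripted. The following Python sketch is one possible approach, not a Cloudera-supplied tool: the function name copy_lineage_files is hypothetical, and it assumes the old log directory holds only lineage files (including impalad_lineage_wal for Impala), so it copies every regular file it finds. It never overwrites a file that already exists in the new directory.

```python
import shutil
from pathlib import Path


def copy_lineage_files(old_dir, new_dir):
    """Copy each regular file from old_dir to new_dir; return the names copied.

    Files that already exist in new_dir are left untouched.
    """
    old, new = Path(old_dir), Path(new_dir)
    new.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(old.iterdir()):
        if not src.is_file():
            continue  # ignore subdirectories
        dst = new / src.name
        if dst.exists():
            continue  # do not overwrite anything in the new directory
        shutil.copy2(src, dst)  # copies content, mode, and timestamps
        copied.append(src.name)
    return copied
```

Run it while the service is stopped, on each affected host. Note that shutil.copy2 does not preserve file ownership, so check afterward that the service user can read and write the copied files.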

To edit lineage log properties:

  1. Go to the service.
  2. Click the Configuration tab.
  3. Type lineage in the Search box.
  4. Edit the lineage log properties.
  5. Click Save Changes.
  6. Restart the service.