Using the Hive Metastore with Kudu

Kudu has an optional feature which allows it to integrate its own catalog with the Hive Metastore (HMS). The HMS is the de facto standard catalog and metadata provider in the Hadoop ecosystem. When the HMS integration is enabled, Kudu tables can be discovered and used by external HMS-aware tools, even if they are not otherwise aware of or integrated with Kudu. Additionally, these components can use the HMS to discover the information they need to connect to the Kudu cluster which owns the table, such as the Kudu master addresses.

Databases and Table Names

With the Hive Metastore integration disabled, Kudu presents tables as a single flat namespace, with no hierarchy or concept of a database. Additionally, Kudu's only restriction on table names is that they be a valid UTF-8 encoded string. When the HMS integration is enabled in Kudu, both of these properties change in order to match the HMS model: the table name must indicate the table's membership of a Hive database, and table name identifiers (i.e. the table name and database name) are subject to the Hive table name identifier constraints.

Databases

Hive has the concept of a database, which is a collection of individual tables. Each database forms its own independent namespace of table names.

In order to fit into this model, Kudu tables must be assigned a database when the HMS integration is enabled. No new APIs have been added to create or delete databases, nor are there APIs to assign an existing Kudu table to a database. Instead, a new convention has been introduced that Kudu table names must be in the format <hive-database-name>.<hive-table-name>. Databases are thus an implicit part of the Kudu table name, which allows existing applications that use Kudu tables to operate on non-HMS-integrated and HMS-integrated table names with minimal or no changes.
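
As a rough illustration of this convention, the following shell sketch splits a Kudu table name of the form <hive-database-name>.<hive-table-name> into its two parts. The function name split_kudu_table_name is hypothetical and is not part of any Kudu tooling. The sketch assumes the first period separates the database name from the table name.

```shell
# Hypothetical helper (not a Kudu tool): split "<database>.<table>" into
# its database and table parts, printed as "database table".
split_kudu_table_name() {
  name=$1
  case $name in
    *.*) ;;  # contains a period, so a database part is present
    *) echo "no database in table name: $name" >&2; return 1 ;;
  esac
  db=${name%%.*}   # text before the first period
  tbl=${name#*.}   # text after the first period
  printf '%s %s\n' "$db" "$tbl"
}

split_kudu_table_name "default.my_table"   # prints: default my_table
```

A name without a period, such as one created before the integration was enabled, makes the helper return a non-zero status.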

Kudu provides no additional tooling to create or drop Hive databases. Administrators or users should use existing Hive tools such as the Beeline Shell or Impala to do so.

Naming Constraints

When the Hive Metastore integration is enabled, the database and table names of Kudu tables must follow the Hive Metastore naming constraints. Namely, the database and table name must contain only alphanumeric ASCII characters and underscores.
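
The constraint above can be checked before creating or renaming a table. The following is a minimal sketch, assuming the check is applied to the database name and the table name separately. The function name is_valid_hms_identifier is hypothetical, and the sketch is not how Kudu or the HMS implements the check.

```shell
# Assumption: force the C locale so the character ranges below cover ASCII only.
LC_ALL=C

# Hypothetical helper: succeed only if the identifier is non-empty and consists
# solely of ASCII letters, digits, and underscores.
is_valid_hms_identifier() {
  case $1 in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

for name in my_table my-table sales_2024; do
  if is_valid_hms_identifier "$name"; then
    echo "$name: ok"
  else
    echo "$name: not Hive-conformant"
  fi
done
```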

Additionally, the Hive Metastore does not enforce case sensitivity for table name identifiers. As such, when the integration is enabled, Kudu follows suit and disallows creating a table when one already exists whose table name identifier differs only by case. Operations that open, alter, or drop tables are also case-insensitive for the table name identifiers.
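
To make the case-insensitivity rule concrete, the sketch below finds names in a list that collide once case is ignored. The function name find_case_conflicts is hypothetical, and the input is assumed to be one table name per line with no exact duplicates. It illustrates the rule only and does not reflect how Kudu performs the check.

```shell
# Hypothetical helper: read table names on stdin (one per line) and print the
# lowercased form of every name that occurs more than once after lowercasing.
find_case_conflicts() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

printf '%s\n' default.Users default.users default.orders | find_case_conflicts
# prints: default.users
```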

Metadata Synchronization

When the Hive Metastore integration is enabled, Kudu will automatically synchronize metadata changes to Kudu tables between Kudu and the HMS. As such, it is important to always ensure that Kudu and the HMS have a consistent view of existing tables, using the tools described in the Administrative Tools section below. Failure to do so may result in issues such as Kudu tables not being discoverable or usable by external, HMS-aware components (e.g. Apache Sentry, Apache Impala).

Impala has notions of internal and external Kudu tables. When dropping an internal table from Impala, the table's data is dropped in Kudu; in contrast, when dropping an external table, the table's data is not dropped in Kudu. External tables may refer to tables by names that are different from the names of the underlying Kudu tables, while internal tables must use the same names as those stored in Kudu. Additionally, multiple external tables may refer to the same underlying Kudu table. Thus, since external tables may not map one-to-one with Kudu tables, the Hive Metastore integration and tooling will only automatically synchronize metadata for internal tables. See the Kudu-Impala integration documentation for more information about table types in Impala.

Enabling the Hive Metastore Integration

Before enabling the Hive Metastore integration on an existing cluster, make sure to upgrade any tables that may exist in Kudu's or in the HMS's catalog. See the Upgrading Existing Tables section for more details.

Setup using Cloudera Manager

  1. When the Hive Metastore is configured with fine-grained authorization using Apache Sentry and the Sentry HDFS Sync feature is enabled, the Kudu admin needs to be able to access and modify directories that are created for Kudu by the HMS. This can be done by adding the Kudu admin user to the group of the Hive service users, e.g. by running the usermod -aG hive kudu command on the HMS nodes.
  2. Go to the Hive service.
  3. Click the Configuration tab.
  4. Select the Kudu Service with which the Hive Metastore will synchronize the Kudu tables.

Manual Setup

  1. When the Hive Metastore is configured with fine-grained authorization using Apache Sentry and the Sentry HDFS Sync feature is enabled, the Kudu admin needs to be able to access and modify directories that are created for Kudu by the HMS. This can be done by adding the Kudu admin user to the group of the Hive service users, e.g. by running the usermod -aG hive kudu command on the HMS nodes.
  2. Configure the Hive Metastore to include the notification event listener and the Kudu HMS plugin, to allow altering and dropping columns, and to add full Thrift objects in notifications. Add the following values to the HMS configuration in hive-site.xml:
    <property>
      <name>hive.metastore.transactional.event.listeners</name>
      <value>
        org.apache.hive.hcatalog.listener.DbNotificationListener,
        org.apache.kudu.hive.metastore.KuduMetastorePlugin
      </value>
    </property>
    
    <property>
      <name>hive.metastore.disallow.incompatible.col.type.changes</name>
      <value>false</value>
    </property>
    
    <property>
      <name>hive.metastore.notifications.add.thrift.objects</name>
      <value>true</value>
    </property>
  3. After building Kudu from source, add the hms-plugin.jar found under the build directory (e.g. build/release/bin) to the HMS classpath.
  4. Restart the HMS.
  5. Enable the Hive Metastore integration in Kudu with the following configuration properties for the Kudu master(s):

    --hive_metastore_uris=<HMS Thrift URI(s)>
    --hive_metastore_sasl_enabled=<value of the Hive Metastore's hive.metastore.sasl.enabled configuration>
  6. Restart the Kudu master(s).
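
The two master flags from step 5 can be generated from the values used in the HMS configuration. The following is a small sketch, assuming the values are passed in by hand. The function name render_hms_master_flags is hypothetical, and the example URI is only a placeholder. How the resulting flags reach the master process depends on the deployment.

```shell
# Hypothetical helper: print the two Kudu master flags for the HMS integration.
#   $1: HMS Thrift URI(s), e.g. thrift://hive-metastore:9083
#   $2: value of hive.metastore.sasl.enabled (true or false)
render_hms_master_flags() {
  uris=$1
  sasl=$2
  case $sasl in
    true|false) ;;
    *) echo "sasl value must be true or false" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "--hive_metastore_uris=$uris" \
    "--hive_metastore_sasl_enabled=$sasl"
}

render_hms_master_flags "thrift://hive-metastore:9083" false
```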

Administrative Tools

Kudu provides the command line tools kudu hms list, kudu hms precheck, kudu hms check, and kudu hms fix to allow administrators to find and fix metadata inconsistencies between the internal Kudu catalog and the Hive Metastore catalog, during the upgrade process described below or during the normal operation of a Kudu cluster.

The kudu hms tools should be run from the command line as the Kudu admin user. They require the full list of master addresses to be specified:
$ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051

To see a full list of the options available with the kudu hms tool, use the --help flag.
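
Because every kudu hms invocation needs the same comma-separated master list, it can be convenient to build that list once and reuse it. The sketch below assumes all masters listen on the same port. The function name master_addresses is hypothetical, and the host names are the placeholders used in this document. The final line only prints the command instead of running it.

```shell
# Hypothetical helper: join host names into "host:port,host:port,...".
#   $1: port shared by all masters; remaining arguments: master host names
master_addresses() {
  port=$1
  shift
  result=
  for host in "$@"; do
    # Prepend a comma only when result is already non-empty.
    result="${result:+$result,}${host}:${port}"
  done
  printf '%s\n' "$result"
}

MASTERS=$(master_addresses 7051 master-name-1 master-name-2 master-name-3)
echo "sudo -u kudu kudu hms check $MASTERS"
# prints: sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051
```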

kudu hms list

The kudu hms list tool scans the Hive Metastore catalog, and lists the HMS entries (including table name and type) for Kudu tables, as indicated by their HMS storage handler.

kudu hms check

The kudu hms check tool scans the Kudu and Hive Metastore catalogs, and validates that the two catalogs agree on what Kudu tables exist. The tool will make suggestions on how to fix any inconsistencies that are found. Typically, the suggestion will be to run the kudu hms fix tool; however, certain inconsistencies must be fixed using Impala Shell.

kudu hms precheck

The kudu hms precheck tool scans the Kudu catalog, checks whether there are multiple Kudu tables whose names differ only by case, and logs the conflicting table names.

kudu hms fix

The kudu hms fix tool analyzes the Kudu and HMS catalogs and attempts to fix any automatically-fixable issues, for instance, by creating a table entry in the HMS for each Kudu table that doesn't already have one. The --dryrun option shows the proposed fix instead of actually executing it. When no automatic fix is available, the tool suggests how to fix the issue manually.

kudu hms downgrade

The kudu hms downgrade tool downgrades the metadata to the legacy format for Kudu and the Hive Metastore. Using it is discouraged unless necessary, since the legacy format may be deprecated in future releases.

Upgrading Existing Tables

Before enabling the Kudu-HMS integration, it is important to ensure that Kudu and the HMS start with a consistent view of existing tables. This may entail renaming Kudu tables to conform to the Hive naming constraints. The following workflow describes how to upgrade existing tables before enabling the Hive Metastore integration.

Prepare for the Upgrade

  • Establish a maintenance window. During this time the Kudu cluster will still be available, but tables in Kudu and the Hive Metastore may be altered or renamed as a part of the upgrade process.
  • Make note of all external tables using the following command and drop them. This reduces the chance of naming conflicts with Kudu tables, which can lead to errors during the upgrade process. It also helps in cases where a catalog upgrade breaks external tables because the underlying Kudu tables were renamed. The external tables can be recreated after the upgrade is complete.
    $ sudo -u kudu kudu hms list master-name-1:7051,master-name-2:7051,master-name-3:7051

Perform the Upgrade

  1. Run the kudu hms precheck tool to ensure that no Kudu tables differ only by case. If the tool reports no warnings, no tables need to be renamed. Otherwise, rename each conflicting table using the following command:
    $ sudo -u kudu kudu table rename_table master-name-1:7051,master-name-2:7051,master-name-3:7051 <conflicting_table_name> <new_table_name>
  2. Run the kudu hms check tool using the following command. If the tool does not report any catalog inconsistencies, skip to Step 7 below.
    $ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> [--ignore_other_clusters=<ignore_other_clusters>]
    For example:
    $ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --ignore_other_clusters=false
  3. If the kudu hms check tool reports an inconsistent catalog, perform a dry-run of the kudu hms fix tool to understand how the tool will attempt to address the automatically-fixable issues.
    $ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> --dryrun=true [--ignore_other_clusters=<ignore_other_clusters>]
    For example:
    $ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --dryrun=true --ignore_other_clusters=false
  4. Manually fix any issues that are reported by the check tool that cannot be automatically fixed. For example, rename any tables with names that are not Hive-conformant.
  5. Run the kudu hms fix tool to automatically fix all the remaining issues.
    $ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> [--drop_orphan_hms_tables=<drop_orphan_hms_tables>] [--ignore_other_clusters=<ignore_other_clusters>]
    For example:
    $ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --ignore_other_clusters=false
  6. Recreate any external tables that were dropped when preparing for the upgrade by using Impala Shell.
  7. Enable the Hive Metastore Integration as described in the Enabling the Hive Metastore Integration section.
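
The order of the tool invocations in this workflow can be summarized with a short sketch that prints the commands rather than running them. The function name print_upgrade_plan is hypothetical, and the master and HMS addresses are the placeholder values used in the examples above. The manual steps (renaming conflicting tables, fixing issues by hand, and recreating external tables) are not represented.

```shell
# Hypothetical helper: print the kudu hms commands of the upgrade workflow in
# order, without executing anything.
#   $1: comma-separated Kudu master addresses
#   $2: HMS Thrift URI(s)
print_upgrade_plan() {
  masters=$1
  hms_uris=$2
  echo "sudo -u kudu kudu hms precheck $masters"
  echo "sudo -u kudu kudu hms check $masters --hive_metastore_uris=$hms_uris"
  echo "sudo -u kudu kudu hms fix $masters --hive_metastore_uris=$hms_uris --dryrun=true"
  echo "sudo -u kudu kudu hms fix $masters --hive_metastore_uris=$hms_uris"
}

print_upgrade_plan \
  "master-name-1:7051,master-name-2:7051,master-name-3:7051" \
  "thrift://hive-metastore:9083"
```

Printing the plan first makes it easy to review the addresses and flags before running each command by hand and inspecting its output.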