Managing Hive

Apache Hive is a powerful data warehousing application for Hadoop. It enables you to access your data using HiveQL, a language similar to SQL.

Hive Roles

Hive is implemented in three roles:
  • Hive metastore - Provides metastore services when Hive is configured with a remote metastore.

    Cloudera recommends using a remote Hive metastore, especially for CDH 4.2 or higher. Because the remote metastore is recommended, Cloudera Manager treats the Hive Metastore Server as a required role for all Hive services. A remote metastore provides the following benefits:

    • The Hive metastore database password and JDBC drivers do not need to be shared with every Hive client; only the Hive Metastore Server needs them. Sharing passwords with many hosts is a security risk.
    • You can control activity on the Hive metastore database. To stop all activity on the database, stop the Hive Metastore Server. This simplifies backups and upgrades, both of which require all Hive activity to stop.
    See Configuring the Hive Metastore (CDH 4) or Configuring the Hive Metastore (CDH 5).

    For information about configuring a remote Hive metastore database with Cloudera Manager, see Cloudera Manager and Managed Service Datastores. To configure high availability for the Hive metastore, see Hive Metastore High Availability.

  • HiveServer2 - Enables remote clients to run Hive queries, and supports a Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. A CLI named Beeline is also included. See HiveServer2 documentation (CDH 4) or HiveServer2 documentation (CDH 5) for more information.
  • WebHCat - HCatalog is a table and storage management layer for Hadoop that makes the same table information available to Hive, Pig, MapReduce, and Sqoop. Table definitions are maintained in the Hive metastore, which HCatalog requires. WebHCat allows you to access HCatalog using an HTTP (REST style) interface.

Hive Execution Engines

Hive in CDH supports two execution engines: MapReduce and Spark. To configure an execution engine, perform one of the following steps:
  • Beeline - (Can be set per query) Run the set hive.execution.engine=engine command, where engine is either mr or spark. The default is mr. For example:
    set hive.execution.engine=spark;
    To determine the current setting, run:
    set hive.execution.engine;
  • Cloudera Manager - (Affects all queries, not recommended)
    1. Go to the Hive service.
    2. Click the Configuration tab.
    3. Search for "execution".
    4. Set the Default Execution Engine property to MapReduce or Spark. The default is MapReduce.
    5. Click Save Changes to commit the changes.
    6. Return to the Home page by clicking the Cloudera Manager logo.
    7. Click the stale configuration icon to invoke the cluster restart wizard.
    8. Click Restart Stale Services.
    9. Click Restart Now.
    10. Click Finish.
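
As a rough sketch of the per-query approach, a Beeline session might switch engines for a single statement and then switch back. The table name web_logs and its status column are hypothetical placeholders, and the spark setting assumes that Hive on Spark is already configured on the cluster.

```sql
-- Show the engine currently in effect for this session.
set hive.execution.engine;

-- Run one aggregation on Spark (web_logs is a hypothetical table).
set hive.execution.engine=spark;
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;

-- Return to MapReduce for the remaining statements in the session.
set hive.execution.engine=mr;
```

Because the setting is changed only inside the session, other users and jobs keep the service-wide default.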

Transaction (ACID) Support in Hive

The CDH distribution of Hive does not support transactions (HIVE-5317). Currently, transaction support in Hive is an experimental feature that only works with the ORC file format. Cloudera recommends using the Parquet file format, which works across many tools. Instead of relying on transactions, merge updates into Hive tables using existing functionality, including statements such as INSERT, INSERT OVERWRITE, and CREATE TABLE AS SELECT.
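
One possible way to merge updates without transactions is sketched below. It is an illustration, not a prescribed procedure: the tables customers and customer_updates, the staging table customers_merged, and the columns id, name, and email are all hypothetical, and the sketch assumes that id is unique within each table. It applies new and changed rows but does not handle deletes.

```sql
-- Hypothetical tables: customers holds the current data and
-- customer_updates holds new and changed rows (columns: id, name, email).

-- 1. Build the merged result in a Parquet staging table: every row from
--    the updates, plus each existing row that has no matching update.
CREATE TABLE customers_merged STORED AS PARQUET AS
SELECT merged.id, merged.name, merged.email
FROM (
  SELECT upd.id, upd.name, upd.email
  FROM customer_updates upd
  UNION ALL
  SELECT cur.id, cur.name, cur.email
  FROM customers cur
  LEFT OUTER JOIN customer_updates chg ON cur.id = chg.id
  WHERE chg.id IS NULL
) merged;

-- 2. Replace the contents of the target table with the merged rows.
INSERT OVERWRITE TABLE customers
SELECT id, name, email
FROM customers_merged;

-- 3. Remove the staging table.
DROP TABLE customers_merged;
```

Staging the merged rows in a separate table keeps the read of customers apart from the overwrite of customers, at the cost of one extra copy of the data.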