Using Apache Hive with CDH
Hive data warehouse software enables reading, writing, and managing large datasets in distributed storage. Queries written in the Hive query language (HiveQL), which is very similar to SQL, are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark.
Users can run batch processing workloads with Hive while also analyzing the same data for interactive SQL or machine-learning workloads using tools like Apache Impala (incubating) or Apache Spark, all within a single platform.
As part of CDH, Hive also benefits from:
- Unified resource management provided by YARN
- Simplified deployment and administration provided by Cloudera Manager
- Shared security and governance to meet compliance requirements provided by Apache Sentry and Cloudera Navigator
Continue reading:
- Use Cases for Hive
- The Metastore Database
- Hive on Spark
- Hive and HBase
- Hive on Amazon S3
- Hive on Microsoft Azure Data Lake Store
- Transaction (ACID) Support in Hive
- Managing Hive
- Ingesting Data with Hive
- High Availability
- Upstream Information for Hive
Use Cases for Hive
Because Hive is a petabyte-scale data warehouse system built on the Hadoop platform, it is a good choice for environments experiencing rapid growth in data volume. The underlying MapReduce interface with HDFS is hard to program directly, but Hive provides a SQL interface, making it possible to use existing programming skills to perform data preparation.
Hive on MapReduce or Spark is best suited for batch data preparation or ETL:
- You must run scheduled batch jobs with very large ETL sorts with joins to prepare data for Hadoop. Most data served to BI users in Impala is prepared by ETL developers using Hive.
- You run data transfer or conversion jobs that take many hours. With Hive, if a problem occurs partway through such a job, the job recovers and continues.
- You receive or provide data in diverse formats, where the Hive SerDes and variety of UDFs make it convenient to ingest and convert the data. Typically, the final stage of the ETL process with Hive might be a conversion to a high-performance, widely supported format such as Parquet.
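As a rough sketch of that final conversion stage, the script below writes a short HiveQL file that copies rows from a staging table into a Parquet-backed table, and runs it only if a `hive` client is on the PATH. The table names `raw_events` and `events_parquet`, the column names, and the file location are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical script location; the table and column names below are placeholders.
HQL_FILE="${TMPDIR:-/tmp}/etl_to_parquet.hql"

# CREATE TABLE AS SELECT rewrites the staged rows into a new Parquet table.
cat > "$HQL_FILE" <<'EOF'
CREATE TABLE events_parquet
STORED AS PARQUET
AS
SELECT event_id, user_id, event_time
FROM raw_events
WHERE event_id IS NOT NULL;
EOF

# Run the file only where a Hive client exists; otherwise report where it was written.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
else
  echo "hive client not found; HiveQL written to $HQL_FILE"
fi
```

Keeping the HiveQL in its own file makes it straightforward to hand the same statements to a scheduler for recurring batch runs.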
On a cluster managed by Cloudera Manager, Hive comes along with the base CDH installation and does not need to be installed separately. With Cloudera Manager, you can enable or disable the Hive service, but the Hive component always remains present on the cluster.
On an unmanaged cluster, you can install Hive manually, using packages or tarballs with the appropriate command for your operating system.
|Operating system|Command|
|---|---|
|RHEL-compatible|$ sudo yum install <pkg1> <pkg2> ...|
|SLES|$ sudo zypper install <pkg1> <pkg2> ...|
|Ubuntu or Debian|$ sudo apt-get install <pkg1> <pkg2> ...|
Hive packages include:
- hive – base package that provides the complete language and runtime
- hive-metastore – provides scripts for running the metastore as a standalone service (optional)
- hive-server2 – provides scripts for running HiveServer2
- hive-hbase – optional; install this package if you want to use Hive with HBase
See Installing Hive for details about installing and configuring Hive components.
To access the Hive server with JDBC clients, such as Beeline, install the JDBC driver for HiveServer2 that is defined in org.apache.hive.jdbc.HiveDriver.
See Installing the Hive JDBC Driver on Clients for details about installing the JDBC drivers and the connection URLs to use to connect to HiveServer2 from Hive clients.
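For orientation, a HiveServer2 JDBC URL generally takes the form `jdbc:hive2://<host>:<port>/<database>`. The sketch below assembles such a URL and passes it to Beeline if Beeline is installed. The host name `hs2.example.com` is hypothetical, and port 10000 is assumed to be the HiveServer2 port in use; check the linked topic for the URL that applies to your deployment.

```shell
#!/bin/sh
# Hypothetical host and database; 10000 is assumed to be the HiveServer2 port.
HS2_HOST="hs2.example.com"
HS2_PORT=10000
HS2_DB="default"

JDBC_URL="jdbc:hive2://${HS2_HOST}:${HS2_PORT}/${HS2_DB}"
echo "Connecting with: $JDBC_URL"

# -u supplies the JDBC URL, -n the user name, -e a statement to execute.
if command -v beeline >/dev/null 2>&1; then
  beeline -u "$JDBC_URL" -n "$(id -un)" -e "SHOW DATABASES;"
else
  echo "beeline not found; install a Hive client first"
fi
```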
Upgrade Hive on all the hosts on which it is running, including both servers and clients.
See Upgrading Hive for details about deprecated versions, upgrading best practices, and information about upgrading the Hive metastore schema.
Hive offers a number of configuration settings related to performance, file layout and handling, and options to control SQL semantics. Depending on your cluster size and workloads, configure HiveServer2 memory, table locking behavior, and authentication for connections. See Configuring HiveServer2 for details about the required configuration changes.
The Hive metastore service, which stores the metadata for Hive tables and partitions, must also be configured. See Configuring the Hive Metastore for details about deployment modes, information about supported metastore databases, and specific configurations for MySQL, PostgreSQL, and Oracle.
To configure Hive to use the Amazon S3 filesystem for transient ETL jobs, see Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem.
The Metastore Database
The metastore database is an important aspect of the Hive infrastructure. It is a separate database, relying on a traditional RDBMS such as MySQL or PostgreSQL, that holds metadata about Hive databases, tables, columns, partitions, and Hadoop-specific information such as the underlying data files and HDFS block locations.
The metastore database is shared by other components. For example, the same tables can be inserted into, queried, altered, and so on by both Hive and Impala. Although you might see references to the "Hive metastore", be aware that the metastore database is used broadly across the Hadoop ecosystem, even in cases where you are not using Hive itself.
The metastore database is relatively compact, with fast-changing data. Backup, replication, and other kinds of management operations affect this database. See Configuring the Hive Metastore for details about configuring the Hive metastore.
Cloudera recommends that you deploy the Hive metastore, which stores the metadata for Hive tables and partitions, in "remote mode." In this mode, the metastore service runs in its own JVM process, and other services, such as HiveServer2, HCatalog, and Apache Impala (incubating), communicate with the metastore using the Thrift network API.
See Starting the Metastore for details about starting the Hive metastore service.
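In remote mode, clients typically locate the metastore through the `hive.metastore.uris` property. The sketch below generates a `hive-site.xml` fragment for that property. The host name `metastore.example.com` and the output path are hypothetical, and port 9083 is assumed as the metastore Thrift port. On a managed cluster, Cloudera Manager would normally generate this setting, so treat the fragment as an illustration for unmanaged deployments.

```shell
#!/bin/sh
# Hypothetical metastore host and output path; 9083 is an assumed Thrift port.
METASTORE_HOST="metastore.example.com"
METASTORE_PORT=9083
FRAGMENT="${TMPDIR:-/tmp}/hive-site-metastore-fragment.xml"

# Clients of a remote metastore read its Thrift endpoint from hive.metastore.uris.
cat > "$FRAGMENT" <<EOF
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://${METASTORE_HOST}:${METASTORE_PORT}</value>
</property>
EOF

# Print the fragment so it can be reviewed before merging into hive-site.xml.
cat "$FRAGMENT"
```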
HiveServer2 is a server interface that enables remote clients to submit queries to Hive and retrieve the results. It replaces HiveServer1, which has been deprecated and will be removed in a future release of CDH. HiveServer2 supports multi-client concurrency, capacity planning controls, Sentry authorization, Kerberos authentication, LDAP, and SSL, and provides better support for JDBC and ODBC clients.
HiveServer2 is a container for the Hive execution engine. For each client connection, it creates a new execution context that serves Hive SQL requests from the client. It supports JDBC clients, such as the Beeline CLI, and ODBC clients. Clients connect to HiveServer2 through the Thrift API-based Hive service.
See Configuring HiveServer2 for details on configuring HiveServer2 and see Starting, Stopping, and Using HiveServer2 for details on starting and stopping the HiveServer2 service and information about using the Beeline CLI to connect to HiveServer2. For details about managing HiveServer2 with its native web user interface (UI), see HiveServer2 Web UI.
Hive on Spark
Hive traditionally uses MapReduce behind the scenes to parallelize the work and perform the low-level steps of processing a SQL statement, such as sorting and filtering. Hive can also use Spark as the underlying computation and parallelization engine. See Running Hive on Spark for details about configuring Hive to use Spark as its execution engine and see Tuning Hive on Spark for details about tuning Hive on Spark.
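At the session level, the engine choice is usually expressed through the `hive.execution.engine` property. The following sketch writes a HiveQL file that selects Spark for one session and then runs an aggregation. The table `web_logs`, its `status` column, and the HiveServer2 host are hypothetical, and the cluster is assumed to already be configured for Hive on Spark as described in the linked topics.

```shell
#!/bin/sh
HQL_FILE="${TMPDIR:-/tmp}/hive_on_spark_session.hql"

cat > "$HQL_FILE" <<'EOF'
-- Switch this session from MapReduce ("mr") to Spark.
SET hive.execution.engine=spark;
-- Hypothetical table; an aggregation like this launches a job on the chosen engine.
SELECT status, COUNT(*) FROM web_logs GROUP BY status;
EOF

JDBC_URL="jdbc:hive2://hs2.example.com:10000/default"  # hypothetical host

# Submit the file through Beeline when it is available.
if command -v beeline >/dev/null 2>&1; then
  beeline -u "$JDBC_URL" -f "$HQL_FILE"
else
  echo "beeline not found; statements written to $HQL_FILE"
fi
```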
Hive and HBase
Apache HBase is a NoSQL database that supports real-time read/write access to large datasets in HDFS. See Using Hive with HBase for details about configuring Hive to use HBase. For information about running Hive queries on a secure HBase server, see Using Hive to Run Queries on a Secure HBase Server.
Hive on Amazon S3
Use the Amazon S3 filesystem to efficiently manage transient Hive ETL (extract-transform-load) jobs. For step-by-step instructions to configure Hive to use S3 and multiple scripting examples, see Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem. To optimize how Hive writes data to and reads data from S3-backed tables and partitions, see Tuning Hive Performance on the Amazon S3 Filesystem. For information about setting up a shared Amazon Relational Database Service (RDS) as your Hive metastore, see How To Set Up a Shared Amazon RDS as Your Hive Metastore.
Hive on Microsoft Azure Data Lake Store
In CDH 5.11, Hive on MapReduce2 can access tables on Microsoft Azure Data Lake Store (ADLS). In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-compliant ACLs. See Configuring Azure Data Lake Store to Use with CDH for information about configuring and using ADLS with Hive on MapReduce2.
Transaction (ACID) Support in Hive
The CDH distribution of Hive does not support transactions (HIVE-5317). Currently, transaction support in Hive is an experimental feature that only works with the ORC file format. Cloudera recommends using the Parquet file format, which works across many tools. Merge updates in Hive tables using existing functionality, including statements such as INSERT, INSERT OVERWRITE, and CREATE TABLE AS SELECT.
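One way to merge updates without transactions is to rebuild the data with CREATE TABLE AS SELECT: keep the existing rows that have no update, then append every row from an updates table. The sketch below writes such a statement to a file. The tables `customers`, `customer_updates`, and `customers_merged` and their columns are hypothetical, and this is only one possible pattern, not a method prescribed by Cloudera.

```shell
#!/bin/sh
HQL_FILE="${TMPDIR:-/tmp}/merge_without_acid.hql"

cat > "$HQL_FILE" <<'EOF'
-- Hypothetical tables: customers (current data), customer_updates (new or changed rows).
CREATE TABLE customers_merged
STORED AS PARQUET
AS
SELECT merged.id, merged.name, merged.email
FROM (
  -- Existing rows that have no matching update are kept unchanged.
  SELECT c.id, c.name, c.email
  FROM customers c
  LEFT OUTER JOIN customer_updates u ON c.id = u.id
  WHERE u.id IS NULL
  UNION ALL
  -- Every row from the updates table replaces or extends the old data.
  SELECT u.id, u.name, u.email
  FROM customer_updates u
) merged;
EOF

# Run only where a Hive client exists; otherwise report where the file was written.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
else
  echo "hive client not found; HiveQL written to $HQL_FILE"
fi
```

Writing to a new table leaves the original data untouched until the merged result has been checked.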
Managing Hive
Cloudera recommends using Cloudera Manager to manage Hive services; such deployments are called managed deployments. If yours is not a managed deployment, configure the HiveServer2 Web UI to manage Hive services.
Using Cloudera Manager to Manage Hive
Cloudera Manager uses the Hive metastore, HiveServer2, and the WebHCat roles to manage the Hive service across your cluster. Using Cloudera Manager, you can configure the Hive metastore, the execution engine (either MapReduce or Spark), and manage HiveServer2.
Using HiveServer2 Web UI to Manage Hive
The HiveServer2 web UI provides access to Hive configuration settings, local logs, metrics, and information about active sessions and queries. The HiveServer2 web UI is enabled in newly created clusters running CDH 5.7 and higher, and those using Kerberos are configured for SPNEGO. Clusters upgraded from a previous CDH version must be configured to enable the web UI; see HiveServer2 Web UI Configuration.
Ingesting Data with Hive
Hive can ingest data into several different file formats, such as Parquet, Avro, TEXTFILE, or RCFile. If you are setting up a data pipeline where Apache Impala (incubating) is involved on the query side, use Parquet. See Using Apache Parquet Data Files with CDH for general information about the Parquet file format and for information about using Parquet tables in Hive. If a custom file format is required, you can extend the Hive SerDes. See the Apache Hive wiki for information about the Hive SerDes and how to write your own for Hive.
See Using Avro Data Files in Hive for details about using Avro to ingest data into Hive tables and about using Snappy compression on the output files.
Column and Table Statistics for Query Optimization
Hive statistics include the number of rows in a table or partition and histograms of columns of interest. The cost functions of the query optimizer use these statistics to generate query plans.
See Hive Table Statistics for details about collecting statistics for Hive.
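Statistics are typically gathered with the ANALYZE TABLE statement. The sketch below writes statements that collect table-level and column-level statistics and then display the table metadata. The table `sales` and the columns `customer_id` and `amount` are hypothetical, and the table is assumed to be unpartitioned.

```shell
#!/bin/sh
HQL_FILE="${TMPDIR:-/tmp}/collect_stats.hql"

cat > "$HQL_FILE" <<'EOF'
-- Table-level statistics such as the row count (hypothetical unpartitioned table).
ANALYZE TABLE sales COMPUTE STATISTICS;
-- Column-level statistics for the columns used in joins and filters.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount;
-- Show the table metadata, including the gathered table parameters.
DESCRIBE FORMATTED sales;
EOF

# Run only where a Hive client exists; otherwise report where the file was written.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
else
  echo "hive client not found; HiveQL written to $HQL_FILE"
fi
```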
Tuning Hive consists of configuring numerous Hive parameters for better performance and scalability. The most important among these settings is configuring sufficient memory for HiveServer2 and the Hive metastore. This includes allocating memory for heap size based upon the number of concurrent connections that are typical for your deployment. Configuring garbage collection limits and keeping the number of table partitions below recommended limits are also important when tuning Hive performance. See Tuning Hive for details about recommended limits and best practices. If you are using Spark as your execution engine, see Tuning Hive on Spark.
High Availability
Enable high availability for Hive by configuring a load balancer to manage HiveServer2 and by enabling high availability for the Hive metastore.
To enable high availability for multiple HiveServer2 hosts, configure a load balancer to manage them. To increase stability and security, configure the load balancer on a proxy server.
See HiveServer2 High Availability for details about configuring a load balancer for HiveServer2.
You can enable Hive metastore high availability (HA) so that your cluster is resilient to failures if a metastore becomes unavailable. When HA mode is enabled, one of the metastores is designated as the master and the others are slaves. If a master metastore fails, one of the slave metastores takes over.
See Hive Metastore High Availability for details about enabling the metastore for high availability.
Hive/Impala replication enables you to copy (replicate) your Hive metastore and data from one cluster to another and synchronize the Hive metastore and data set on the destination cluster with the source, based on a specified replication schedule. The destination cluster must be managed by the Cloudera Manager Server where the replication is being set up, and the source cluster can be managed by that same server or by a peer Cloudera Manager Server.
See Hive/Impala Replication for details about using Cloudera Manager to set up data replication for Hive.
Securing Hive involves configuring or enabling:
- Authentication for Hive Metastore, HiveServer2, and all Hive clients with your deployment of LDAP and Kerberos for your cluster.
- Authorization for HiveServer2 using role-based, fine-grained authorization that is implemented with Apache Sentry policies. You must configure HiveServer2 authentication before you configure authorization because Apache Sentry depends on an underlying authentication framework to reliably identify the requesting user.
- Encryption to secure the network connection between HiveServer2 and Hive clients.
Starting with CDH 5.5, encryption for HiveServer2 clients has been decoupled from the authentication mechanism. This means you can use either SASL QOP or TLS/SSL to encrypt traffic between HiveServer2 and its clients, irrespective of whether Kerberos is being used for authentication. Previously, the JDBC client drivers only supported SASL QOP encryption on Kerberos-authenticated connections.
SASL QOP encryption is better suited for encrypting RPC communication and may result in performance issues when dealing with large amounts of data. Move to using TLS/SSL encryption to avoid such issues.
This topic describes how to set up encrypted communication between HiveServer2 and its JDBC/ODBC client drivers.
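On the client side, the two encryption options usually show up as different JDBC URL parameters. The sketch below builds one URL of each kind for comparison. The host, truststore path, and Kerberos principal are hypothetical, and the server is assumed to be configured for the matching mode.

```shell
#!/bin/sh
# Hypothetical HiveServer2 endpoint and truststore path.
HS2="hs2.example.com:10000"
TRUSTSTORE="/etc/hive/conf/truststore.jks"

# TLS/SSL: the client trusts the server certificate through a truststore.
# A protected truststore usually also needs ;trustStorePassword=<password>.
TLS_URL="jdbc:hive2://${HS2}/default;ssl=true;sslTrustStore=${TRUSTSTORE}"

# SASL QOP, shown here on a Kerberos-authenticated connection with a
# hypothetical principal; auth-conf requests confidentiality (encryption).
QOP_URL="jdbc:hive2://${HS2}/default;principal=hive/hs2.example.com@EXAMPLE.COM;sasl.qop=auth-conf"

echo "TLS/SSL:  $TLS_URL"
echo "SASL QOP: $QOP_URL"
```

When passing either URL to Beeline with `-u`, keep it quoted, because the shell would otherwise treat each semicolon as a command separator.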
See Troubleshooting Hive for partitioning recommendations and troubleshooting failed Hive queries.
Using the native web interface for HiveServer2 provides access to Hive configuration settings, local logs, metrics, and information about sessions and queries. See HiveServer2 Web UI for details about accessing, configuring, and using HiveServer2 Web UI.
For additional Hive documentation, see the Apache Hive wiki.
Upstream Information for Hive
Detailed Hive documentation is available on the Apache Software Foundation site on the Hive project page. For specific areas of the Apache Hive documentation, see:
- Hive Query Language (HiveQL) Manual (for SQL syntax)
- Apache Hive wiki
- User Documentation
- Administrator Documentation
Because Cloudera does not support all Hive features, for example ACID (transactions), always check external Hive documentation against the current version and supported features of Hive included in the CDH distribution.
Hive has its own JIRA issue tracker.