This is the documentation for CDH 4.7.1.
Documentation for other versions is available at Cloudera Documentation.

Configuring the Hive Metastore

The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the metastore service API. The subsections that follow discuss the deployment options and provide instructions for setting up a database in a recommended configuration.

Metastore Deployment Modes

  Note:

HiveServer in the discussion that follows refers to HiveServer1 or HiveServer2, whichever you are using.

Embedded Mode

Cloudera recommends using this mode for experimental purposes only.

This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process. This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use.

Local Mode



In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC.

Remote Mode

Cloudera recommends that you use this mode.

In this mode the Hive metastore service runs in its own JVM process; HiveServer2, HCatalog, Cloudera Impala™, and other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property). The database, the HiveServer process, and the metastore service can all be on the same host, but running the HiveServer process on a separate host provides better availability and scalability.

The main advantage of Remote mode over Local mode is that Remote mode does not require the administrator to share JDBC login information for the metastore database with each Hive user. HCatalogrequires this mode.

Supported Metastore Databases

See the CDH4 Requirements and Supported Versions page for up-to-date information on supported databases. Cloudera strongly encourages you to use MySQL because it is the most popular with the rest of the Hive user community, and so receives more testing than the other options.

Configuring the Metastore Database

This section describes how to configure Hive to use a remote database, with examples for MySQL and PostgreSQL.

The configuration properties for the Hive metastore are documented on the Hive Metastore documentation page, which also includes a pointer to the E/R diagram for the Hive metastore.

  Note:

For information about additional configuration that may be needed in a secure cluster, see Hive Security Configuration.

Configuring a remote MySQL database for the Hive Metastore

Cloudera recommends you configure a database for the metastore on one or more remote servers (that is, on a host or hosts separate from the HiveServer1 or HiveServer2 process). MySQL is the most popular database to use. Proceed as follows.

Step 1: Install and start MySQL if you have not already done so

To install MySQL on a Red Hat system:

$ sudo yum install mysql-server

To install MySQL on a SLES system:

$ sudo zypper install mysql
$ sudo zypper install libmysqlclient_r15

To install MySQL on an Debian/Ubuntu system:

$ sudo apt-get install mysql-server

After using the command to install MySQL, you may need to respond to prompts to confirm that you do want to complete the installation. After installation completes, start the mysql daemon.

On Red Hat systems

$ sudo service mysqld start

On SLES and Debian/Ubuntu systems

$ sudo service mysql start

Step 2: Configure the MySQL Service and Connector

Before you can run the Hive metastore with a remote MySQL database, you must configure a connector to the remote MySQL database, set up the initial database schema, and configure the MySQL user account for the Hive user.

To install the MySQL connector on a Red Hat 6 system:

Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo yum install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

To install the MySQL connector on a Red Hat 5 system:

Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html. You will need to sign up for an account if you don't already have one, and log in, before you can download it. Then copy it to the /usr/lib/hive/lib/ directory. For example:

$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /usr/lib/hive/lib/
  Note: At the time of publication, version was 5.1.25, but the version may have changed by the time you read this.

To install the MySQL connector on a SLES system:

Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo zypper install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

To install the MySQL connector on a Debian/Ubuntu system:

Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

  Note: For Ubuntu Precise systems, the name of the connector JAR file is mysql.jar. Modify the second command given below accordingly.
$ sudo apt-get install libmysql-java
$ ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar

Configure MySQL to use a strong password and to start at boot. Note that in the following procedure, your current root password is blank. Press the Enter key when you're prompted for the root password.

To set the MySQL root password:

$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!

To make sure the MySQL server starts at boot:

  • On Red Hat systems:
$ sudo /sbin/chkconfig mysqld on
$ sudo /sbin/chkconfig --list mysqld
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off
  • On SLES systems:
$ sudo chkconfig --add mysql
  • On Debian/Ubuntu systems:
$ sudo chkconfig mysql on

Step 3. Create the Database and User

The instructions in this section assume you are using Remote mode, and that the MySQL database is installed on a separate host from the metastore service, which is running on a host named metastorehost in the example.

  Note:

If the metastore service will run on the host where the database is installed, replace 'metastorehost' in the CREATE USER example with 'localhost'. Similarly, the value of javax.jdo.option.ConnectionURL in /etc/hive/conf/hive-site.xml (discussed in the next step) must be jdbc:mysql://localhost/metastore. For more information on adding MySQL users, see http://dev.mysql.com/doc/refman/5.5/en/adding-users.html.

Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.

Example

$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;

You also need a MySQL user account for Hive to use to access the metastore. It is very important to prevent this user account from creating or altering tables in the metastore database schema.

  Important:

If you fail to restrict the ability of the metastore MySQL user account to create and alter tables, it is possible that users will inadvertently corrupt the metastore schema when they use older or newer versions of Hive.

Example

mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
...
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

Step 4: Configure the Metastore Service to Communicate with the MySQL Database

This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Given a MySQL database running on myhost and the user account hive with the password mypassword, set the configuration as follows (overwriting any existing values).

  Note:

The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>datanucleus.autoStartMechanism</name> 
  <value>SchemaTable</value>
</property> 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

Configuring a remote PostgreSQL database for the Hive Metastore

Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a connector to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user.

Step 1: Install and start PostgreSQL if you have not already done so

To install PostgreSQL on a Red Hat system:

$ sudo yum install postgresql-server

To install PostgreSQL on a SLES system:

$ sudo zypper install postgresql-server

To install PostgreSQL on an Debian/Ubuntu system:

$ sudo apt-get install postgresql

After using the command to install PostgreSQL, you may need to respond to prompts to confirm that you do want to complete the installation. In order to finish installation on Red Hat compatible systems, you need to initialize the database. Please note that this operation is not needed on Ubuntu and SLES systems as it's done automatically on first start:

To initialize database files on Red Hat compatible systems

$ sudo service postgresql initdb

To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional configuration.

First you need to edit the postgresql.conf file. Set the listen_addresses property to *, to make sure that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the standard_conforming_strings property is set to off.

You can check that you have the correct values as follows:

On Red-Hat-compatible systems:

$ sudo cat /var/lib/pgsql/data/postgresql.conf  | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off

On SLES systems:

$ sudo cat /var/lib/pgsql/data/postgresql.conf  | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off

On Ubuntu and Debian systems:

$ cat /etc/postgresql/9.1/main/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off

You also need to configure authentication for your network in pg_hba.conf. You need to make sure that the PostgreSQL user that you will create in the next step will have access to the server from a remote host. To do this, add a new line into pg_hba.con that has the following information:

host    <database>         <user>         <network address>         <mask>               md5

The following example allows all users to connect from all hosts to all your databases:

host    all         all         0.0.0.0         0.0.0.0               md5
  Note:

This configuration is applicable only for a network listener. Using this configuration won't open all your databases to the entire world; the user must still supply a password to authenticate himself, and privilege restrictions configured in PostgreSQL will still be applied.

After completing the installation and configuration, you can start the database server:

Start PostgreSQL Server

$ sudo service postgresql start

Use chkconfig utility to ensure that your PostgreSQL server will start at a boot time. For example:

chkconfig postgresql on

You can use the chkconfig utility to verify that PostgreSQL server will be started at boot time, for example:

chkconfig --list postgresql

Step 2: Install the Postgres JDBC Driver

Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a JDBC driver to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user.

To install the PostgreSQL JDBC Driver on a Red Hat 6 system:

Install postgresql-jdbc package and create symbolic link to the /usr/lib/hive/lib/ directory. For example:

$ sudo yum install postgresql-jdbc
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar

To install the PostgreSQL connector on a Red Hat 5 system:

You need to manually download the PostgreSQL connector from http://jdbc.postgresql.org/download.html and move it to the /usr/lib/hive/lib/ directory. For example:

$ wget http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar
$ mv postgresql-9.2-1002.jdbc4.jar /usr/lib/hive/lib/
  Note:

You may need to use a different version if you have a different version of Postgres. You can check the version as follows:

$ sudo rpm -qa | grep postgres

To install the PostgreSQL JDBC Driver on a SLES system:

Install postgresql-jdbc and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo zypper install postgresql-jdbc
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar

To install the PostgreSQL JDBC Driver on a Debian/Ubuntu system:

Install libpostgresql-jdbc-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo apt-get install libpostgresql-jdbc-java
$ ln -s /usr/share/java/postgresql-jdbc4.jar /usr/lib/hive/lib/postgresql-jdbc4.jar

Step 3: Create the metastore database and user account

Proceed as in the following example:

$ sudo –u postgres psql
postgres=# CREATE USER hiveuser WITH PASSWORD 'mypassword';
postgres=# CREATE DATABASE metastore;
postgres=# \c metastore;
You are now connected to database 'metastore'.
postgres=# \i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.10.0.postgres.sql
SET
SET
...

Now you need to grant permission for all metastore tables to user hiveuser. PostgreSQL does not have statements to grant the permissions for all tables at once; you'll need to grant the permissions one table at a time. You could automate the task with the following SQL script:

  Note:

If you are running these commands interactively and are still in the Postgres session initiated at the beginning of this step, you do not need to repeat sudo -u postgres psql.

bash# sudo –u postgres psql
metastore=# \pset tuples_only on
metastore=# \o /tmp/grant-privs
metastore=#   SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "'  || schemaname || '". "' ||tablename ||'" TO hiveuser ;'
metastore-#   FROM pg_tables
metastore-#   WHERE tableowner = CURRENT_USER and schemaname = ' public';
metastore=# \o
metastore=# \pset tuples_only off
metastore=# \i /tmp/grant-privs

Step 4: Configure the Metastore Service to Communicate with the PostgreSQL Database

This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the PostgreSQL database. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Given a PostgreSQL database running on host myhost under the user account hive with the password mypassword, you would set configuration properties as follows.

  Note:
  • The instructions in this section assume you are using Remote mode, and that the PostgreSQL database is installed on a separate host from the metastore server.
  • The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://myhost/metastore</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
 
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>

Step 6: Test connectivity to the metastore:

$ hive –e “show tables;”
  Note:

This will take a while the first time.

Configuring a remote Oracle database for the Hive Metastore

Before you can run the Hive metastore with a remote Oracle database, you must configure a connector to the remote Oracle database, set up the initial database schema, and configure the Oracle user account for the Hive user.

Step 1: Install and start Oracle

The Oracle database is not part of any Linux distribution and must be purchased, downloaded and installed separately. You can use the Express edition, which can be downloaded free from Oracle website.

Step 2: Install the Oracle JDBC Driver

You must download the Oracle JDBC Driver from the Oracle website and put the file ojdbc6.jar into /usr/lib/hive/lib/ directory. The driver is available for download here.
  Note:

These URLs were correct at the time of publication, but the Oracle site is restructured frequently.

$ sudo mv ojdbc6.jar /usr/lib/hive/lib/

Step 3: Create the Metastore database and user account

Connect to your Oracle database as an administrator and create the user that will use the Hive metastore.

$ sqlplus "sys as sysdba"
SQL> create user hiveuser identified by mypassword;
SQL> grant connect to hiveuser;
SQL> grant all privileges to hiveuser;

Connect as the newly created hiveuser user and load the initial schema:

$ sqlplus hiveuser
SQL> @/usr/lib/hive/scripts/metastore/upgrade/oracle/hive-schema-0.10.0.oracle.sql

Connect back as an administrator and remove the power privileges from user hiveuser. Then grant limited access to all the tables:

$ sqlplus "sys as sysdba"
SQL> revoke all privileges from hiveuser;
SQL> BEGIN
  2     FOR R IN (SELECT owner, table_name FROM all_tables WHERE owner='HIVEUSER') LOOP
  3        EXECUTE IMMEDIATE 'grant  SELECT,INSERT,UPDATE,DELETE on '||R.owner||'.'||R.table_name||' to hiveuser';
  4     END LOOP;
  5  END;
  6  
  7  /

Step 4: Configure the Metastore Service to Communicate with the Oracle Database

This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the Oracle database, and provides sample settings. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host.

Example

Given an Oracle database running on myhost and the user account hiveuser with the password mypassword, set the configuration as follows (overwriting any existing values):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:oracle:thin:@//myhost/xe</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>oracle.jdbc.OracleDriver</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
 
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
 
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
 
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>