This is the documentation for Cloudera 5.4.x. Documentation for other versions is available at Cloudera Documentation.

Installing Hive

Install the Hive packages you need using the command for your distribution.

OS                  Command
RHEL-compatible     $ sudo yum install <pkg1> <pkg2> ...
SLES                $ sudo zypper install <pkg1> <pkg2> ...
Ubuntu or Debian    $ sudo apt-get install <pkg1> <pkg2> ...

The packages are:

  • hive – base package that provides the complete language and runtime
  • hive-metastore – provides scripts for running the metastore as a standalone service (optional)
  • hive-server2 – provides scripts for running HiveServer2
  • hive-hbase – install this package if you want to use Hive with HBase (optional)
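As an illustration, an install that includes the standalone metastore and HiveServer2 but not the HBase integration could be assembled as follows. The helper name hive_install_cmd is hypothetical and the package selection is an assumption; the helper only prints the command rather than running it, and the package names and package managers are the ones listed above.

```shell
# Hypothetical helper: print (not run) the install command for a package manager.
# Assumed package selection: base package, metastore, and HiveServer2.
hive_install_cmd() {
  pkgs="hive hive-metastore hive-server2"
  case "$1" in
    yum|zypper|apt-get) echo "sudo $1 install $pkgs" ;;
    *) echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

hive_install_cmd yum   # RHEL-compatible
```

Add hive-hbase to the package list only if you plan to use Hive with HBase.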

Configuring Heap Size and Garbage Collection for Hive Components

HiveServer2 and the Hive metastore require sufficient memory in order to run correctly. The default heap size of 256 MB for each component is inadequate for production workloads. Consider the following guidelines for sizing the heap for each component, based upon your cluster size.
Table 1. Hive Heap Size Recommendations

Cluster Size          HiveServer2 Heap Size   Hive Metastore Heap Size
100 nodes or larger   24 GB                   24 GB
50-99 nodes           12 GB                   12 GB
11-49 nodes           6 GB                    6 GB
2-10 nodes            2 GB                    2 GB
1 node                256 MB                  256 MB
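
The recommendations in Table 1 can be expressed as a small lookup, which may be handy in provisioning scripts. This is a minimal sketch: the function name recommended_hive_heap_mb is hypothetical, and it assumes 1 GB = 1024 MB so that the result can be used directly in an -Xmx<n>m setting.

```shell
# Hypothetical helper: map a cluster node count to the heap size (in MB)
# recommended in Table 1 for HiveServer2 and the Hive metastore.
recommended_hive_heap_mb() {
  nodes=$1
  if [ "$nodes" -ge 100 ]; then echo 24576    # 24 GB
  elif [ "$nodes" -ge 50 ]; then echo 12288   # 12 GB
  elif [ "$nodes" -ge 11 ]; then echo 6144    # 6 GB
  elif [ "$nodes" -ge 2 ]; then echo 2048     # 2 GB
  else echo 256                               # single node
  fi
}

recommended_hive_heap_mb 60   # prints 12288
```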

In addition, the Beeline CLI should use a heap size of at least 2 GB.

To configure the heap size for HiveServer2 and Hive metastore, use the hive-env.sh advanced configuration snippet if you use Cloudera Manager, or edit /etc/hive/hive-env.sh otherwise, and set the -Xmx parameter in the HADOOP_OPTS variable to the desired maximum heap size.

To configure the heap size for the Beeline CLI, use the hive-env.sh advanced configuration snippet if you use Cloudera Manager, or edit /etc/hive/hive-env.sh otherwise, and set the HADOOP_HEAPSIZE environment variable before starting the Beeline CLI.

The following example shows a configuration with these settings:
  • HiveServer2 uses 12 GB heap
  • Hive metastore uses 12 GB heap
  • Hive clients use 2 GB heap
The settings to change are the -Xmx value in HADOOP_OPTS and the HADOOP_HEAPSIZE value. All of these lines are commented out (prefixed with a # character) by default. Uncomment the lines by removing the # character.
if [ "$SERVICE" = "cli" ]; then
  if [ -z "$DEBUG" ]; then
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
  else
    export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
  fi
fi

export HADOOP_HEAPSIZE=2048
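
Because the example appends to whatever HADOOP_OPTS already contains, it can be useful to confirm which -Xmx value ends up in the final string. The following is a minimal sketch, not part of hive-env.sh: the function name effective_xmx is hypothetical, and it assumes that when -Xmx appears more than once the JVM honors the last occurrence, which is the usual HotSpot behavior.

```shell
# Hypothetical helper: report the last -Xmx token in a JVM options string.
effective_xmx() {
  # Split on spaces, keep only -Xmx tokens, and print the final one.
  printf '%s\n' "$1" | tr ' ' '\n' | grep '^-Xmx' | tail -n 1
}

opts="-Xmx256m -XX:NewRatio=12 -Xmx12288m -Xms10m"
effective_xmx "$opts"   # prints -Xmx12288m
```

Calling effective_xmx "$HADOOP_OPTS" after sourcing the edited file shows the value that would apply under that assumption; the function prints nothing if no -Xmx setting is present.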

You can choose between the Concurrent Collector and the New Parallel Collector for garbage collection by passing -XX:+UseConcMarkSweepGC or -XX:+UseParNewGC, respectively, in the HADOOP_OPTS lines above. Note that JVM option names are case-sensitive. The -XX:-UseGCOverheadLimit setting disables the garbage collection overhead limit. To enable the limit, remove the setting or change it to -XX:+UseGCOverheadLimit.
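
For instance, a variant of the non-debug HADOOP_OPTS line that switches to the Concurrent Collector and leaves the overhead limit enabled might look like the following. This is a sketch only: the heap values are carried over from the example above, and the combination is an assumption that has not been validated for any particular workload.

```shell
# Sketch: Concurrent Collector instead of the New Parallel Collector,
# with the garbage collection overhead limit explicitly enabled.
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseConcMarkSweepGC -XX:+UseGCOverheadLimit"
```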

Configuration for WebHCat

If you want to use WebHCat, you need to set the PYTHON_CMD variable in /etc/default/hive-webhcat-server after installing Hive; for example:
export PYTHON_CMD=/usr/bin/python