Configuring CDH Components for Hue

To enable communication between the Hue Server and CDH components, you must make minor changes to your CDH installation by adding the properties described in this section to your CDH configuration files in /etc/hadoop-0.20/conf/ or /etc/hadoop/conf/. If you are installing on a cluster, make these configuration changes on each node in your cluster.

WebHDFS or HttpFS Configuration

Hue can use either of the following to access HDFS data:

  • WebHDFS provides high-speed data transfer with good locality because clients talk directly to the DataNodes inside the Hadoop cluster.
  • HttpFS is a proxy service appropriate for integration with external systems that are not behind the cluster's firewall.

Both WebHDFS and HttpFS use the HTTP REST API so they are fully interoperable, but Hue must be configured to use one or the other. For HDFS HA deployments, you must use HttpFS.
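
Because both services expose the same REST path, a client only needs a different host and port to switch between them. The following Python sketch illustrates this by building request URLs. It assumes simple (non-Kerberos) authentication; the function name build_rest_url and the user alice are illustrative, and the ports are the ones used in the hue.ini examples in this section.

```python
from urllib.parse import urlencode


def build_rest_url(base_url, hdfs_path, op, user, doas=None):
    """Build a WebHDFS-style REST URL.

    base_url is the kind of value used for webhdfs_url, for example
    http://FQDN:50070/webhdfs/v1/ (WebHDFS) or
    http://FQDN:14000/webhdfs/v1/ (HttpFS).
    """
    # user.name identifies the caller; doas names the user being impersonated.
    params = {"op": op, "user.name": user}
    if doas is not None:
        params["doas"] = doas
    # Join base and path with exactly one slash between them.
    return base_url.rstrip("/") + "/" + hdfs_path.lstrip("/") + "?" + urlencode(params)


# Same call shape for both services; only the base URL changes.
print(build_rest_url("http://FQDN:50070/webhdfs/v1/", "/user/alice", "LISTSTATUS", "hue", doas="alice"))
print(build_rest_url("http://FQDN:14000/webhdfs/v1/", "/user/alice", "LISTSTATUS", "hue", doas="alice"))
```

The doas parameter is what relies on the proxy user settings described in the steps that follow: the hue user makes the request on behalf of another user.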

To configure Hue to use either WebHDFS or HttpFS, complete the following steps:

  1. For WebHDFS only:
    1. Add the following property in hdfs-site.xml to enable WebHDFS in the NameNode and DataNodes:
      <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
      </property>
    2. Restart your HDFS cluster.
  2. Configure Hue as a proxy user for all other users and groups, meaning it may submit a request on behalf of any other user:

    WebHDFS. Add to core-site.xml:

    <!-- Hue WebHDFS proxy user setting -->
    <property>
      <name>hadoop.proxyuser.hue.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hue.groups</name>
      <value>*</value>
    </property>

    HttpFS. Verify that /etc/hadoop-httpfs/conf/httpfs-site.xml has the following configuration:

    <!-- Hue HttpFS proxy user setting -->
    <property>
      <name>httpfs.proxyuser.hue.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>httpfs.proxyuser.hue.groups</name>
      <value>*</value>
    </property>
    If the configuration is not present, add it to /etc/hadoop-httpfs/conf/httpfs-site.xml and restart the HttpFS daemon.
  3. Verify that core-site.xml has the following configuration:
    <property>
      <name>hadoop.proxyuser.httpfs.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.httpfs.groups</name>
      <value>*</value>
    </property>
    If the configuration is not present, add it to /etc/hadoop/conf/core-site.xml and restart Hadoop.
  4. With root privileges, update hadoop.hdfs_clusters.default.webhdfs_url in hue.ini to point to the address of either WebHDFS or HttpFS.
    [hadoop]
    [[hdfs_clusters]]
    [[[default]]]
    # Use WebHdfs/HttpFs as the communication mechanism.

    WebHDFS:

    ...
    webhdfs_url=http://FQDN:50070/webhdfs/v1/

    HttpFS:

    ...
    webhdfs_url=http://FQDN:14000/webhdfs/v1/
    Note: If webhdfs_url is uncommented and explicitly set to an empty value, Hue falls back to the Thrift plugin used in Hue 1.x. This is not recommended.
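
The proxy user checks in steps 2 and 3 can be done mechanically. The following Python sketch parses Hadoop-style configuration XML and reports which proxy user properties are missing. The function names and the SAMPLE text are illustrative only; for a real file you would typically load it with xml.etree.ElementTree.parse instead of from a string.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_proxy_settings(xml_text, proxy_user, prefix="hadoop.proxyuser"):
    """List the <prefix>.<user>.hosts and .groups keys absent from the XML."""
    props = read_properties(xml_text)
    wanted = [f"{prefix}.{proxy_user}.{kind}" for kind in ("hosts", "groups")]
    return [key for key in wanted if key not in props]


# Made-up sample: the hosts property is present, the groups property is not.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
  </property>
</configuration>"""

print(missing_proxy_settings(SAMPLE, "hue"))
# For httpfs-site.xml, pass prefix="httpfs.proxyuser" instead.
```

An empty result means both properties are present; it does not confirm that the daemons were restarted after the change.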

MRv1 Configuration

Hue communicates with the JobTracker via the Hue plugin, which is a .jar file that you place in your MapReduce lib directory.

If your JobTracker and Hue Server are located on the same host, copy the plugin .jar file into your MapReduce lib directory as shown below. If you are currently using CDH3, your MapReduce lib directory might be /usr/lib/hadoop/lib.

$ cd /usr/share/hue
$ cp desktop/libs/hadoop/java-lib/hue-plugins-*.jar /usr/lib/hadoop-0.20-mapreduce/lib

If your JobTracker runs on a different host, scp the Hue plugins .jar file to the JobTracker host.

Add the following properties to mapred-site.xml:

<property>
  <name>jobtracker.thrift.address</name>
  <value>0.0.0.0:9290</value>
</property>
<property>
  <name>mapred.jobtracker.plugins</name>
  <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
  <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
</property>

You can confirm that the plugins are running correctly by tailing the daemon logs:

$ tail --lines=500 /var/log/hadoop-0.20-mapreduce/hadoop*jobtracker*.log | grep ThriftPlugin
2009-09-28 16:30:44,337 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Starting Thrift server
2009-09-28 16:30:44,419 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Thrift server listening on 0.0.0.0:9290
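
If you prefer to automate this check, the same information can be extracted from the log text. The following Python sketch looks for the "listening on" message shown above and returns the address it reports. The function name is illustrative, and the pattern is based only on the sample log lines in this section, so other Hue or Hadoop versions may log differently.

```python
import re

# Matches the ThriftPluginServer "listening" message; \s* tolerates the
# message being wrapped onto the next line.
LISTENING = re.compile(r"ThriftPluginServer:\s*Thrift server listening on (\S+):(\d+)")


def find_thrift_endpoint(log_text):
    """Return (host, port) from the last 'listening on' message, or None."""
    matches = LISTENING.findall(log_text)
    if not matches:
        return None
    host, port = matches[-1]
    return host, int(port)


sample_log = (
    "2009-09-28 16:30:44,337 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Starting Thrift server\n"
    "2009-09-28 16:30:44,419 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: "
    "Thrift server listening on 0.0.0.0:9290\n"
)
print(find_thrift_endpoint(sample_log))
```

The reported port should match the value of jobtracker.thrift.address in mapred-site.xml.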

Oozie Configuration

To run DistCp, Streaming, Pig, Sqoop, and Hive jobs in Job Designer or the Oozie Editor/Dashboard application, make sure the Oozie shared libraries are installed for the correct version of MapReduce (MRv1 or YARN). See Installing the Oozie ShareLib in Hadoop HDFS for instructions.

Hive Configuration

The Beeswax application helps you use Hive to query your data and depends on a Hive installation on your system. The Cloudera Impala application also depends on Hive.
  Note: When Beeswax uses Hive with the embedded metastore, which is the default with Hue, the metastore database must either be owned by the hue user (recommended):
sudo chown -R hue:hue /var/lib/hive/metastore/metastore_db
  or be writable by everyone:
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
  Otherwise, Beeswax will not start, and the Hue Beeswax application shows the error 'Exception communicating with Hive Metastore Server at localhost:8003'.

Permissions

See File System Permissions in the Hive Installation section.

No Existing Hive Installation

Familiarize yourself with the configuration options in hive-site.xml. See Hive Installation. Having a hive-site.xml is optional but often useful, particularly when setting up a metastore. You can instruct Beeswax to locate it using the hive_conf_dir configuration variable.

Existing Hive Installation

In the Hue configuration file hue.ini, modify hive_conf_dir to point to the directory containing hive-site.xml.
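
Before restarting Hue, it can help to confirm that the directory you set really contains hive-site.xml. The following Python sketch does that; the function name has_hive_site is illustrative, and the demonstration uses a temporary directory rather than a real Hive configuration directory.

```python
import os
import tempfile


def has_hive_site(hive_conf_dir):
    """Return True if hive_conf_dir contains a hive-site.xml file."""
    return os.path.isfile(os.path.join(hive_conf_dir, "hive-site.xml"))


# Demonstration with a throwaway directory standing in for hive_conf_dir.
with tempfile.TemporaryDirectory() as conf_dir:
    before = has_hive_site(conf_dir)
    with open(os.path.join(conf_dir, "hive-site.xml"), "w") as handle:
        handle.write("<configuration/>\n")
    after = has_hive_site(conf_dir)

print(before, after)
```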

Other Hadoop Settings

HADOOP_CLASSPATH

If you are setting $HADOOP_CLASSPATH in your hadoop-env.sh, be sure to set it in such a way that user-specified options are preserved. For example:

Correct:

# HADOOP_CLASSPATH=<your_additions>:$HADOOP_CLASSPATH

Incorrect:

# HADOOP_CLASSPATH=<your_additions>

This enables certain components of Hue to add to Hadoop's classpath using the environment variable.
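
The difference between the two forms is that the correct one keeps whatever value was already present. The following Python sketch models that behavior outside the shell; the function name and the example paths are hypothetical.

```python
def prepend_to_classpath(additions, existing, sep=":"):
    """Put new entries in front of the existing value, skipping empty parts."""
    parts = [part for part in (additions, existing) if part]
    return sep.join(parts)


# Hypothetical entries: one added in hadoop-env.sh, one already set by Hue.
print(prepend_to_classpath("/opt/extra/lib/*", "/usr/share/hue/libs/*"))
print(prepend_to_classpath("/opt/extra/lib/*", ""))
```

Skipping empty parts avoids a stray separator when the variable was not set beforehand.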

hadoop.tmp.dir

If your users are likely to submit jobs both through Hue and from the command line on the same machine, the jobs run as the hue user when submitted through Hue and as the user's own account when submitted from the command line. This leads to contention on the directory specified by hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}. Specifically, hadoop.tmp.dir is used to unpack JARs in /usr/lib/hadoop. One workaround is to set hadoop.tmp.dir to /tmp/hadoop-${user.name}-${hue.suffix} in the core-site.xml file:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}-${hue.suffix}</value>
</property>

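Unfortunately, when the hue.suffix variable is unset, the placeholder is not expanded, and you end up with directories in /tmp whose names contain the literal text ${hue.suffix}, such as /tmp/hadoop-user.name-${hue.suffix}, where user.name stands for the actual user name. Despite that, Hue will still work.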
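
The naming behavior can be illustrated with a simplified model of variable expansion in which unknown variables are left untouched. The following Python sketch is only a simulation, not Hadoop's own implementation; the user name alice and the suffix value hue are made-up examples.

```python
import re

# Matches ${name} placeholders.
VAR = re.compile(r"\$\{([^}]+)\}")


def expand(value, known):
    """Replace ${name} with known[name]; leave unknown variables as they are."""
    return VAR.sub(lambda match: known.get(match.group(1), match.group(0)), value)


template = "/tmp/hadoop-${user.name}-${hue.suffix}"

# hue.suffix unset: the placeholder survives in the directory name.
print(expand(template, {"user.name": "alice"}))
# hue.suffix set (hypothetical value): both placeholders are replaced.
print(expand(template, {"user.name": "hue", "hue.suffix": "hue"}))
```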

  Important:

The Beeswax Server writes into a local directory on the Hue machine that is specified by hadoop.tmp.dir to unpack its jars. That directory needs to be writable by the hue user, which is the default user who starts Beeswax Server, or else Beeswax Server will not start. You may also make that directory world-writable.
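
A quick way to check this requirement is to test the directory while running as the hue user. The following Python sketch shows one such check; the function name is illustrative, and the demonstration uses a temporary directory in place of the real hadoop.tmp.dir location.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if path is a directory the current user can write into."""
    # os.access evaluates permissions for the user running this process,
    # so run the check as the hue user to get a meaningful answer.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


with tempfile.TemporaryDirectory() as scratch:
    existing_ok = is_writable_dir(scratch)
    missing_ok = is_writable_dir(os.path.join(scratch, "missing"))

print(existing_ok, missing_ok)
```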