Setting Up HttpFS Using the Command Line

About HttpFS

Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.

HttpFS has a REST HTTP API supporting all HDFS filesystem operations (both read and write).

Common HttpFS use cases are:

  • Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages other than Java (such as Perl), as shown in the curl sketch after this list.
  • Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCp.
  • Read and write data in HDFS in a cluster that is behind a firewall. Accessing WebHDFS directly through the NameNode web UI port (default 50070) requires access to every data host in the cluster, because WebHDFS redirects clients to the DataNode port (default 50075). If the cluster is behind a firewall, Cloudera recommends using the HttpFS server instead: it acts as a gateway and is the only system that needs to send and receive data through the firewall.
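For example, a minimal read and write with curl might look like the following. This is a sketch: the host httpfs-host, the user alice, and the file paths are placeholders, the default HttpFS port 14000 is assumed, and pseudo-authentication with the user.name parameter is assumed (that is, Kerberos is not enabled).

# Read a file through the WebHDFS-compatible REST API:
$ curl "http://httpfs-host:14000/webhdfs/v1/user/alice/README.txt?op=OPEN&user.name=alice"

# Create a file from local data; data=true tells HttpFS that the request body carries the file contents:
$ curl -X PUT -T local.txt -H "Content-Type: application/octet-stream" \
  "http://httpfs-host:14000/webhdfs/v1/user/alice/new.txt?op=CREATE&data=true&user.name=alice"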

HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication mechanisms using a plugin API. HttpFS also supports Hadoop proxy user functionality.

The webhdfs client file system implementation can be used to access HttpFS with the Hadoop filesystem command (hadoop fs), with Hadoop DistCp, and from Java applications using the Hadoop FileSystem Java API.
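For example, a sketch of both access paths, assuming an HttpFS server on httpfs-host listening on the default port 14000 (the host names and paths below are placeholders):

# List a directory through HttpFS using the webhdfs:// file system:
$ hadoop fs -ls webhdfs://httpfs-host:14000/user/alice

# Copy data from a remote cluster through its HttpFS server into the local cluster with DistCp:
$ hadoop distcp webhdfs://httpfs-host:14000/user/alice/data hdfs://namenode-host:8020/user/alice/data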

The HttpFS HTTP REST API is interoperable with the WebHDFS REST HTTP API.

For more information about HttpFS, see Hadoop HDFS over HTTP.

Configuring HttpFS

When you install HttpFS from an RPM or Debian package, the installation places the configuration, documentation, and runtime files in the standard Unix directories, as follows.

Type of File        Where Installed
Binaries            /usr/lib/hadoop-httpfs/
Configuration       /etc/hadoop-httpfs/conf/
Documentation       /usr/share/doc/packages/hadoop-httpfs/ (SLES)
                    /usr/share/doc/hadoop-httpfs/ (other platforms)
Data                /var/lib/hadoop-httpfs/
Logs                /var/log/hadoop-httpfs/
Temporary files     /var/tmp/hadoop-httpfs/
PID file            /var/run/hadoop-httpfs/

Configuring the HDFS HttpFS Will Use

HttpFS reads the HDFS configuration from the core-site.xml and hdfs-site.xml files in /etc/hadoop/conf/. If necessary, edit those files to configure the HDFS instance that HttpFS will use.
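For example, to confirm which HDFS the configuration points at, you can print fs.defaultFS with the hdfs getconf command (the host name in the output below is a placeholder):

$ hdfs getconf -confKey fs.defaultFS
hdfs://namenode-host:8020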

Configuring the HttpFS Proxy User

Edit core-site.xml and define the Linux user that will run the HttpFS server as a Hadoop proxy user. For example:

<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>

Then restart Hadoop to make the proxy user configuration active.
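For example, on a package-based installation where the NameNode runs as the hadoop-hdfs-namenode service (an assumption; adjust to however your HDFS daemons are managed), restarting the NameNode picks up the new proxy user settings:

$ sudo service hadoop-hdfs-namenode restart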

Configuring HttpFS with Kerberos Security

To configure HttpFS with Kerberos Security, see HttpFS Authentication.

Starting the HttpFS Server

After you have completed all of the required configuration steps, you can start HttpFS:

$ sudo service hadoop-httpfs start

If you see the message Server httpfs started!, status NORMAL in the httpfs.log file, the system has started successfully.
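As a quick smoke test, you can also issue a request against the REST API. The following sketch assumes the default HttpFS port 14000 and pseudo-authentication with the hdfs user; a successful call returns a small JSON document containing that user's home directory path.

$ curl "http://localhost:14000/webhdfs/v1?op=GETHOMEDIRECTORY&user.name=hdfs"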