Using CDH with Isilon Storage

EMC Isilon is a storage service with a distributed file system that can be used in place of HDFS to provide storage for CDH services.

Continue reading:

Supported Versions
Differences between Isilon HDFS and CDH HDFS
Preliminary Steps on the Isilon Service
Installing Cloudera Manager with Isilon
Installing a Secure Cluster with Isilon
Upgrading a Cluster with Isilon
Using Impala with Isilon Storage
- Required Configurations

Supported Versions

The following versions of Cloudera and Isilon products are supported:

CDH Version	Isilon OneFS Version
5.2	7.2.x releases, 7.2.0.1 and higher. Cloudera recommends 7.2.1.1.
5.3	7.2.x releases, 7.2.0.2 and higher. Cloudera recommends 7.2.1.1.
5.4	7.2.x releases, 7.2.0.3 and higher. Cloudera recommends 7.2.1.1.
5.5	7.2.x releases, 7.2.0.3 and higher. Cloudera recommends 7.2.1.1.
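
The version table can also be expressed as a small lookup. The following is a minimal sketch only; the names MIN_ONEFS, parse_version, and onefs_supported are illustrative and are not part of any Cloudera or EMC tool. It assumes OneFS versions are plain dotted integers.

```python
# Minimum OneFS version per CDH release, copied from the table above.
# All names here are hypothetical helpers, not a Cloudera or EMC API.
MIN_ONEFS = {
    "5.2": "7.2.0.1",
    "5.3": "7.2.0.2",
    "5.4": "7.2.0.3",
    "5.5": "7.2.0.3",
}
RECOMMENDED_ONEFS = "7.2.1.1"


def parse_version(text):
    """Turn a dotted version string such as '7.2.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def onefs_supported(cdh_version, onefs_version):
    """Return True if onefs_version is a 7.2.x release at or above the minimum."""
    minimum = MIN_ONEFS.get(cdh_version)
    if minimum is None:
        return False  # CDH release is not listed in the table
    onefs = parse_version(onefs_version)
    # Only 7.2.x releases are listed as supported.
    return onefs[:2] == (7, 2) and onefs >= parse_version(minimum)


print(onefs_supported("5.4", "7.2.0.2"))
print(onefs_supported("5.4", RECOMMENDED_ONEFS))
```

The first call checks a release below the CDH 5.4 minimum, and the second checks the recommended release.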

Differences between Isilon HDFS and CDH HDFS

The following features of HDFS are not implemented with Isilon OneFS:

HDFS caching
HDFS encryption
HDFS ACLs

Preliminary Steps on the Isilon Service

Before installing a Cloudera Manager cluster to use Isilon storage, perform the following steps on the Isilon OneFS system. For detailed information on setting up Isilon OneFS for Cloudera Manager, see the Isilon documentation at https://community.emc.com/docs/DOC-39529.

Create an Isilon access zone with HDFS support. Example HDFS root directory:
```
/ifs/your-access-zone/hdfs
```
Note: The above is simply an example; the HDFS root directory does not have to begin with ifs or end with hdfs.
Create two directories that will be used by all CDH services:
1. Create a tmp directory in the access zone.
  - Create supergroup group and hdfs user.
  - Create a tmp directory and set its ownership to hdfs:supergroup and its permissions to 1777. For example:
```
cd hdfs_root_directory
isi_run -z zone_id mkdir tmp
isi_run -z zone_id chown hdfs:supergroup tmp
isi_run -z zone_id chmod 1777 tmp
```
2. Create a user directory in the access zone and set its ownership to hdfs:supergroup and its permissions to 755. For example:
```
cd hdfs_root_directory
isi_run -z zone_id mkdir user
isi_run -z zone_id chown hdfs:supergroup user
isi_run -z zone_id chmod 755 user
```
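The two permission values used above differ mainly in the leading 1 of 1777, which is the sticky bit. As a hedged illustration, the sketch below uses Python's standard stat module to render both modes the way ls would show them. The describe_mode helper is a hypothetical name, and the interpretation assumes standard POSIX semantics, in which a sticky directory lets anyone create entries but restricts deleting or renaming them to their owners.

```python
import stat


def describe_mode(octal_text):
    """Render an octal permission string as an ls-style directory mode."""
    mode = int(octal_text, 8)
    return stat.filemode(stat.S_IFDIR | mode)


def has_sticky_bit(octal_text):
    """Return True if the sticky bit (the leading 1 in 1777) is set."""
    return bool(int(octal_text, 8) & stat.S_ISVTX)


print(describe_mode("1777"), has_sticky_bit("1777"))  # drwxrwxrwt True
print(describe_mode("755"), has_sticky_bit("755"))    # drwxr-xr-x False
```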
Create the service-specific users, groups, or directories for each CDH service you plan to use. Create the directories under the access zone you have created, for example, /ifs/your-access-zone/hdfs/.
Note: Many of the values provided in the examples below are default values in Cloudera Manager and must match the Cloudera Manager configuration settings. The format for the examples is: dir user:group permission.
- ZooKeeper: nothing required.
- HBase
  - Create hbase group with hbase user.
  - Create the root directory for HBase. For example:
```
hdfs_root_directory/hbase hbase:hbase 755
```
- YARN (MR2)
  - Create mapred group with mapred user.
  - Create the history directory for YARN. For example:
```
hdfs_root_directory/user/history mapred:hadoop 777
```
  - Create the remote application log directory for YARN. For example:
```
hdfs_root_directory/tmp/logs mapred:hadoop 775
```
- Oozie
  - Create oozie group with oozie user.
  - Create the user directory for Oozie. For example:
```
hdfs_root_directory/user/oozie oozie:oozie 775
```
- Flume
  - Create flume group with flume user.
  - Create the user directory for Flume. For example:
```
hdfs_root_directory/user/flume flume:flume 775
```
- Hive
  - Create hive group with hive user.
  - Create the user directory for Hive. For example:
```
hdfs_root_directory/user/hive hive:hive 775
```
  - Create the warehouse directory for Hive. For example:
```
hdfs_root_directory/user/hive/warehouse hive:hive 1777
```
  - Create a temporary directory for Hive. For example:
```
hdfs_root_directory/tmp/hive hive:supergroup 777
```
- Solr
  - Create solr group with solr user.
  - Create the data directory for Solr. For example:
```
hdfs_root_directory/solr solr:solr 775
```
- Sqoop
  - Create sqoop group with sqoop2 user.
  - Create the user directory for Sqoop. For example:
```
hdfs_root_directory/user/sqoop2 sqoop2:sqoop 775
```
- Hue
  - Create hue group with hue user.
  - Create sample group with sample user.
- Spark
  - Create spark group with spark user.
  - Create the user directory for Spark. For example:
```
hdfs_root_directory/user/spark spark:spark 751
```
  - Create the application history directory for Spark. For example:
```
hdfs_root_directory/user/spark/applicationHistory spark:spark 1777
```
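
Each example line above uses the dir user:group permission format. One way to turn such a line into commands is sketched below. It reuses the isi_run -z zone_id pattern shown earlier for the tmp and user directories. The isi_commands function and the SPECS list are hypothetical names, the script only prints commands rather than running them, and it assumes parent directories already exist because the earlier examples use plain mkdir.

```python
def isi_commands(spec, root="hdfs_root_directory", zone_id="zone_id"):
    """Translate one 'dir user:group permission' line into command strings.

    The returned commands are relative to `root`, so they are meant to be
    run after `cd root`, as in the tmp and user examples above.
    """
    path, owner, mode = spec.split()
    if not path.startswith(root + "/"):
        raise ValueError(f"{path} is not under {root}")
    relative = path[len(root) + 1:]
    run = f"isi_run -z {zone_id}"
    return [
        f"{run} mkdir {relative}",
        f"{run} chown {owner} {relative}",
        f"{run} chmod {mode} {relative}",
    ]


# A few of the example lines from the list above, parents before children.
SPECS = [
    "hdfs_root_directory/hbase hbase:hbase 755",
    "hdfs_root_directory/user/hive hive:hive 775",
    "hdfs_root_directory/user/hive/warehouse hive:hive 1777",
]

print("cd hdfs_root_directory")
for spec in SPECS:
    for command in isi_commands(spec):
        print(command)
```

Review the printed commands against your own zone ID, root directory, and Cloudera Manager settings before running any of them on the OneFS system.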

Once the users, groups, and directories are created in Isilon OneFS, you are ready to install Cloudera Manager.

Installing Cloudera Manager with Isilon

To install Cloudera Manager, follow the instructions provided in Installation.

The simplest installation procedure, suitable for development or proof of concept, is Installation Path A, which uses embedded databases that are installed as part of the Cloudera Manager installation process.
For production environments, Installation Path B - Manual Installation Using Cloudera Manager Packages describes configuring external databases for Cloudera Manager and CDH storage needs.

If you choose parcel installation on the Cluster Installation screen, the installation wizard will point to the latest parcels of CDH available.

On the installation wizard's Cluster Setup page, choose Custom Services, and choose the services you want installed in the cluster. Be sure to choose Isilon among the selected services, do not select the HDFS service, and do not check Include Cloudera Navigator at the bottom of the Cluster Setup page. Also, on the Role Assignments page, be sure to specify the hosts that will serve as gateway roles for the Isilon service. You can add gateway roles to one, some, or all nodes in the cluster.
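
The selection rules in the previous paragraph can be summarized as a short checklist. The sketch below is purely illustrative: check_cluster_setup is a hypothetical helper, not a Cloudera Manager API, and the service names are assumed to be the labels shown in the wizard.

```python
def check_cluster_setup(services, include_navigator, gateway_hosts):
    """Return a list of problems with a planned Isilon cluster setup."""
    problems = []
    if "Isilon" not in services:
        problems.append("Isilon must be among the selected services")
    if "HDFS" in services:
        problems.append("the HDFS service must not be selected")
    if include_navigator:
        problems.append("Include Cloudera Navigator must not be checked")
    if not gateway_hosts:
        problems.append("no hosts are assigned the Isilon gateway role")
    return problems


# Hypothetical plan: HDFS was selected by mistake and no gateways assigned.
for problem in check_cluster_setup({"Isilon", "HDFS", "Hive"}, False, []):
    print(problem)
```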

Installing a Secure Cluster with Isilon

To set up a secure cluster with Isilon using Kerberos, perform the following steps:

Create an unsecure Cloudera Manager cluster as described above in Installing Cloudera Manager with Isilon.
Follow the Isilon documentation to enable Kerberos for your access zone: https://community.emc.com/docs/DOC-39529. This includes adding a Kerberos authentication provider to your Isilon access zone.
Add the following proxy users in Isilon if your Cloudera Manager cluster includes the corresponding CDH services. The procedure for configuring proxy users is described in the Isilon documentation, https://community.emc.com/docs/DOC-39529.
- proxy user hdfs for hdfs user.
- proxy user mapred for mapred user.
- proxy user hive for hive user.
- proxy user impala for impala user.
- proxy user oozie for oozie user.
- proxy user flume for flume user.
- proxy user hue for hue user.
Follow the Cloudera Manager documentation for information on configuring a secure cluster with Kerberos: Configuring Authentication in Cloudera Manager.

Upgrading a Cluster with Isilon

To upgrade CDH and Cloudera Manager in a cluster that uses Isilon:

If required, upgrade OneFS to a version compatible with the version of CDH to which you are upgrading. For compatibility information, see Product Compatibility Matrix for EMC Isilon. For OneFS upgrade instructions, see the EMC Isilon documentation.
(Optional) Upgrade Cloudera Manager. See Upgrading Cloudera Manager.
The Cloudera Manager minor version must always be equal to or greater than the CDH minor version, because older versions of Cloudera Manager may not support features in newer versions of CDH. For example, if you want to upgrade to CDH 5.4.8, you must first upgrade to Cloudera Manager 5.4 or higher.
Upgrade CDH. See Upgrading CDH and Managed Services Using Cloudera Manager.
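
The version rule in the steps above can be checked mechanically. The following sketch compares only the major and minor components; comparing the major component as well is an assumption on top of the rule as stated. The function names are illustrative.

```python
def major_minor(version):
    """Return (major, minor) as ints from a version such as '5.4.8' or '5.4'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def cm_supports_cdh(cm_version, cdh_version):
    """True if the Cloudera Manager minor version is at least the CDH minor version."""
    return major_minor(cm_version) >= major_minor(cdh_version)


# The example from the text: CDH 5.4.8 needs Cloudera Manager 5.4 or higher.
print(cm_supports_cdh("5.4.0", "5.4.8"))  # True
print(cm_supports_cdh("5.3.9", "5.4.8"))  # False
```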

Using Impala with Isilon Storage

You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. This capability allows convenient query access to a storage system where you might already be managing large volumes of data. The combination of the Impala query engine and Isilon storage is certified on CDH 5.4.4 through CDH 5.15.

Because the EMC Isilon storage devices use a global value for the block size rather than a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Isilon storage. Use the isi command to set the default block size globally on the Isilon device. For example, to set the Isilon default block size to 256 MB, the recommended size for Parquet data files for Impala, issue the following command:

isi hdfs settings modify --default-block-size=256MB

The typical use case for Impala and Isilon together is to use Isilon for the default filesystem, replacing HDFS entirely. In this configuration, when you create a database, table, or partition, the data always resides on Isilon storage and you do not need to specify any special LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a path within the Isilon filesystem. For example:

-- If the default filesystem is Isilon, all Impala data resides there
-- and all Impala databases and tables are located there.
CREATE TABLE t1 (x INT, s STRING);

-- You can specify LOCATION for database, table, or partition,
-- using values from the Isilon filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);

Impala can write to, delete, and rename data files and database, table, and partition directories on Isilon storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Isilon storage as with HDFS.

When the Impala spill-to-disk feature is activated by a query that approaches the memory limit, Impala writes all the temporary data to a local (not Isilon) storage device. Because the I/O bandwidth for the temporary data depends on the number of local disks, and clusters using Isilon storage might not have as many local disks attached, pay special attention on Isilon-enabled clusters to any queries that use the spill-to-disk feature. Where practical, tune the queries or allocate extra memory for Impala to avoid spilling. Although you can specify an Isilon storage device as the destination for the temporary data for the spill-to-disk feature, that configuration is not recommended due to the need to transfer the data both ways using remote I/O.

When tuning Impala queries on HDFS, you typically try to avoid any remote reads. When the data resides on Isilon storage, all the I/O consists of remote reads. Do not be alarmed when you see non-zero numbers for remote read measurements in query profile output. The benefit of the Impala and Isilon integration is primarily convenience of not having to move or copy large volumes of data to HDFS, rather than raw query performance. You can increase the performance of Impala I/O for Isilon systems by increasing the value for the num_remote_hdfs_io_threads configuration parameter, in the Cloudera Manager user interface for clusters using Cloudera Manager, or through the --num_remote_hdfs_io_threads startup option for the impalad daemon on clusters not using Cloudera Manager.

For information about managing Isilon storage devices through Cloudera Manager, see Using CDH with Isilon Storage.

Required Configurations

Specify the following configurations in Cloudera Manager on the Clusters > Isilon Service > Configuration tab:

In the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties for the Isilon service, set the value of the dfs.client.file-block-storage-locations.timeout.millis property to 10000.
In the Isilon Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property for the Isilon service, set the value of the hadoop.security.token.service.use_ip property to FALSE.
If you see errors that reference the .Trash directory, make sure that the Use Trash property is selected.
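
Hadoop configuration files such as hdfs-site.xml and core-site.xml store each setting as a property element with name and value children. As a rough sketch, the snippet below generates that XML for the two properties named above, so it can be pasted into the corresponding safety valve field. The property_xml helper is a hypothetical name, and the lowercase spelling of false is an assumption; the text above gives the value as FALSE.

```python
import xml.etree.ElementTree as ET


def property_xml(name, value):
    """Build one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Property names and values are taken from the list above.
print(property_xml("dfs.client.file-block-storage-locations.timeout.millis", "10000"))
# Assumption: lowercase "false" is used here in place of FALSE.
print(property_xml("hadoop.security.token.service.use_ip", "false"))
```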
