Using CDH with Isilon Storage

Dell EMC Isilon is a storage service with a distributed filesystem that can used in place of HDFS to provide storage for CDH services.

Continue reading:

Supported Versions
Differences Between Isilon HDFS and CDH HDFS
Installing Cloudera Manager and CDH with Isilon
Upgrading a Cluster with Isilon
Isilon Storage
- Required Configurations
Configuring Replication with Kerberos and Isilon

Supported Versions

For Cloudera and Isilon compatibility information, see the product compatibility matrix for Product Compatibility for Dell EMC Isilon.

Differences Between Isilon HDFS and CDH HDFS

The following features of HDFS are not implemented with Isilon OneFS:

HDFS caching
HDFS encryption
HDFS ACLs

Installing Cloudera Manager and CDH with Isilon

For instructions on configuring Isilon and installing Cloudera Manager and CDH with Isilon, see the following EMC documentation:

Upgrading a Cluster with Isilon

To upgrade CDH and Cloudera Manager in a cluster that uses Isilon:

If required, upgrade OneFS to a version compatible with the version of CDH to which you are upgrading. See the product compatibility matrix for Product Compatibility for Dell EMC Isilon. For OneFS upgrade instructions, see the EMC Isilon documentation.
(Optional) Upgrade Cloudera Manager. See Upgrading Cloudera Manager.
Upgrade CDH. See Upgrading the CDH Cluster.

Using Impala with Isilon Storage

You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. This capability allows convenient query access to a storage system where you might already be managing large volumes of data. The combination of the Impala query engine and Isilon storage is certified on CDH 5.4.4 or higher.

Because the EMC Isilon storage devices use a global value for the block size rather than a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Isilon storage. Use the isi command to set the default block size globally on the Isilon device. For example, to set the Isilon default block size to 256 MB, the recommended size for Parquet data files for Impala, issue the following command:

isi hdfs settings modify --default-block-size=256MB

The typical use case for Impala and Isilon together is to use Isilon for the default filesystem, replacing HDFS entirely. In this configuration, when you create a database, table, or partition, the data always resides on Isilon storage and you do not need to specify any special LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a path within the Isilon filesystem. For example:

-- If the default filesystem is Isilon, all Impala data resides there
-- and all Impala databases and tables are located there.
CREATE TABLE t1 (x INT, s STRING);

-- You can specify LOCATION for database, table, or partition,
-- using values from the Isilon filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);

Impala can write to, delete, and rename data files and database, table, and partition directories on Isilon storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Isilon storage as with HDFS.

When the Impala spill-to-disk feature is activated by a query that approaches the memory limit, Impala writes all the temporary data to a local (not Isilon) storage device. Because the I/O bandwidth for the temporary data depends on the number of local disks, and clusters using Isilon storage might not have as many local disks attached, pay special attention on Isilon-enabled clusters to any queries that use the spill-to-disk feature. Where practical, tune the queries or allocate extra memory for Impala to avoid spilling. Although you can specify an Isilon storage device as the destination for the temporary data for the spill-to-disk feature, that configuration is not recommended due to the need to transfer the data both ways using remote I/O.

When tuning Impala queries on HDFS, you typically try to avoid any remote reads. When the data resides on Isilon storage, all the I/O consists of remote reads. Do not be alarmed when you see non-zero numbers for remote read measurements in query profile output. The benefit of the Impala and Isilon integration is primarily convenience of not having to move or copy large volumes of data to HDFS, rather than raw query performance. You can increase the performance of Impala I/O for Isilon systems by increasing the value for the ‑‑num_remote_hdfs_io_threads configuration parameter, in the Cloudera Manager user interface for clusters using Cloudera Manager, or through the ‑‑num_remote_hdfs_io_threads startup option for the impalad daemon on clusters not using Cloudera Manager.

For information about managing Isilon storage devices through Cloudera Manager, see Installing Cloudera Manager and CDH with Isilon.

Required Configurations

Specify the following configurations in Cloudera Manager on the Clusters > Isilon Service > Configuration tab:

In HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml hdfs-site.xml and the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties for the Isilon service, set the value of the dfs.client.file-block-storage-locations.timeout.millis property to 10000.
In the Isilon Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property for the Isilon service, set the value of the hadoop.security.token.service.use_ip property to FALSE.
If you see errors that reference the .Trash directory, make sure that the Use Trash property is selected.

Configuring Replication with Kerberos and Isilon

If you plan to use replication between clusters that use Isilon storage and that also have enabled Kerberos, do the following:

Create a custom Kerberos Keytab and Kerberos principal that the replication jobs use to authenticate to storage and other CDH services. See Authentication.
In Cloudera Manager, select Administration > Settings.
Search for and enter values for the following properties:
- Custom Kerberos Keytab Location – Enter the location of the Custom Kerberos Keytab.
- Custom Kerberos Principal Name – Enter the principal name to use for replication between secure clusters.
When you create a replication schedule, enter the Custom Kerberos Principal Name in the Run As Username field. See Configuring Replication of HDFS Data and Configuring Replication of Hive/Impala Data.
Ensure that both the source and destination clusters have the same set of users and groups. When you set ownership of files (or when maintaining ownership), if a user or group does not exist, the chown command fails on Isilon. See Performance and Scalability Limitations
Cloudera recommends that you do not select the Replicate Impala Metadata option for Hive/Impala replication schedules. If you need to use this feature, create a custom principal of the form hdfs/hostname@realm or impala/hostname@realm.
Add the following property and value to the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties:
```
hadoop.security.token.service.use_ip = false
```

If the replication MapReduce job fails with the an error similar to the following:

java.io.IOException: Failed on local exception: java.io.IOException:
  org.apache.hadoop.security.AccessControlException:
  Client cannot authenticate via:[TOKEN, KERBEROS];
  Host Details : local host is: "foo.mycompany.com/172.1.2.3";
  destination host is: "myisilon-1.mycompany.com":8020;

Set the Isilon cluster-wide time-to-live setting to a higher value on the destination cluster for the replication: Note that higher values may affect load balancing in the Isilon cluster by causing workloads to be less distributed. A value of 60 is a good starting point. For example:

isi networks modify pool subnet4:nn4 --ttl=60

You can view the settings for a subnet with a command similar to the following:

isi networks list pools --subnet subnet3 -v

Configuring Proxy Users to Access HDFS

Configuring Heterogeneous Storage in HDFS