Configuring Azure Data Lake Store to Use with CDH

Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing strong consistency, a hierarchical directory structure, and POSIX-style ACLs. See the ADLS documentation for conceptual details.

CDH 5.11 supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive on MRv2, Spark 2.1, and Spark 1.6. Use the following steps to set up a data store to use with these CDH components.

Setting up ADLS to Use with CDH

  1. To create your ADLS account, see the Microsoft documentation.
  2. Create the service principal in the Azure portal. See the Microsoft documentation on creating a service principal.

  3. Grant the service principal permission to access the ADLS account. See the Microsoft documentation on Authorization and access control. Review the section "Using ACLs for operations on file systems" for information about granting the service principal permission to access the account.

    You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only need data access.
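    As an illustrative sketch only, the same ACL grant can be made from the Azure CLI for a Data Lake Store (Gen1) account; the account name and the service principal's object ID below are placeholders you must substitute:

    ```shell
    # Grant the service principal (identified by its Azure AD object ID,
    # a placeholder here) rwx access to the root of the account.
    az dls fs access set-entry \
        --account your_account \
        --path / \
        --acl-spec "user:00000000-0000-0000-0000-000000000000:rwx"
    ```

    Remember that ADLS ACLs are not recursive by default, so directories created later under a path inherit only the path's default ACL entries.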

  4. In Cloudera Manager, enter the following configuration properties into the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml and save the changes.

    <property>
       <name>dfs.adls.oauth2.access.token.provider.type</name>
       <value>ClientCredential</value>
    </property>
    <property>
       <name>dfs.adls.oauth2.client.id</name>
       <value>your_client_id_from_step_2</value>
    </property>
    <property>
       <name>dfs.adls.oauth2.credential</name>
       <value>your_client_secret_from_step_2</value>
    </property>
    <property>
       <name>dfs.adls.oauth2.refresh.url</name>
       <value>refresh_URL_from_step_2</value>
    </property>
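    The value of dfs.adls.oauth2.refresh.url is the OAuth 2.0 token endpoint of the Azure AD tenant (directory) that holds your service principal. Assuming a placeholder tenant ID, it can be composed as follows:

    ```shell
    # Placeholder tenant ID -- substitute the Directory (tenant) ID shown
    # in the Azure portal for your service principal's directory.
    TENANT_ID="00000000-0000-0000-0000-000000000000"
    REFRESH_URL="https://login.microsoftonline.com/${TENANT_ID}/oauth2/token"
    echo "${REFRESH_URL}"
    ```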
    
  5. In Cloudera Manager, click Restart Stale Services so the cluster can read the new configuration information.
  6. Test your configuration by running the following command that lists files in your ADLS account:

    hadoop fs -ls adl://your_account.azuredatalakestore.net/

    If your configuration is correct, this command lists the files in your account.

  7. After successfully testing your configuration, you can access the ADLS account from MRv2, Hive on MRv2, Spark 1.6, or Spark 2.1 by using the following URI:

    adl://your_account.azuredatalakestore.net
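    As an illustration (the account name and paths below are hypothetical), the adl:// URI can be used anywhere an HDFS path is accepted:

    ```shell
    # Create a directory in the ADLS account and copy a local file into it.
    hadoop fs -mkdir adl://your_account.azuredatalakestore.net/landing
    hadoop fs -put /etc/hosts adl://your_account.azuredatalakestore.net/landing/hosts

    # DistCp also works between cluster HDFS and the ADLS account,
    # which is a common way to migrate existing data.
    hadoop distcp /user/alice/data adl://your_account.azuredatalakestore.net/backup/data
    ```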

ADLS Trash Folder Behavior

If the fs.trash.interval property is set to a value other than zero on your cluster, and you do not specify the -skipTrash flag when you remove files with hadoop fs -rm, the deleted files are moved to the trash folder in your ADLS account. The trash folder is located at adl://your_account.azuredatalakestore.net/user/user_name/.Trash/Current/. For more information about HDFS trash, see Configuring HDFS Trash.
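For example (the account name and file are placeholders), a delete without -skipTrash moves the file to the trash folder, while -skipTrash removes it immediately:

```shell
# Moved to adl://your_account.azuredatalakestore.net/user/<user_name>/.Trash/Current/
# (assuming fs.trash.interval is nonzero):
hadoop fs -rm adl://your_account.azuredatalakestore.net/old_file

# Deleted immediately, bypassing the trash folder:
hadoop fs -rm -skipTrash adl://your_account.azuredatalakestore.net/old_file
```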

User and Group Names Displayed as GUIDs

By default, ADLS user and group names are displayed as GUIDs. For example, these Hadoop commands produce the following output:

    $ hadoop fs -put /etc/hosts adl://your_account.azuredatalakestore.net/one_file
    $ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
    -rw-r--r--   1 94c1b91f-56e8-4527-b107-b52b6352320e cdd5b9e6-b49e-4956-be4b-7bd3ca314b18   273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file

To display user-friendly names instead, set the property adl.feature.ownerandgroup.enableupn to true in the core-site.xml file or at the command line. When this property is set to true, the -ls command returns the following output:

    $ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
    -rw-r--r--   1 YourADLSApp your_login_app   273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file
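To set the property at the command line for a single invocation, you can pass it with the generic -D option rather than editing core-site.xml (the account name here is a placeholder):

```shell
hadoop fs -D adl.feature.ownerandgroup.enableupn=true \
    -ls adl://your_account.azuredatalakestore.net/one_file
```

Note that resolving GUIDs to user-principal names adds a lookup per listing, so enabling this cluster-wide may slow down commands that list many files.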