Tutorial: Clusters and Jobs on Azure

This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs. The tutorial uses publicly available data that shows the usage of Medicare procedure codes.

You must set up an ADLS account to store the tutorial job examples and input data and to write output data. Cloudera has created a jar file that contains the job examples and input files that you need to complete the tutorial. Before you start the exercises, upload the files to the ADLS account that you set up for the tutorial.

The tutorial has the following sections:
Prerequisites
To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus.
Sample Files Upload
Upload the files you need to complete the tutorial.
Altus Console Login
Log in to the Altus console to perform the exercises in this tutorial.
Exercise 1: Installing the Altus Client
Learn how to install the Altus client and register an access key to use the CLI.
Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs
Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise provides instructions on how to create a SOCKS proxy so that you can view the cluster and monitor the job in Cloudera Manager. It also shows you how to delete the cluster on the console.
Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs
Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through the process of creating a SOCKS proxy and accessing Cloudera Manager. It also shows you how to delete the cluster on the console.

Prerequisites

Before you start the tutorial, ensure that you have access to resources in your Azure subscription and an Altus user account with permission to create clusters and run jobs in Altus.

The following are prerequisites for the tutorial:
  • Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your Azure subscription. The Altus administrator must also assign roles to your user account to allow you to create clusters and run jobs in Altus.

    For more information about getting an Altus user account, see Getting Started in Altus.

  • Public key. When you create a cluster, provide an SSH public key that Altus can add to the cluster. You can then use the corresponding private key to access the cluster after the cluster is created. If you do not already have a key pair, see the example command after this list.
  • Azure Data Lake Store (ADLS) account. Set up an ADLS account to store sample jobs and input data files for use in the tutorial. You also write job output to the same account. The ADLS account must be set up with permissions to allow read and write access when you run the Altus jobs.

    For more information about creating an ADLS account in Azure, see Get started with Azure Data Lake Store using the Azure portal.
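
If you do not already have an SSH key pair, you can generate one with ssh-keygen. The following command is only an example; the key file name is an illustration, not a requirement:
ssh-keygen -t rsa -b 4096 -f ~/.ssh/altus_tutorial_key

The command writes the private key to ~/.ssh/altus_tutorial_key and the public key to ~/.ssh/altus_tutorial_key.pub. Provide the contents of the .pub file when you create a cluster, and keep the private key available for the SOCKS proxy exercises later in the tutorial.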

Sample Files Upload

Cloudera provides a jar file that contains the Altus job examples and the input files used in the tutorial. Before you start the tutorial, upload the jar file to your ADLS account so that the job examples and data are available for your use. Use the Azure Cloud Shell to upload the file.

To upload the jar file to your ADLS account, complete the following steps:
  1. Follow the instructions in the Azure documentation to set up an Azure Cloud Shell with a bash environment.
  2. Run the following command to download the altus_adls_upload_examples.sh script:
    wget https://raw.githubusercontent.com/cloudera/altus-azure-tools/master/upload-examples/altus_adls_upload_examples.sh

    You use the script to upload the files that you need for the tutorial to your ADLS account.

  3. In the Azure Cloud Shell, follow the instructions in the Azure documentation to log in to Azure using the Azure CLI.

    The Azure CLI is installed with Azure Cloud Shell so you do not need to install it separately.

  4. Run the script to upload the tutorial files to your ADLS account:
    bash ./altus_adls_upload_examples.sh --adls-account YourADLSaccountname --adls-path cloudera-altus-data-engineering-samples
  5. Verify that the tutorial examples and input data files are uploaded to the cloudera-altus-data-engineering-samples folder in your ADLS account, for example by listing the folder as shown below.
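
    One way to check is to list the folder with the Azure CLI from the Azure Cloud Shell. The following command is a sketch that assumes the az dls commands are available in your Cloud Shell; substitute your ADLS account name:
    az dls fs list --account YourADLSaccountname --path /cloudera-altus-data-engineering-samples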

Altus Console Login

To access the Altus console, go to the following URL: https://console.altus.cloudera.com/.

Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your home page.

The Data Engineering section appears in the side navigation panel. If you have been assigned roles and an environment in Altus, you can click Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises.

Exercise 1: Installing the Altus Client

To use the Altus CLI, you must install the Altus client and configure the client with an access key.

Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate an access key and register it with the Altus client to create a credentials file so that you do not need to submit your access key with each command.

This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLI command to register the key.

For instructions to install the Altus client on Windows, see Installing the Altus Client on Windows.

To set up the Cloudera Altus client, complete the following tasks:
  1. Install the Altus client.
  2. Configure the Altus client with an access key.

Step 1. Install the Altus Client

To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client.

The following commands show how you can use pip to install the client on a virtual environment on Linux:
mkdir ~/altusclienv
virtualenv ~/altusclienv --no-site-packages
source ~/altusclienv/bin/activate 
~/altusclienv/bin/pip install altuscli 
To upgrade the client to the latest version, run the following command:
~/altusclienv/bin/pip install --upgrade altuscli 

After the client installation process is complete, run the following command to confirm that the Altus client is working:

If virtualenv is activated: altus --version

If virtualenv is not activated: ~/altusclienv/bin/altus --version

Step 2. Configure the Altus Client with the API Access Key

You use the Altus console to generate the access key that you register with the client. Keep the window that displays the access key on the console open until you complete the key registration process.

To create and set up the client with a Cloudera Altus API access key:
  1. Sign in to the Cloudera Altus console:

    https://console.altus.cloudera.com/

  2. Click your user account name and select My Account.
  3. On the My Account page, click Generate Access Key.

    Altus creates the key and displays the information on the screen. The following image shows an example of an Altus API access key as displayed on the Altus console:

  4. On the command line, run the following command to configure the client with the access key:
    altus configure
  5. Enter the following information at the prompt:
    • Altus Access key. Copy and paste the access key ID that you generated in the Cloudera Altus console.
    • Altus Private key. Copy and paste the private key that you generated in the Cloudera Altus console. The private key is a very long string of characters. Make sure that you enter the full string.

    The configuration utility creates the following file to store your user credentials: ~/.altus/credentials (a sketch of its layout appears after these steps).

  6. To verify that the credentials were created correctly, run the following command:
    altus iam get-user

    The command displays your Altus client credentials.

  7. After the credentials file is created, you can go back to the Cloudera Altus console and click OK to exit the access key window.
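
For reference, the credentials file uses an INI-style layout. The following sketch shows what the file typically contains; the exact property names can vary by client version, so treat this as an illustration only:
[default]
altus_access_key_id = your-access-key-id
altus_private_key = your-private-key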

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs

This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy so that you can access the cluster and view the progress of the job on Cloudera Manager.

In this exercise, you complete the following tasks:
  1. Create a cluster with a Spark service on the console.
  2. Submit a Spark job on the console.
  3. Create a SOCKS proxy to access the Spark cluster on Cloudera Manager.
  4. View the Spark cluster and verify the Spark job output.
  5. Submit a Spark job using the CLI.
  6. Terminate the Spark cluster.

Creating a Spark Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster on the console:
  1. In the Data Engineering section of the side navigation panel, click Clusters.
  2. On the Clusters page, click Create Cluster.
  3. Create a cluster with the following configuration:
    Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example.
    Service Type: Spark 2.x
    CDH Version: CDH 5.14
    Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
    Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties at their default settings.
    Credentials: Configure your access credentials to Cloudera Manager:
    • SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
    • Cloudera Manager User: Set both the user name and password to guest.
  4. Verify that all required fields are set and click Create Cluster.

    The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.
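
You can also create a similar cluster from the command line with the create-azure-cluster command. The following is a rough sketch; the flag names and values shown here are assumptions for illustration only, so check the Altus CLI help for create-azure-cluster before you run it:
altus dataeng create-azure-cluster \
--cluster-name mjones-spark-tutorial \
--service-type SPARK \
--cdh-version CDH514 \
--environment-name YourEnvironmentName \
--workers-group-size 3 \
--public-key "YourSSHPublicKey"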

Submitting a Spark Job

Submit a Spark job to run on the cluster you created in the previous task.

To submit a Spark job on the console:
  1. In the Data Engineering section of the side navigation panel, click Jobs.
  2. Click Submit Jobs.
  3. On the Job Settings page, select Single job.
  4. Select the Spark job type.
  5. Create a Spark job with the following configuration:
    Job Name: Set the job name to Spark Medical Example.
    Main Class: Set the main class to com.cloudera.altus.sample.medicare.transform
    Jars: Use the tutorial jar file: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar
    Application Arguments: Set the application arguments to the ADLS paths to use for job input and output.

    Add the tutorial ADLS path for the job input: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/

    Click + and add the ADLS path for the job output: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output

    Cluster Settings: Use an existing cluster and select the cluster that you created in the previous task.

    The following figure shows the Submit Jobs page with the settings for this tutorial:

  6. Verify that all required fields are set and click Submit Jobs.

    The Altus Data Engineering service submits the job to run on the selected cluster in your Azure subscription.

Creating a SOCKS Proxy for the Spark Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job.

To create a SOCKS proxy to access Cloudera Manager:
  1. In the Data Engineering section of the side navigation panel, click Clusters.
  2. On the Clusters page, find the cluster on which you submitted the job and click the cluster name.
  3. On the cluster detail page, click View SOCKS Proxy CLI Command.

    Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created.

  4. Click Copy.
  5. In a terminal window, paste the command.
  6. Modify the command to use the name of the cluster you created and your private key and run the command:
    altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"

    The Cloudera Manager Admin console opens in a Chrome browser.
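
    If the console does not open automatically, or you want to use a different browser, you can point a browser at the SOCKS proxy yourself. For example, assuming the proxy listens on local port 1080 (check the port that the socks-proxy command prints):
    google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir=/tmp/altus-proxy-profile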

Viewing the Cluster and Verifying the Spark Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:
  1. Log in to Cloudera Manager using guest as the account name and password.
  2. On the Home page, click Clusters on the top navigation bar.
  3. On the cluster window, select YARN Applications.

    The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:

When your Spark job completes, you can view the output of the Spark job in the ADLS account that you specified for your job output. The Spark job creates the following files in your ADLS output folder:
  • _SUCCESS (0 bytes)
  • part-00000 (65.5 KB)
  • part-00001 (69.5 KB)
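
To confirm the output files from the command line, you can list the output folder with the Azure CLI in the Azure Cloud Shell (a sketch; substitute your ADLS account name):
az dls fs list --account YourADLSaccountname --path /cloudera-altus-data-engineering-samples/spark/medicare/output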

Submitting a Spark Job using the CLI

You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a Spark job using the CLI, run the following command:
altus dataeng submit-jobs \
--cluster-name FirstInitialLastName-tutorialcluster \
--jobs '{ "sparkJob": {
                "jars": [
                    "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar"
                ],
                "mainClass": "com.cloudera.altus.sample.medicare.transform",
                "applicationArguments": [
                   "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/",
                   "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output"
                ]
            }}'
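
To check the status of the submitted job from the command line, you can use the list-jobs command. This is a sketch; confirm the exact command and options with the Altus CLI help:
altus dataeng list-jobs --cluster-name FirstInitialLastName-tutorialcluster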

To view the workload summary, go to the Cloudera Manager console and click Clusters > SPARK_ON_YARN-1. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console.

To verify the output, go to the ADLS account that you specified for your job output and verify that it contains the files created by the Spark job:
  • _SUCCESS (0 bytes)
  • part-00000 (65.5 KB)
  • part-00001 (69.5 KB)

Terminating the Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:
  1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
  2. On the Clusters page, click the name of the cluster that you created for this tutorial.
  3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
  4. Click Actions and select Delete Cluster.
  5. Click OK to confirm that you want to terminate the cluster.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs

This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy so that you can access the cluster and view the progress of the jobs on Cloudera Manager.

In this exercise, you complete the following tasks:
  1. Create a cluster with a Hive service on the console.
  2. Submit a group of Hive jobs on the console.
  3. Create a SOCKS proxy to access the Hive cluster on Cloudera Manager.
  4. View the Hive cluster and verify the Hive job output.
  5. Submit a group of Hive jobs using the CLI.
  6. Terminate the Hive cluster.

Creating a Hive Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster with a Hive service on the console:
  1. In the Data Engineering section of the side navigation panel, click Clusters.
  2. On the Clusters page, click Create Cluster.
  3. Create a cluster with the following configuration:
    Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example.
    Service Type: Hive
    CDH Version: CDH 5.14
    Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
    Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties at their default settings.
    Credentials: Configure your access credentials to Cloudera Manager:
    • SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
    • Cloudera Manager User: Set both the user name and password to guest.
  4. Verify that all required fields are set and click Create Cluster.

    The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.

Submitting a Hive Job Group

Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.

To submit a job group on the console:
  1. In the Data Engineering section of the side navigation panel, click Jobs.
  2. Click Submit Jobs.
  3. On the Job Settings page, select Group of jobs.
  4. Select the Hive job type.
  5. Set the Job Group Name to Hive Medical Example.
  6. Click Add Hive Job.
  7. Create a job with the following configuration:
    Job Name: Set the job name to Create External Tables.
    Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part1.hql
    Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
    • HOSPITALS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/hospitals/
    • READMISSIONS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/
    • EFFECTIVECARE_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
    • GDP_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/GDP/
    Action on Failure: Select Interrupt Job Queue.
    The following figure shows the Add Job window with the settings for this job:

  8. Click OK to add the job to the group.

    On the Submit Jobs page, Altus adds the Hive Medical Example job to the list of jobs in the group.

  9. Click Add Hive Job.
  10. Create a job with the following configuration:
    Job Name: Set the job name to Clean Data.
    Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part2.hql
    Action on Failure: Select Interrupt Job Queue.
    The following figure shows the Add Job window with the settings for this job:

  11. Click OK.

    On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.

  12. Click Add Hive Job.
  13. Create a job with the following configuration:
    Job Name: Set the job name to Write Output.
    Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part3.hql
    Hive Script Parameters: Select Hive Script Parameters and add the ADLS folder that you created for the job output as a variable:
    • OUTPUT_DIR: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/output/
    Action on Failure: Select None.
    The following figure shows the Add Job window with the settings for this job:

  14. Click OK.

    On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.

  15. In the Cluster Settings section, select Use existing and select the Hive cluster you created for this exercise.

    The list of clusters displayed includes only those clusters that can run Hive jobs.

  16. Click Submit Jobs to run the job group on your Hive cluster.

Creating a SOCKS Proxy for the Hive Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.

To create a SOCKS proxy to access Cloudera Manager:
  1. In the Data Engineering section of the side navigation panel, click Clusters.
  2. On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
  3. On the cluster detail page, click View SOCKS Proxy CLI Command.

    Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.

  4. Click Copy.
  5. In a terminal window, paste the command.
  6. Modify the command to use the name of the cluster you created and your private key and then run the following command:
    altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"

    The Cloudera Manager Admin console opens in a Chrome browser.

Viewing the Hive Cluster and Verifying the Hive Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:
  1. Log in to Cloudera Manager using guest as the account name and password.
  2. On the Home page, click Clusters on the top navigation bar.
  3. On the cluster window, select YARN Applications.

    The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:

  4. Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI.

    The following screenshots show the workload information that you can view for the Hive service:

  5. When the jobs complete, go to the ADLS account that you specified for your job output and verify the file created by the Hive jobs.

    The Hive jobs create the following file in your ADLS output folder: 000000_0 (135.9 KB)

Submitting a Hive Job Group using the CLI

You can submit the same group of Hive jobs to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a group of Hive jobs using the CLI, run the submit-jobs command and provide the list of jobs in the jobs parameter. Use the same cluster and the same job group name that you used on the console.

Run the following command:
altus dataeng submit-jobs \
--cluster-name FirstInitialLastName-tutorialcluster \
--job-submission-group-name "Hive Medical Example" \
--jobs '[
         { "name": "Create External Tables",
           "failureAction": "INTERRUPT_JOB_QUEUE",
           "hiveJob": {
             "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part1.hql",
             "params": ["HOSPITALS_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/hospitals/", 
                        "READMISSIONS_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/", 
                        "EFFECTIVECARE_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/effectiveCare/", 
                        "GDP_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/GDP/"]
         }},
         { "name": "Clean Data",
           "failureAction": "INTERRUPT_JOB_QUEUE",
           "hiveJob": {
             "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part2.hql"
         }},
         { "name": "Output Data",
           "failureAction": "NONE",
           "hiveJob": {
             "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part3.hql",
             "params": ["outputdir=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/output/"]
         }}
        ]'
You can go to the Cloudera Manager console to view the status of the Hive cluster and jobs:
  • To view the workload summary, click Clusters > SPARK_ON_YARN-1.
  • To view the job information, click Clusters > HIVE-1 > HiveServer2 Web UI.
Cloudera Manager displays the same workload summary and job queries for this job as for the job that you submitted through the console.

When the jobs complete, go to the ADLS account that you specified for your job output and verify the file created by the Hive jobs. The Hive job group creates the following file in your ADLS output folder: 000000_0 (135.9 KB)

Terminating the Hive Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:
  1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
  2. On the Clusters page, click the name of the cluster that you created for this tutorial.
  3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
  4. Click Actions and select Delete Cluster.
  5. Click OK to confirm that you want to terminate the cluster.