Creating and Working with Clusters Using the CLI

You can use the Cloudera Altus client to create a cluster, view the properties of a cluster, or terminate a cluster. Use the commands listed here as examples of how to work with the Altus CLI.

When you run the command to create a cluster, you must specify the correct value for the service type. The following table lists each service type value and the type of job that the service supports:

Service Type     Job Type
HIVE             hiveJob
HIVE_ON_SPARK    hiveJob
SPARK            sparkJob or pySparkJob
SPARK_16         sparkJob or pySparkJob
MR2              mr2Job
MULTI            Multiple types of jobs: hiveJob, sparkJob, pySparkJob, mr2Job

A cluster with the MULTI service type supports Spark 2.x; it does not support Spark 1.6.

For more information about the commands available for the Altus Data Engineering service in the Altus client, run the following command:
altus dataeng help 
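
The client typically also offers help for individual commands. For example, the following command follows the Altus client's general help pattern (shown here as an assumption; verify with your client):

altus dataeng create-aws-cluster help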

Creating a Cluster for AWS

You can use the following command to create a cluster:

altus dataeng create-aws-cluster
--service-type=ServiceType 
--workers-group-size=NumberOfWorkers 
--cluster-name=ClusterName 
--instance-type=InstanceType 
--cdh-version=CDHVersion 
--public-key=FullPathAndFileNameOfPublicKeyFile
--environment-name=AltusEnvironmentName
--compute-workers-configuration='{"groupSize": NumberOfComputeWorkers, "useSpot": true, "bidUSDPerHr": BidPrice}'
Guidelines for using the create-aws-cluster command:
  • You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:
    • HIVE
    • HIVE_ON_SPARK
    • SPARK

      Use this service type for Spark 2.x.

    • SPARK_16

      Use this service type only if your application specifically requires Spark version 1.6. If you specify SPARK_16 in the service-type parameter, you must specify CDH511 in the cdh-version parameter.

    • MR2
    • MULTI

      A cluster with the MULTI service type allows you to run different types of jobs. You can run the following types of jobs in a MULTI cluster: Spark 2.x, Hive, and MapReduce2.

  • You must specify the version of CDH to include in the cluster. In the cdh-version parameter, use one of the following version names to specify the CDH version:
    • CDH61
    • CDH515
    • CDH514
    • CDH513
    • CDH512
    • CDH511
  • The CDH version that you specify can affect the service that runs on the cluster:
    Spark 2.x or Spark 1.6
    For a Spark service type, you must select the CDH version that supports the selected Spark version. Altus supports the following combinations of CDH and Spark versions:
    • CDH 6.1 with Spark 2.4
    • CDH 5.12 or later 5.x versions with Spark 2.2
    • CDH 5.11 with Spark 2.1 or Spark 1.6
    Hive on Spark
    On CDH version 5.13 or later, dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set.
  • The CDH version that you specify also affects the SDX namespace you can use with the cluster:
    CDH 6.1
    You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.
    CDH 5.x
    You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.
  • The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example: --public-key=file:///my/file/path/to/ssh/publickey.pub

    Altus adds the public key to the authorized_keys file on each node in the cluster.

  • You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Data Engineering service generates a guest username and password for the Cloudera Manager user account.
  • The compute-workers-configuration parameter is optional. It adds compute worker nodes to the cluster in addition to worker nodes. Compute worker nodes run only computational processes. If you do not set the configuration for the compute workers, Altus creates a cluster with no compute worker nodes. For a command that sets this parameter, see the sketch that follows these guidelines.
  • The response object for the create-aws-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. Note the credentials from this response; they are not made available again.
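
Putting these guidelines together, the following command sketch creates a cluster that includes two compute worker nodes on Spot Instances. All names and values are placeholders, and the bid price is only an example:

altus dataeng create-aws-cluster 
    --environment-name=MyAltusEnvironment 
    --service-type=MULTI 
    --workers-group-size=4 
    --cluster-name=example-multi-cluster 
    --instance-type=m4.xlarge 
    --cdh-version=CDH515 
    --public-key=file:///my/file/path/to/ssh/publickey.pub 
    --compute-workers-configuration='{"groupSize": 2, "useSpot": true, "bidUSDPerHr": 0.50}'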

Example: Creating a Cluster in AWS for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job. The Python script file is available in the Cloudera Altus S3 bucket of job examples.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm.

altus dataeng create-aws-cluster 
    --environment-name=EnvironmentName 
    --service-type=SPARK 
    --workers-group-size=3 
    --cluster-name=ClusterName 
    --instance-type=m4.xlarge 
    --cdh-version=CDH512 
    --public-key=file:///PathToPublicKeyFile/publickey.pub 
    --instance-bootstrap-script='file:///PathToScript/bootstrapScript.sh' 
    --jobs '{
        "name": "PySpark ALS Job",
        "pySparkJob": {
            "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
            "sparkArguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
        }}'
The bootstrapScript.sh file in this example creates a Python environment with the default Python version shipped with Altus and installs the NumPy package. The spark.pyspark.python setting in the job's sparkArguments points Spark to this environment. The script has the following content:
#!/bin/bash
# Bootstrap script: install a custom Python environment for the PySpark job.

# Location of the environment that the job's spark.pyspark.python setting expects.
target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment ..."

# Create a virtualenv with the default Python version, then install NumPy into it.
virtualenv ${target}
${mypip} install numpy

# Report whether the pip install succeeded.
if [ $? -eq 0 ]; then
    echo "Successfully installed new python environment at ${target}"
else
    echo "Failed to install custom python environment at ${target}"
fi
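
To verify that the bootstrap script ran, you can connect to a cluster node with the key pair whose public key you provided and check the environment. A minimal sketch; the login user name (shown here as centos) and the node address are assumptions that depend on your cluster image and network setup:

ssh -i /my/file/path/to/ssh/privatekey centos@NodePublicIpAddress
/tmp/pyspark-env/bin/python -c "import numpy; print(numpy.__version__)"

If the bootstrap script succeeded, the second command prints the installed NumPy version.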

Creating a Cluster for Azure

You can use the following command to create a cluster:

altus dataeng create-azure-cluster
--service-type=ServiceType 
--workers-group-size=NumberOfWorkers 
--cluster-name=ClusterName 
--instance-type=InstanceType 
--cdh-version=CDHVersion 
--public-key=FullPathAndFileNameOfPublicKeyFile
--environment-name=AltusEnvironmentName
Guidelines for using the create-azure-cluster command:
  • You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:
    • HIVE
    • HIVE_ON_SPARK
    • SPARK

      Use this service type for Spark 2.x.

    • SPARK_16

      Use this service type only if your application specifically requires Spark version 1.6.

    • MR2
    • MULTI

      A cluster with the MULTI service type allows you to run different types of jobs. You can run the following types of jobs in a MULTI cluster: Spark 2.x, Hive, and MapReduce2.

  • Altus supports CDH 5.14, CDH 5.15, and CDH 6.1.

    In the cdh-version parameter, specify one of the following values: CDH514, CDH515, or CDH61.

  • The CDH version that you select affects how you use the cluster:
    CDH 6.1
    • You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.
    • For clusters with CDH 6.1, Altus archives logs to ADLS Gen1 or Gen2, based on the folder you specify.
    CDH 5.x
    • You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.
    • For clusters with CDH 5.x, Altus archives logs to ADLS Gen1.
  • The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example: --public-key=file:///my/file/path/to/ssh/publickey.pub
  • You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Altus Data Engineering service generates a guest username and password for the Cloudera Manager user account. For a command that sets these parameters, see the sketch that follows these guidelines.
  • The response object for the create-azure-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. Note the credentials from this response; they are not made available again.
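
For example, the following command sketch creates a cluster with explicit Cloudera Manager credentials. All names and values are placeholders:

altus dataeng create-azure-cluster 
    --environment-name=MyAltusEnvironment 
    --service-type=HIVE 
    --workers-group-size=3 
    --cluster-name=example-hive-cluster 
    --instance-type=STANDARD_DS12_V2 
    --cdh-version=CDH61 
    --public-key=file:///my/file/path/to/ssh/publickey.pub 
    --cloudera-manager-username=MyCMUser 
    --cloudera-manager-password=MyCMPassword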

Example: Creating a Cluster in Azure for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions that allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so that the job files and data files are available for your use. For instructions on uploading the sample files, see Sample Files Upload.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm.

altus dataeng create-azure-cluster 
    --environment-name=EnvironmentName 
    --service-type=SPARK 
    --workers-group-size=3 
    --cluster-name=ClusterName 
    --instance-type=STANDARD_DS12_V2 
    --cdh-version=CDH514 
    --public-key=file:///PathToPublicKeyFile/publickey.pub 
    --instance-bootstrap-script='file:///PathToScript/bootstrapScript.sh' 
    --jobs '{
        "name": "PySpark ALS Job",
        "pySparkJob": {
            "mainPy": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
            "sparkArguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
        }}'
The bootstrapScript.sh file in this example creates a Python environment with the default Python version shipped with Altus and installs the NumPy package. The spark.pyspark.python setting in the job's sparkArguments points Spark to this environment. The script has the following content:
#!/bin/bash
# Bootstrap script: install a custom Python environment for the PySpark job.

# Location of the environment that the job's spark.pyspark.python setting expects.
target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment ..."

# Create a virtualenv with the default Python version, then install NumPy into it.
virtualenv ${target}
${mypip} install numpy

# Report whether the pip install succeeded.
if [ $? -eq 0 ]; then
    echo "Successfully installed new python environment at ${target}"
else
    echo "Failed to install custom python environment at ${target}"
fi
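
If you already have a running cluster, you do not need to create a new one to run the job. The Altus client also provides a submit-jobs command for this case; the following sketch reuses the job definition from the example above with a placeholder cluster name (run altus dataeng help to confirm the exact options in your client version). Note that the job still expects the bootstrap-installed Python environment to exist on the cluster nodes:

altus dataeng submit-jobs 
    --cluster-name=example-spark-cluster 
    --jobs '{
        "name": "PySpark ALS Job",
        "pySparkJob": {
            "mainPy": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
            "sparkArguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
        }}'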

Viewing the Cluster Status

When you create a cluster, you can check its status immediately. If the cluster creation process is not yet complete, the command shows information about the progress of cluster creation.

You can use the following command to display the status of a cluster and other information:

altus dataeng describe-cluster 
--cluster-name=ClusterName

cluster-name is a required parameter.
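
For example, you can poll the command from a shell script to wait for a new cluster to become ready. This sketch assumes that the response JSON contains the cluster status at .cluster.status and that the jq utility is installed; check your client's actual output for the exact field name and status values:

#!/bin/bash

# Poll every 30 seconds until the cluster leaves the CREATING state.
# The .cluster.status path and the CREATING value are assumptions.
while true; do
    status=$(altus dataeng describe-cluster --cluster-name=ClusterName \
        | jq -r '.cluster.status')
    echo "Cluster status: ${status}"
    [ "${status}" != "CREATING" ] && break
    sleep 30
done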

Deleting a Cluster

You can use the following command to delete a cluster:

altus dataeng delete-cluster 
--cluster-name=ClusterName
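
Cluster deletion is not instantaneous. After you issue the delete command, you can run describe-cluster to check on the teardown; for example, with a placeholder cluster name:

altus dataeng delete-cluster 
--cluster-name=example-multi-cluster

altus dataeng describe-cluster 
--cluster-name=example-multi-cluster

Once deletion completes, the describe-cluster command no longer finds the cluster.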