Running and Monitoring Jobs Using the CLI

Use the Cloudera Altus client to submit a job or view the properties of a job. The commands listed here serve as examples of how to use the Altus CLI to submit jobs.

For more information about the commands available in the Altus client, run the following command:
altus dataeng help 
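
To monitor a job after you submit it, you can view its status and properties from the same client. The subcommand names in the following sketch are illustrative only and are not confirmed by this guide; run altus dataeng help to see the exact job listing and job description commands that your version of the client provides:
# Illustrative only: confirm the actual subcommand and parameter names with altus dataeng help.
altus dataeng list-jobs \
 --cluster-name ClusterName

altus dataeng describe-job \
 --job-id JobId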

Submitting a Spark Job

You can use the following command to submit a Spark job:
altus dataeng submit-jobs \
 --cluster-name ClusterName \
 --jobs '{ "sparkJob": {
             "jars": [
               "PathAndFilenameOfJar1",
               "PathAndFilenameOfJar2"
             ]
        }}'

You can include the applicationArguments parameter to pass values to the main method and the sparkArguments parameter to specify Spark configuration settings. If you use the applicationArguments and sparkArguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.

You can also add the mainClass parameter to specify the entry point of your application.
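
For example, the following command sketch combines the parameters described above. The jar path, MainClassOfApplication, and the application argument values are placeholders, and the sparkArguments value follows the same form used in the examples later in this section. Arguments that contain characters with special meaning to the shell or to JSON must be escaped:
altus dataeng submit-jobs \
 --cluster-name ClusterName \
 --jobs '{ "sparkJob": {
             "jars": [
               "PathAndFilenameOfJar"
             ],
             "mainClass": "MainClassOfApplication",
             "sparkArguments": "--executor-memory 1G --num-executors 2",
             "applicationArguments": [
               "FirstApplicationArgument",
               "SecondApplicationArgument"
             ]
        }}'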

Spark Job Examples

Pi Estimation Example
Spark provides a library of code examples that illustrate how Spark works. The following example uses the Pi estimation example from the Spark library to show how to submit a Spark job using the Altus CLI.
You can use the following command to submit a Spark job to run the Pi estimation example:
altus dataeng submit-jobs \
 --cluster-name ClusterName \
 --jobs '{ "sparkJob": {
             "jars": [
               "local:///opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar"
             ],
             "sparkArguments" : "--executor-memory 1G --num-executors 2",
             "mainClass": "org.apache.spark.examples.SparkPi"
        }}'

The --cluster-name parameter requires the name of a Spark cluster.

Medicare Example
The following example processes publicly available data to show the usage of Medicare procedure codes. The Spark job is available in a Cloudera Altus S3 bucket of job examples and reads input data from the Cloudera Altus example S3 bucket. You can create an S3 bucket in your account to write output data.

To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the job.

To run the Spark job example:
  1. Create a Spark cluster to run the job.
    You can create a cluster with the Spark 2.x or Spark 1.6 service type. The version of the Spark service in the cluster must match the version of the Spark jar file:
    • For Spark 2.x, use the example jar file named altus-sample-medicare-spark2x.jar
    • For Spark 1.6, use the example jar file named altus-sample-medicare-spark1x.jar

    For more information about creating a cluster, see Creating a Cluster.

  2. Create an S3 bucket in your AWS account.
  3. Use the following command to submit the Medicare job:
    altus dataeng submit-jobs \
    --cluster-name ClusterName \
    --jobs '{ "sparkJob": {
                    "jars": [
                        "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-SparkVersion.jar"
                    ],
                    "mainClass": "com.cloudera.altus.sample.medicare.transform",
                    "applicationArguments": [
                       "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/",
                       "s3a://NameOfOutputS3Bucket/OutputPath/"
                    ]
                }}'

    The --cluster-name parameter requires the name of a cluster with a version of Spark that matches the version of the example jar file.

    The jars parameter requires the name of the jar file that matches the version of the Spark service in the cluster.
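
    For example, if you created the cluster with the Spark 2.x service type, the jars entry in the command above becomes:
    "jars": [
        "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar"
    ]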

Submitting a Hive Job

You can use the following command to submit a Hive job:
altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{ "hiveJob": { 
             "script": "PathAndFilenameOfHQLScript" 
        }}'

You can also include the jobXml parameter to pass job configuration settings for the Hive job.

The following is an example of the content of a Hive job XML that you can use with the jobXml parameter:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
</configuration>
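
For illustration only, the following sketch assumes that you save the configuration above to a local file (hiveJobConfig.xml is a hypothetical name) and that the jobXml parameter accepts a file reference in the same file:// form as the script parameter. This guide does not confirm the exact format, so check altus dataeng help for the format that your client expects:
altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{ "hiveJob": {
             "script": "file:///path/to/hiveScript.hql",
             "jobXml": "file:///path/to/hiveJobConfig.xml"
        }}'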

Hive Job Example

The following example of a Hive job reads data from a CSV file in an S3 bucket in the Cloudera AWS account. It then writes the same data, with the commas changed to colons, to an S3 bucket in your AWS account.

To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the example Hive script.

To run the Hive job example:
  1. Create a cluster to run the Hive job.

    You can run a Hive job on a Hive on MapReduce or Hive on Spark cluster. For more information about creating a cluster, see Creating a Cluster.

  2. Create an S3 bucket in your AWS account.
  3. Create a Hive script file on your local drive.

    This example uses the file name hiveScript.hql.

  4. Copy and paste the following script into the file:
    DROP TABLE input;
    DROP TABLE output;
    
    CREATE EXTERNAL TABLE input(f1 STRING, f2 STRING, f3 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3a://cloudera-altus-data-engineering-samples/hive/data/';
    
    CREATE TABLE output(f1 STRING, f2 STRING, f3 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
    STORED AS TEXTFILE
    LOCATION 's3a://NameOfOutputS3Bucket/OutputPath/';
    
    INSERT OVERWRITE TABLE output SELECT * FROM input ORDER BY f1;
  5. Modify the script and replace the name and path of the output S3 bucket with the name and path of the S3 bucket you created in your AWS account.
  6. Run the following command:
    altus dataeng submit-jobs \
    --cluster-name=ClusterName \
    --jobs '{ "hiveJob": { 
                 "script": "PathToHiveScript/hiveScript.hql" 
            }}'

    The --cluster-name parameter requires the name of a Hive on MapReduce or Hive on Spark cluster.

    The script parameter requires the absolute path and file name of the script file prefixed with file://.

    For example: --jobs '{ "hiveJob": { "script": "file:///file/path/to/my/hiveScript.hql" }}'

Submitting a MapReduce Job

You can use the following command to submit a MapReduce job:
altus dataeng submit-jobs \
 --cluster-name ClusterName \
 --jobs '{ "mr2Job": {
             "mainClass": "main.class.file",
             "jars": [
               "PathAndFilenameOfJar1",
               "PathAndFilenameOfJar2"
             ]
        }}'

Altus uses Oozie to run MapReduce2 jobs. When you submit a MapReduce2 job in Altus, Oozie launches a Java action to process the MapReduce2 job request. You can specify configuration settings for your job in an XML configuration file. To load the Oozie configuration settings into the MapReduce2 job, load the job XML file into the Java main class of the MapReduce2 application.

For example, the following code snippet from a MapReduce2 application shows the oozie.action.conf.xml being loaded into the application:
// Hadoop classes used by this snippet:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public int run(String[] args) throws Exception {
  Job job = Job.getInstance(loadJobConfiguration(), "wordcount");
  ...
  // Launch MR2 Job
  ...
}

private Configuration loadJobConfiguration() {
  String ooziePreparedConfig = System.getProperty("oozie.action.conf.xml");
  if (ooziePreparedConfig != null) {
    // Oozie collects hadoop configs with job.xml into a single file.
    // So default config is not needed.
    Configuration actionConf = new Configuration(false);
    actionConf.addResource(new Path("file:///", ooziePreparedConfig));
    return actionConf;
  } else {
    // Not launched through Oozie: fall back to the default Hadoop configuration.
    return new Configuration(true);
  }
}

MapReduce Job Example

The following example of a MapReduce job is available in the Cloudera Altus S3 bucket of job examples. The job reads input data from a poetry file in the Cloudera Altus example S3 bucket.

To use the example, set up an S3 bucket in your AWS account to write output data. Set the S3 bucket permissions to allow write access when you run the job.

You can use the following command to submit a MapReduce job to run the example:
altus dataeng submit-jobs \
 --cluster-name ClusterName \
 --jobs '{ "mr2Job": {
             "mainClass": "com.cloudera.altus.sample.mr2.wordcount.WordCount",
             "jars": ["s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar"],
             "arguments": [
                   "s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
                   "s3a://NameOfOutputS3Bucket/OutputPath/"
                ] 
        }}'

The --cluster-name parameter requires the name of a MapReduce cluster.

Submitting a PySpark Job

You can use the following command to submit a PySpark job:
altus dataeng submit-jobs \
    --cluster-name=ClusterName \
    --jobs '{
        "name": "WordCountJob",
        "pySparkJob": {
             "mainPy": "PathAndFilenameOfThePySparkMainFile",
             "sparkArguments" : "SparkArgumentsRequiredForYourApplication",
             "pyFiles" : ["PythonFilesRequiredForYourApplication"],
             "applicationArguments": [
               "PathAndFilenameOfFile1",
               "PathAndFilenameOfFile2"
             ]
        }}'

You can include the applicationArguments parameter to pass values to the main method and the sparkArguments parameter to specify Spark configuration settings. If you use the applicationArguments and sparkArguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.

The --cluster-name parameter requires the name of a Spark cluster.

The pyFiles parameter takes the path and file names of Python modules. For example:
"pyFiles" : ["s3a://path/to/module1.py", "s3a://path/to/module2.py"]

PySpark Job Example

The following example uses a PySpark job to count words in a text file and write the result to an S3 bucket that you specify. The Python file is available in a Cloudera Altus S3 bucket of job examples and also reads input data from the Cloudera Altus S3 bucket. You can create an S3 bucket in your account to write output data.

The job in this example runs on a cluster with the Spark 2.2 service.

You can use the following command to submit a PySpark job to run the word count example:
altus dataeng submit-jobs \
    --cluster-name=ClusterName \
    --jobs '{
        "name": "Word Count Job",
        "pySparkJob": {
             "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
             "sparkArguments" : "--executor-memory 1G --num-executors 2",
             "applicationArguments": [
                   "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/input/HadoopPoem0.txt",
                   "s3a://NameOfOutputS3Bucket/PathToOutputFile"
             ]
        }}'

If you need to use a specific Python environment for your PySpark job, you can use the --instance-bootstrap-script parameter to include a bootstrap script to install a custom Python environment when Altus creates the cluster.

For an example of how to use a bootstrap script in the create-aws-cluster command to install a Python environment for a PySpark job, see Example: Creating a Cluster for a PySpark Job.