Overview of Altus Data Engineering
Altus Data Engineering enables you to create clusters and run jobs specifically for data science and engineering workloads. The Altus Data Engineering service offers multiple distributed processing engines, including Hive, Spark, Hive on Spark, and MapReduce2 (MR2), which you can use for ETL, machine learning, and large-scale data processing workloads.
Altus Data Engineering Service Architecture
When you create an Altus Data Engineering cluster or submit a job in Altus, the Altus Data Engineering service accesses your AWS account or Azure subscription to create the cluster or run the job on your behalf.
If your Altus account uses AWS, an AWS administrator must set up a cross-account access role to provide Altus access to your AWS account. When a user in your Altus account creates an Altus Data Engineering cluster, the Altus Data Engineering service uses the Altus cross-account access credentials to create the cluster in your AWS account.
For more information about the AWS cross-account access role, see Cross-Account Access Role.
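As a point of reference, a cross-account access role in AWS is defined by a trust policy that allows Cloudera's account to assume the role. The following is an illustrative sketch only; the account ID and external ID shown are placeholders for the values that Altus provides during role setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ALTUS_AWS_ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "<EXTERNAL_ID>" } }
    }
  ]
}
```

The `sts:ExternalId` condition is the standard AWS safeguard against the confused-deputy problem when granting a third party access to your account.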
If your Altus account uses Azure, an administrator of your Azure subscription must provide consent for Altus to access the resources in your subscription. When a user in your Altus account creates an Altus Data Engineering cluster, your consent allows the Altus Data Engineering service to create the cluster in your Azure subscription.
For more information about consenting to Altus access, see Consenting to Altus Access to Subscription Resources.
Altus manages the clusters and jobs in your cloud provider account. You can configure your Altus Data Engineering cluster to terminate automatically when it is no longer in use.
When you submit a job to run on a cluster, the Altus Data Engineering service creates a job queue for the cluster and adds the job to the job queue. The Altus Data Engineering service then runs the jobs in the cluster in your cloud provider account. In AWS, the jobs in the cluster access the Amazon S3 object storage for data input and output. In Azure, the jobs in the cluster access Microsoft Azure Data Lake Store (ADLS) for data input and output.
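The per-cluster job queue described above can be pictured as a simple FIFO structure. The following is a conceptual sketch, not the Altus API; the class and method names are invented for illustration, and the S3 paths are placeholders:

```python
from collections import deque

class ClusterJobQueue:
    """Conceptual model of a per-cluster FIFO job queue (illustrative only)."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.queue = deque()
        self.completed = []

    def submit(self, job_name, input_path, output_path):
        # In Altus, input and output paths point at S3 (AWS) or ADLS (Azure).
        self.queue.append({"job": job_name,
                           "input": input_path,
                           "output": output_path})

    def run_next(self):
        # Jobs run in submission order; in Altus, execution metrics
        # are reported back to the service as each job runs.
        job = self.queue.popleft()
        self.completed.append(job["job"])
        return job

# Example: two jobs queued against one cluster, run in order.
q = ClusterJobQueue("etl-cluster")
q.submit("daily-etl", "s3a://example-bucket/raw/", "s3a://example-bucket/clean/")
q.submit("train-model", "s3a://example-bucket/clean/", "s3a://example-bucket/model/")
first = q.run_next()
```

This model only captures the ordering behavior; the actual service also handles provisioning, retries, and metric reporting on your behalf.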
The Altus Data Engineering service sends cluster diagnostic information and job execution metrics to Altus. It also stores the cluster and job information in your cloud object storage.
The following diagram shows the architecture and process flow of Altus Data Engineering: