Cloudera Enterprise in the Cloud
Increasingly, customers are evaluating cloud environments to solve their big data challenges. More companies are moving to the cloud as they realize that cloud computing increases efficiency and reduces cost. Customers who use Hadoop in the cloud find that a public cloud infrastructure allows them to quickly provision clusters at a lower cost. They can expand or shrink clusters to only use and pay for the resources they need to meet demand.
When customers make the move to the cloud, they might decide to replicate their applications in the cloud to avoid the expense of re-architecting the applications for the cloud. However, simply running workloads in the cloud the same way that they were run on premises may not take full advantage of native cloud features and can turn out to be an expensive solution.
In most cases, Cloudera recommends architecting applications to use cloud resources efficiently and take advantage of cloud features such as transient and elastic clusters, optimized instance types, and object storage. Workloads that run periodically can use transient clusters that are automatically terminated when the workload completes. By combining transient clusters with Cloudera’s pay-per-use capabilities, companies can run their workloads more efficiently and cost-effectively. Object storage in the cloud offers scalability and flexibility in data storage and management.
Cloudera product offerings take advantage of native cloud infrastructure. Cloudera compute engines, including Spark, Hive, and Impala, work directly with object storage. Cloudera Altus Director simplifies cluster lifecycle management while exploiting the inherent transient and elastic capabilities of the cloud. Altus Director can deploy CDH and its components in the leading public cloud environments: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Migrating workloads from on premises to the cloud and adapting them to take full advantage of cloud infrastructure tends to involve multiple steps and a significant amount of time. Moreover, some applications may benefit from remaining on premises indefinitely. As a result, companies often decide to have hybrid operations and run workloads both on premises and in the cloud throughout the transition period. Cloudera products and CDH components are well positioned to support hybrid operations because they provide a consistent experience on premises and in the cloud. Applications that run more efficiently within the company data center can stay on premises, and applications that can take full advantage of the elasticity and low-cost storage of the cloud can run in any of the public clouds.
Cloudera provides the following offerings for cloud deployments:
- Data Engineering. Provides CDH components for fast and cost-effective storing and processing of data, including Spark and Navigator.
- Data Warehouse. Provides CDH components for discovering, analyzing, and understanding data, including Impala, Hive-on-Spark, and Navigator.
- Operational Database. Provides CDH components for building data-driven applications, including HBase, Spark Streaming, and Navigator.
- Workload Experience Manager (XM). A hosted application that provides insights to help you gain an in-depth understanding of the workloads you send to clusters that are managed by Cloudera Manager on premises or in the cloud.
Data Engineering
Data engineering and ETL workloads typically do not run continuously, which makes them good candidates for running on transient CDH clusters in the cloud. This type of workload can have variable computing requirements and can benefit from the elasticity of clusters in the cloud.
Cloudera Data Engineering allows you to take advantage of elastic clusters in the cloud to quickly process large volumes of data of all types at a lower cost. Spin up clusters for a short-running data engineering job and terminate the cluster when the job completes.
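The cost argument for transient clusters can be illustrated with simple arithmetic. The following is a minimal sketch, not a pricing tool: the node count, the hourly rate, the startup overhead, and the names ClusterPlan, persistent_monthly_cost, and transient_monthly_cost are all hypothetical values and helpers chosen for illustration, and real cloud pricing depends on the provider, instance type, and billing granularity.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


@dataclass
class ClusterPlan:
    nodes: int
    hourly_rate_per_node: float  # hypothetical price in USD, not a real quote


def persistent_monthly_cost(plan: ClusterPlan) -> float:
    """Cost of keeping the cluster running for the whole month."""
    return plan.nodes * plan.hourly_rate_per_node * HOURS_PER_MONTH


def transient_monthly_cost(plan: ClusterPlan, runs_per_month: int,
                           hours_per_run: float,
                           startup_hours: float = 0.25) -> float:
    """Cost of starting a cluster for each run and terminating it afterward.

    startup_hours is an assumed provisioning overhead that is billed per run.
    """
    billed_hours = runs_per_month * (hours_per_run + startup_hours)
    return plan.nodes * plan.hourly_rate_per_node * billed_hours


# Hypothetical workload: a 2-hour job that runs once a day on 10 nodes.
plan = ClusterPlan(nodes=10, hourly_rate_per_node=0.50)
persistent = persistent_monthly_cost(plan)
transient = transient_monthly_cost(plan, runs_per_month=30, hours_per_run=2.0)

print(f"persistent: ${persistent:,.2f}")
print(f"transient:  ${transient:,.2f}")
```

Under these assumptions the transient cluster is billed for about 68 hours a month instead of 730. The gap narrows as jobs run longer or more often, which is why frequently used analytics clusters are usually kept persistent.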
Typically, data processed in data engineering workloads starts in a raw format in an object store such as Amazon S3. Spark and Hive are the most popular compute engines for data engineering and batch ETL. Spark can process streaming data and analyze the data in real time. Hive is designed for ad hoc querying and analysis of large volumes of data. For batch processing, you can use Hive-on-Spark, which supports cloud-native data access and enables you to take full advantage of a shared data layer for ETL and BI analytics workloads in the cloud.
For information about best practices for running data engineering workloads on AWS, see Data Engineering on AWS: Best Practices.
For information about configuring Hive ETL jobs to use Amazon S3, see Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem.
Altus for Data Engineers
Cloudera offers a managed cloud service for data engineering and ETL workloads.
Cloudera Altus is a cloud service platform that enables you to create clusters and run jobs specifically for data science and engineering workloads, including ETL and batch processing jobs. It is designed to provision clusters quickly and to make it easy for you to build and run your data engineering workloads in the cloud.
Altus offers multiple distributed processing engine options, including Hive, Spark, and MapReduce2 (MR2), for different data engineering workloads. The processing engines allow you to manage workloads such as ETL, machine learning, and large-scale data processing.
For information about Cloudera Altus, see the Altus documentation.
Data Warehouse
Cloudera Data Warehouse offers high-performance SQL analytics in the cloud. You can use Impala to analyze data stored in shared data platforms such as HDFS and Kudu, and in cloud-native object storage such as Amazon S3.
Depending on the demand for data availability, you can run business analytics jobs on transient or persistent clusters. Batch ETL jobs are better suited to transient clusters that terminate when the batch jobs complete. Data analytics jobs that are in high demand run frequently and require persistent clusters. Both types of clusters allow you to grow and shrink resources to support peak and off-peak usage and still provide consistent query performance.
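Growing and shrinking a cluster between peak and off-peak usage comes down to a sizing rule. The following is a rough sketch under stated assumptions: the function nodes_needed, the per-node query capacity, and the minimum cluster size are hypothetical and are not part of any Cloudera product or API. An actual deployment would size clusters from measured query performance.

```python
import math


def nodes_needed(concurrent_queries: int, queries_per_node: int,
                 min_nodes: int = 3) -> int:
    """Return the worker node count for a given query load.

    queries_per_node is an assumed capacity at which one node still
    delivers acceptable query latency. min_nodes is an assumed floor
    that keeps the cluster usable during quiet periods.
    """
    if queries_per_node <= 0:
        raise ValueError("queries_per_node must be positive")
    required = math.ceil(concurrent_queries / queries_per_node)
    return max(min_nodes, required)


# Hypothetical load: 120 concurrent queries at peak, 10 off-peak,
# with each node handling 8 queries at the target latency.
peak_nodes = nodes_needed(120, queries_per_node=8)
off_peak_nodes = nodes_needed(10, queries_per_node=8)

print(f"peak: {peak_nodes} nodes, off-peak: {off_peak_nodes} nodes")
```

Rounding up keeps per-node load at or below the assumed capacity, which is one simple way to hold query performance steady while the cluster size changes.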
Data warehouse workloads typically have multiple users accessing the same clusters. Multi-tenant clusters have security concerns that must be addressed. Persistent clusters for data analytics jobs require a robust security framework and can use Kerberos and Sentry for user authentication and authorization.
Operational Database
Cloudera Operational Database provides a flexible operational database platform, capable of batch and stream processing. It supports relational SQL and NoSQL storage layers and has the ability to store an unlimited amount of structured and unstructured data.
Operational database jobs use HBase to perform fast searches on very large datasets. They can also use Spark Streaming to feed streaming data into HBase. Typically, operational database jobs run on highly available, long-running clusters backed by local storage with HDFS.
Deploying Cloudera Operational Database in the cloud provides the benefits of low cost and convenience associated with rapid provisioning and decommissioning of clusters. When you move operational database jobs to the cloud, you can quickly provision new clusters without incurring the cost of long-term on-premises infrastructure. With Altus Director, you can easily set up and maintain a secure test and development environment on demand and reduce administrative overhead. Snapshots for backup and disaster recovery can be moved to cloud storage such as S3, which provides geographic dispersion, reliability, and lower cost.
To manage unexpected and temporary surges in processing volume, such as a high-volume data ingestion or an exceptionally large batch job, you can start a large number of clusters to handle the demand. When the high-volume processing completes, you can terminate the clusters and go back to regular operation.
Cloud deployment also adds flexibility and portability between the on-premises data center and multiple cloud vendors.
Workload XM
Workload XM is a tool that provides insights to help you gain an in-depth understanding of the workloads you send to clusters managed by Cloudera Manager on premises or in the cloud. In addition, it provides information that you can use to troubleshoot failed jobs and optimize slow jobs that run on those clusters. After a job ends, Telemetry Publisher, a role in the Cloudera Manager Management Service, sends information about the job execution to Workload XM.
Workload XM uses the information to display metrics about the performance of a job. Additionally, Workload XM compares the current run of a job to previous runs of the same job by creating baselines. You can use the knowledge gained from this information to identify and address abnormal or degraded performance or potential performance improvements.
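The idea of comparing a job run against a baseline of earlier runs can be sketched in a few lines. This is an illustration only and does not describe how Workload XM computes its baselines: the functions build_baseline and is_abnormal, the use of mean and standard deviation, and the threshold of two standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def build_baseline(previous_durations):
    """Summarize earlier runs of the same job as (mean, standard deviation)."""
    if len(previous_durations) < 2:
        raise ValueError("need at least two previous runs to build a baseline")
    return mean(previous_durations), stdev(previous_durations)


def is_abnormal(current_duration, baseline, threshold=2.0):
    """Flag a run whose duration deviates from the baseline mean by more
    than `threshold` standard deviations (an assumed, simple rule)."""
    base_mean, base_stdev = baseline
    if base_stdev == 0:
        # All earlier runs took the same time; any difference stands out.
        return current_duration != base_mean
    return abs(current_duration - base_mean) / base_stdev > threshold


# Hypothetical durations, in seconds, of five earlier runs of one job.
previous_runs = [300, 310, 295, 305, 290]
baseline = build_baseline(previous_runs)

typical_run_flagged = is_abnormal(302, baseline)
slow_run_flagged = is_abnormal(450, baseline)
print(typical_run_flagged, slow_run_flagged)
```

A run of 302 seconds falls well within the spread of earlier runs, while a run of 450 seconds is far outside it and would be a candidate for troubleshooting.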
For more information about setting up and using Workload XM, see the Workload XM documentation.
You can download Cloudera Enterprise from the Cloudera downloads page.