Introduction
Cloudera Live is a four-node Cloudera cluster in Amazon Web Services (AWS) (with the option to add other nodes running software our partners) that provides a fast and easy way to get started with Apache Hadoop. It includes a self-guided tutorials focused on data processing for first time users for several Hadoop components. The instances recommended in this tutorial are intended to provide a cost-effective glimpse into Cloudera and are not guaranteed for production.
The diagram to the left shows the deployment in AWS. The template will automatically deploy four nodes with Hadoop services on the manager and worker nodes.
Prerequisites
AWS Account
Cloudera Live on AWS will deploy this cluster in your own account. You must register for this account beforehand, and it must have VPC enabled (note: EC2-Classic does not have VPC enabled). Your cluster can be created any any of the following regions (although we recommend US East (N. Virginia)) for its lower instance prices and locality to the Cloudera Live datasets:
- US East (N. Virginia)
- US West (Oregon)
- US West (N. California)
- EU (Ireland)
- Asia Pacific (Tokyo)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
User Permissions and Limits
CloudFormation may fail to provision all the virtual hardware for the cluster due to either user permissions or limits on the account (e.g. the default limit on the number of VPCs is 5; if you already have 5, CloudFormation will not be able to create another for this cluster). If you encounter an error due to permissions, talk to your account administrator. If you encounter an error due to limits, you may be able to request an increase from Amazon, or you can delete other resources if they are not needed. The following resources will be created by CloudFormation:
- 4 - 5 EC2 instances (instance types may vary)
- 1 VPC (default limit of 5)
- 1 Internet Gateway (default limit of 4)
- 1 Subnet
- 1 Route
- 1 RouteTable
- 1 NetworkACL
- 1 SecurityGroup
EC2 Key Pair
When you are creating your cluster you will be prompted for a key pair to use with your instances. This key pair will be used to log in to the machines using SSH. If you do not already have a key pair you wish to use, you can create one by following Amazon’s documentation. You can find additional documentation on connecting to your AWS instances using SSH and this key here.
Access Code
You will need an access code from Cloudera’s website to use the Cloudera Live service. You can fill out the forms to request one here. Once you have received an access code, you can use CloudFormation to create your cluster.
Resources and Cost
Cloudera Live will automatically create and provision the following:
- 4 x m4.xlarge On-Demand instances, running Red Hat Enterprise Linux 6
- 4 x 65 GiB EBS volumes (one for additional storage on each instance)
- A VPC, with associated subnet, gateway, and other network resources
- A security group and network ACL, with the following traffic allowed:
- All outbound network traffic
- Ports 32768 - 61000: ephemeral ports for Linux
- Ports 49152 - 65535: ephemeral ports for Windows (Tableau only)
- Port 22: SSH access
- Port 7180: Cloudera Manager
- Port 7187: Cloudera Navigator
- Port 8888: Cloudera Hue
In the US East (N. Virginia) region, the estimated monthly cost of this cluster is approximately US$750. Zoomdata and Tableau will be deployed on a fifth m4.large or m4.xlarge instance if you selected one of those options. Users will also be able to choose from a different instance if desired.
Your account may have limits on some resources, like VPCs. Deployment via CloudFormation will fail if these limits are hit. The cluster will not automatically scale.
Procedure
CloudFormation
To use the recommended AWS region (N. Virginia), choose the desired deployment from the links below. It will take you to AWS CloudFormation prompt you to log in (if needed), and fill in the name and CloudFormation template URL for your cluster:
- Cloudera:
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=Cloudera-Live&templateURL=https:%2F%2Fs3.amazonaws.com%2Fcloudera-live-2.0%2Fcloudformation-cloudera-live.template - Cloudera + Zoomdata:
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=Cloudera-Live&templateURL=https:%2F%2Fs3.amazonaws.com%2Fcloudera-live-2.0%2Fcloudformation-cloudera-live-zoomdata.template - Cloudera + Tableau:
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=Cloudera-Live&templateURL=https:%2F%2Fs3.amazonaws.com%2Fcloudera-live-2.0%2Fcloudformation-cloudera-live-tableau.template
If you would like to use an AWS Region other than US East (N. Virginia), simply login to AWS, navigate to the desired region using the menu in the top right-hand corner, click Services → CloudFormation, and click Create Stack. Enter a unique name for the stack (e.g. Cloudera-Live) in the Stack section. In the Template section select “Specify an Amazon S3 template URL”, and enter the URL of the deployment you would like to use:
- Cloudera:
https://s3.amazonaws.com/cloudera-live-2.0/cloudformation-cloudera-live.template - Cloudera + Zoomdata:
https://s3.amazonaws.com/cloudera-live-2.0/cloudformation-cloudera-live-zoomdata.template - Cloudera + Tableau:
https://s3.amazonaws.com/cloudera-live-2.0/cloudformation-cloudera-live-tableau.template
- Cloudera:
Click Next on the bottom right to go to the next step.
The Specify Parameters page requires you to enter the access code (provided by Cloudera) and the name of an EC2 key-pair that has already been created. You may also select larger instance types or modify the instance tag field to meet your needs.
Click Next and you will be prompted for any tags you wish to apply to the resources in this stack. We recommend you name your instances so they are easy to identify once they are deployed.
Click Next to review the setup.
Once you have reviewed the setup, click Create on the bottom right. You should be returned to a list of your CloudFormation stacks. If you do not see the list or your newest stack is not in the list, reload the page after a few seconds.
Note: To check the progress of your deployment, simply click on your Stack Name and click on the Events tab to monitor the progress. Events are ordered with the earliest events at the bottom and the latest events at the top. If any events are in red, the earliest will usually be the most helpful error message.
Deployment Process
After creating your stack, you will be returned to a list of all your stacks. Within about 5 minutes, you should see the stack you just created get marked in green with the label CREATE_COMPLETE. This means that the bare AWS resources have been created. If this fails due to permissions, or a limit on your account, the stack will be marked with ROLLBACK_COMPLETE instead. If this happens, you can find a specific error message by consulting the Events tab of the stack as described above.
Within another 10 minutes, you should receive an email confirming that the Cloudera Live on AWS service is communicating with EC2 instances and configuring your cluster. This email will give you a link to a progress bar you can watch of the rest of the process.
Within about 45 minutes your cluster should be complete and you should receive a final email with any remaining credentials you may need, and links to resources on your cluster to walk you through basic tutorials.
If any of these steps appear to fail or the emails have not arrived some time after the estimated windows, we recommend you delete your cluster and contact us for
support by posting on the Cloudera Community Forum.
Stopping and Starting Instances
Most of the cost of a Cloudera Live cluster comes from running EC2 instances. You can stop your instances if you would like to avoid accruing charges for a time. However, this has some sife-effects. We recommend the following:
Your public IP addresses will almost certainly have changed, meaning the links in your email will no longer work. You can find your Manager Node by entering the following search query on your EC2 Instances page:
tag:aws:cloudformation:logical-id : ManagerNode
Clicking on the resulting instance will show you the current public IP. Your guidance page, with links to resources and the IPs of your other instances is on port 80. Hue is on port 8888, Cloudera Manager is on port 7180, Cloudera Navigator is on port 7187.
After your instances have started up, it is recommend that you log into Cloudera Manager and restart all services (depending on the order the the instances start up, there may be some warnings shown on the status page). To do this, click on the downward arrow to the right of “Cloudera Live” and select “Restart”. Also, click on the downward arrow to the right of “Cloudera Management Service” and select “Restart”.
- Some services (like HBase and Sqoop) are not started by default when your cluster is first deployed and are not needed for the tutorials. These services will start automatically when you restart the cluster, but it is not recommended to run all services on the minimum instance types. Instead, consider stopping these services (or any others that you do not plan to use), or using larger instance types.
Deleting Your Cluster
If you wish to delete your cluster,go to Services → CloudFormation. Select the stack you created (it will be named ‘Cloudera-Live’ unless you entered a different name), and go to Actions → Delete Stack. The stack consists of several virtual network resources, so simply deleting the EC2 instances will not fully clean up the stack.
Other Notes
We recommend you use the access code and CloudFormation template within 24 hours of receiving them to ensure that they are current.