Auto-Repair for Failed or Terminated Instances

Cloudera Director 2.5 and higher, running with Cloudera Manager 5.12 and higher, includes an auto-repair feature that allocates new cluster instances to replace failed or terminated instances. For clusters running on AWS, this includes Spot instances that Amazon terminated because the current Spot instance price rose above your bid price. Auto-repair checks that the number of instances in your cluster matches the number specified in the cluster template.
By default, auto-repair is not enabled. You can enable auto-repair in one of the following three ways:
  • During the installation process, while configuring settings for the cluster.
  • In the Modify Cluster drop-down for an existing cluster.
    Choosing Configure Auto-repair in the Modify Cluster drop-down opens a dialog for enabling or disabling auto-repair and for configuring the cooldown period, which is the time that elapses between Cloudera Director's attempts to allocate new cluster instances.
  • If you are launching a cluster with the bootstrap-remote CLI command and a configuration file, you can enable auto-repair by setting autoRepairEnabled to true in the administrationSettings section of the configuration file:
    administrationSettings {
       # If enabled, Director will attempt to automatically repair
       # clusters whose instances have been terminated in the cloud provider.
    
       # autoRepairEnabled: false
       # autoRepairCooldownPeriodInSeconds: 1800
    }
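As a minimal sketch of the enabled state, the fragment below uncomments both settings from the sample above and sets autoRepairEnabled to true. The cooldown value of 1800 seconds is simply the value that appears in the sample and is used here as an illustrative assumption; choose a value that suits your cluster.

```hocon
administrationSettings {
   # Turn auto-repair on for this cluster.
   autoRepairEnabled: true

   # Seconds between repair attempts (illustrative value, copied from the sample above).
   autoRepairCooldownPeriodInSeconds: 1800
}
```

A longer cooldown leaves more time between repair attempts for other cluster operations, while a shorter one replaces missing instances sooner.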
Keep in mind the following facts about the auto-repair feature:
  • Auto-repair is available only with Cloudera Director 2.5 and higher running with Cloudera Manager 5.12 and higher.
  • Auto-repair works only for instances that do not contain master roles.
  • Before stopping a cluster using the Elastic Block Store (EBS) start/stop feature, you must disable auto-repair if it is enabled.
  • Auto-repair is disabled by default, and can be enabled (1) when creating a cluster in the web UI, (2) on the web UI page for an existing cluster, or (3) in the configuration file when launching a cluster with bootstrap-remote.
  • Auto-repair is only available for clusters managed with Cloudera Director server; the feature is not available for clusters running in the standalone client version of Cloudera Director.
  • Auto-repair attempts to allocate the same type of instance that is missing from the cluster. A missing on-demand instance is replaced by another on-demand instance with identical specifications, and a missing Spot instance is replaced by another Spot instance with the same bid price.
  • The cooldown period limits how often Cloudera Director attempts a repair. Even if auto-repair is enabled and your cluster has fewer instances than the cluster template specifies, Cloudera Director does not attempt repairs continuously, so there are regular intervals in which you can interact with the cluster to perform other tasks. The cooldown period is configurable in the web UI or the configuration file.