Configuring High Availability for Llama
Llama High Availability (HA) uses an Active/Standby architecture, in which the active Llama is automatically elected using the ZooKeeper-based ActiveStandbyElector. The active Llama accepts RPC/Thrift connections and communicates with YARN. The standby Llama monitors the leader information in ZooKeeper, but doesn't accept RPC/Thrift connections.
Only one Llama should be active at a time, so that cluster resources are not partitioned. Llama uses ZooKeeper Access Control Lists (ACLs) to claim exclusive ownership of the cluster when transitioning to active, and checks this ownership periodically. If another Llama takes over, the first one detects the change within one such check period.
Reclaiming Cluster Resources
To claim resources from YARN, Llama spawns YARN applications and runs unmanaged ApplicationMasters. When a Llama goes down, the resources allocated to the YARN applications it spawned are not reclaimed until YARN times out those applications (the default timeout is 10 minutes). On Llama failure, a Llama reclaims these resources by killing any YARN applications spawned by this pair of Llamas.
Configure Llama HA by modifying the following configuration properties in /etc/llama/conf/llama-site.xml. There is no need for any additional daemons.
| Property | Description | Default | Recommended |
|---|---|---|---|
| llama.am.cluster.id | Cluster ID of the Llama pair, used to differentiate between different Llamas | llama | [cluster-specific] |
| llama.am.ha.enabled* | Whether to enable Llama HA | false | true |
| llama.am.ha.zk-quorum* | ZooKeeper quorum to use for leader election and fencing | (none) | [cluster-specific] |
| llama.am.ha.zk-base | Base znode for leader election and fencing data | /llama | [cluster-specific] |
| llama.am.ha.zk-timeout-ms | The session timeout, in milliseconds, for connections to ZooKeeper quorum | 10000 | 10000 |
| llama.am.ha.zk-acl | ACLs to control access to ZooKeeper | world:anyone:rwcda | [cluster-specific] |
| llama.am.ha.zk-auth | Authorization information to go with the ACLs | (none) | [cluster-acl-specific] |
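
As an illustration of how the properties above might appear in /etc/llama/conf/llama-site.xml, here is a minimal sketch. It assumes the file uses the Hadoop-style `<configuration>`/`<property>` layout and that the ZooKeeper quorum is given as a comma-separated list of host:port pairs. The host names and the cluster ID are hypothetical placeholders, not values from this guide. The ACL and auth properties are omitted because their values depend on the cluster's security setup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: Hadoop-style layout is assumed; replace placeholders with cluster-specific values. -->
<configuration>
  <!-- Hypothetical cluster ID identifying this Llama pair -->
  <property>
    <name>llama.am.cluster.id</name>
    <value>llama-cluster1</value>
  </property>
  <!-- Turn on HA (the default is false) -->
  <property>
    <name>llama.am.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- Hypothetical ZooKeeper hosts; the host:port list format is an assumption -->
  <property>
    <name>llama.am.ha.zk-quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <!-- Default base znode and session timeout, shown explicitly -->
  <property>
    <name>llama.am.ha.zk-base</name>
    <value>/llama</value>
  </property>
  <property>
    <name>llama.am.ha.zk-timeout-ms</name>
    <value>10000</value>
  </property>
</configuration>
```

Only llama.am.ha.enabled and llama.am.ha.zk-quorum are marked with an asterisk in the table; the other entries in the sketch restate defaults or show where a cluster-specific value would go.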