
Cloudera Manager Primer

Cloudera Manager is the industry’s first and most sophisticated management application for Apache Hadoop. Cloudera Manager sets the standard for enterprise deployment by delivering granular visibility into and control over every part of the Hadoop cluster — empowering operators to improve performance, enhance quality of service, increase compliance and reduce administrative costs. With Cloudera Manager, you can easily deploy and centrally operate the complete Hadoop stack. The application automates the installation process, reducing deployment time from weeks to minutes; gives you a cluster-wide, real-time view of nodes and services running; provides a single, central console to enact configuration changes across your cluster; and incorporates a full range of reporting and diagnostic tools to help you optimize performance and utilization.

This primer introduces the basic concepts, structure, and functions of Cloudera Manager.

Terminology

To use Cloudera Manager effectively, you should first understand its terminology. The terms and their relationships are defined below:

Some of the terms, such as cluster and service, will be used without further explanation. Others, such as gateway and role group, are expanded upon in later sections.

  • deployment - A configuration of Cloudera Manager and all the services and hosts it manages.
  • cluster - A logical entity that contains a set of hosts, a single version of CDH installed on the hosts, and the service and role instances running on the hosts. A host may only belong to one cluster.
  • host - A logical or physical machine that runs role instances.
  • rack - A physical entity that contains a set of hosts typically served by the same switch.
  • service - A category of functionality within the Hadoop ecosystem. Sometimes referred to as a service type. For example: HDFS, MapReduce.
  • service instance - An instance of a service running on a cluster. A service instance spans many role instances. For example: HDFS-1 and MapReduce-1.
  • role - A category of functionality within a service. For example, the HDFS service has the following roles: NameNode, SecondaryNameNode, DataNode, and Balancer. Sometimes referred to as a role type.
  • role instance - An instance of a role, assigned to a host. It typically maps to a Unix process. For example: NameNode-h11 and DataNode-h11.
  • role group - A set of configuration properties for a set of role instances of the same role type.
  • host template - A set of role groups. When a template is applied to a host, a role instance from each role group is created and assigned to that host.
  • gateway - A role that designates a host that should receive a client configuration for a service when the host does not have any roles for that service running on it.

It's also helpful to always remember this basic principle: Whereas in traditional environments multiple services run on one host, in Hadoop — and other distributed systems — a service runs on many hosts.

  Note: In programming languages, the term "string" may indicate either a type (java.lang.String) or an instance of that type ("hi there"). A common point of confusion is the type-instance nature of "service" and "role". Cloudera Manager and this primer sometimes use the same term for type and instance.

For example, the sentence before this note uses the words "services" and "service", but is actually referring to service instances. Likewise, the Cloudera Manager Admin Console "Services" screen displays service instances. When it's necessary to distinguish between types and instances, the word "type" is appended to explicitly indicate a type and the word "instance" is appended to explicitly indicate an instance.

Architecture

The heart of Cloudera Manager is the Cloudera Manager Server. The Server hosts the Admin Console Web Server and the application logic. It is responsible for installing software; configuring, starting, and stopping services; and managing the cluster on which the services run.

The Cloudera Manager Server works with several other components:
  • Agent - installed on every host. It is responsible for starting and stopping processes, unpacking configurations, triggering installations, and monitoring the host.
  • Database - stores configuration and monitoring information. There are typically multiple logical databases running across one or more database servers. For example, the Cloudera Manager Server and the monitoring daemons use different logical databases.
  • Cloudera Repository - repository of software for distribution by Cloudera Manager.
  • Clients - provide an interface for interacting with the server:
    • Admin Console - Web-based UI with which administrators manage clusters and Cloudera Manager.
    • API - API with which developers create custom Cloudera Manager applications.
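
For example, the following minimal Python sketch lists the clusters a Cloudera Manager Server manages through the REST API. The host name, credentials, and API version shown are assumptions; the server reports the API versions it supports at /api/version.

  # Minimal sketch: list managed clusters via the Cloudera Manager REST API.
  # The host, credentials, and API version below are assumptions; adjust them
  # for your deployment (the server lists supported versions at /api/version).
  import base64
  import json
  import urllib.request

  CM_URL = "http://cm-host:7180/api/v6/clusters"
  token = base64.b64encode(b"admin:admin").decode()   # default credentials

  req = urllib.request.Request(CM_URL, headers={"Authorization": "Basic " + token})
  with urllib.request.urlopen(req) as resp:
      clusters = json.load(resp)

  for cluster in clusters.get("items", []):
      print(cluster["name"], cluster.get("version", ""))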

Heartbeating

Heartbeats are a primary communication mechanism in Cloudera Manager. By default the Agents send heartbeats every 15 seconds to the Cloudera Manager Server. However, to reduce user latency the frequency is increased when state is changing.

During the heartbeat exchange, the Agent notifies the Cloudera Manager Server of the actions it is performing. In turn, the Cloudera Manager Server responds with the actions the Agent should be performing. Both the Agent and the Cloudera Manager Server end up doing some reconciliation. For example, if you start a service via the UI, the Agent attempts to start the relevant processes; if a process fails to start, the server marks the start command as having failed.
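
The following sketch is purely illustrative (it is not the Agent's actual code), but it captures the reconciliation idea: the Agent reports what is running, the Server replies with what should be running, and the difference determines what to start and stop.

  # Purely illustrative sketch of heartbeat reconciliation; not the actual
  # Agent implementation. Processes are reduced to bare names here; the real
  # exchange carries much richer state.
  def reconcile(running, desired):
      """Given what is running and what the Server wants, decide what to change."""
      to_start = desired - running   # wanted by the Server but not yet running
      to_stop = running - desired    # running but no longer wanted
      return to_start, to_stop

  # Example: the Server wants a DataNode and a TaskTracker; only the DataNode runs.
  to_start, to_stop = reconcile({"DataNode"}, {"DataNode", "TaskTracker"})
  print(to_start, to_stop)   # -> {'TaskTracker'} set()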

State Management

The Cloudera Manager Server maintains the state of the cluster. This state can be divided into two categories: "model" and "runtime", both of which are stored in the Cloudera Manager Server database.

Cloudera Manager models the Hadoop stack: its roles, configurations, and inter-dependencies. Model state captures what is supposed to run where, and with what configurations. That a cluster contains 17 hosts, each of which is supposed to run a DataNode, is model state. You interact with the model through the Cloudera Manager configuration screens (and operations like "Add Service").

Runtime state is what processes are running where, and what commands (for example, rebalance HDFS or execute a Backup/Disaster Recovery schedule or rolling restart or stop) are currently being executed. The runtime state includes the exact configuration files needed to run a process. In fact, when you press "Start" in Cloudera Manager Admin Console, the server gathers up all the configuration for the relevant services and roles, validates it, generates the configuration files, and stores them in the database.

When you update a configuration (for example, the Hue Server web port), you've updated the model state. However, if Hue is running while you do this, it's still using the old port. When this kind of mismatch occurs, the role is marked as having an "outdated configuration". To resynchronize, you restart the role (which triggers the configuration re-generation and process restart).

While Cloudera Manager models all of the reasonable configurations, inevitably there are some corner cases that require special handling. To let you work around, for example, a bug (or, perhaps, explore unsupported options), Cloudera Manager supports a "safety valve" mechanism that lets you plug in directly to the configuration files. (The analogy stems from the idea that a safety valve "releases pressure" if the model doesn't hold up to a real-world use case.)
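
For XML configuration files such as hdfs-site.xml, a safety valve accepts raw property snippets that are appended to the generated file. A hypothetical entry might look like the following (the property name and value are illustrative only):

  <property>
    <!-- illustrative property name only -->
    <name>dfs.namenode.hypothetical.option</name>
    <value>true</value>
  </property>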

Configuration Management

Cloudera Manager defines configuration at several levels:

  • The service level may define configurations that apply to the entire service instance, such as an HDFS service's default replication factor (dfs.replication).
  • The role group level may define configurations that apply to the member roles, such as the DataNodes' handler count (dfs.datanode.handler.count). This can be set differently for different groups of DataNodes. For example, DataNodes running on more capable hardware may have more handlers.
  • The role instance level may override configurations that it inherits from its role group. This should be used sparingly, because it easily leads to configuration divergence within the role group. One example usage is to temporarily enable debug logging in a specific role instance to troubleshoot an issue.
  • Hosts have configurations related to monitoring, software management, and resource management.
  • Cloudera Manager itself has configurations related to its own administrative operations.

Role Groups

It is possible to set configuration at the service instance (for example, HDFS-1), role type (for example, all DataNodes), or role instance (for example, the DataNode on host17) level. An individual role instance inherits the configurations set at the service and role type levels; configurations made at the role instance level override those it inherits. While this approach offers flexibility when configuring clusters, it is tedious to configure subsets of roles in the same way.

Cloudera Manager supports role groups, a mechanism for assigning configurations to a group. The members of those groups then inherit those configurations. For example, in a cluster with heterogeneous hardware, a DataNode role group can be created for each host type and the DataNodes running on those hosts can be assigned to their corresponding role group. That makes it possible to set the configuration for all the DataNodes running on the same hardware by modifying the configuration of one role group.

In addition to making it easy to manage the configuration of subsets of roles, role groups also make it possible to maintain different configurations for experimentation or managing shared clusters for different users and/or workloads.
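
As a sketch of what this looks like through the API, the request below sets the handler count on a hypothetical "high-memory" DataNode role group. The cluster, service, and group names, the internal property key, and the API version are all assumptions; consult the API documentation for your release.

  # Sketch: update one property on a DataNode role group via the REST API.
  # All names, the property key, and the API version are assumptions.
  import base64
  import json
  import urllib.request

  URL = ("http://cm-host:7180/api/v6/clusters/Cluster1/services/hdfs1"
         "/roleConfigGroups/hdfs1-DATANODE-highmem/config")
  token = base64.b64encode(b"admin:admin").decode()
  body = json.dumps(
      {"items": [{"name": "dfs_datanode_handler_count", "value": "10"}]}
  ).encode()

  req = urllib.request.Request(URL, data=body, method="PUT",
                               headers={"Authorization": "Basic " + token,
                                        "Content-Type": "application/json"})
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp))   # the server echoes the updated configuration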

Host Templates

In typical environments, sets of hosts have the same hardware and the same set of services running on them. A host template defines a set of role groups (at most one of each type) in a cluster and provides two main benefits:
  • Adding new hosts to clusters easily - multiple hosts can have roles from different services created, configured, and started in a single operation.
  • Altering the configuration of roles from different services on a set of hosts easily - which is useful for quickly switching the configuration of an entire cluster to accommodate different workloads or users.

Server and Client Configuration

Administrators are sometimes surprised that modifying /etc/hadoop/conf and then restarting HDFS has no effect. That is because service instances started by Cloudera Manager do not read configurations from the default locations. To use HDFS as an example, when not managed by Cloudera Manager, there would usually be one HDFS configuration per host, located at /etc/hadoop/conf/hdfs-site.xml. Server-side daemons and clients running on the same host would all use that same configuration.

Cloudera Manager distinguishes between server and client configuration. In the case of HDFS, the file /etc/hadoop/conf/hdfs-site.xml contains only configuration relevant to an HDFS client. That is, by default, if you run a program that needs to communicate with Hadoop, it will get the addresses of the NameNode and JobTracker, and other important configurations, from that directory. A similar approach is taken for /etc/hbase/conf and /etc/hive/conf.
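
As a quick way to inspect the deployed client configuration, the short sketch below reads core-site.xml with Python's standard XML parser and prints the default file system address (fs.defaultFS, or fs.default.name on older CDH versions):

  # Sketch: print the default file system from the deployed client configuration.
  import xml.etree.ElementTree as ET

  def read_conf(path):
      """Parse a Hadoop-style XML configuration file into a dict."""
      root = ET.parse(path).getroot()
      return {p.findtext("name"): p.findtext("value")
              for p in root.findall("property")}

  conf = read_conf("/etc/hadoop/conf/core-site.xml")
  print(conf.get("fs.defaultFS") or conf.get("fs.default.name"))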

The HDFS server-side daemons (for example, NameNode and DataNode) obtain their configurations from a private per-process directory, under /var/run/cloudera-scm-agent/process/unique-process-name. Giving each process its own private execution and configuration environment allows us to control each process independently, which is crucial for some of the more esoteric configuration scenarios that show up. Here are the contents of an example 879-hdfs-NAMENODE process directory:

$ tree -a /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
  /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/
  ├── cloudera_manager_agent_fencer.py
  ├── cloudera_manager_agent_fencer_secret_key.txt
  ├── cloudera-monitor.properties
  ├── core-site.xml
  ├── dfs_hosts_allow.txt
  ├── dfs_hosts_exclude.txt
  ├── event-filter-rules.json
  ├── hadoop-metrics2.properties
  ├── hdfs.keytab
  ├── hdfs-site.xml
  ├── log4j.properties
  ├── logs
  │   ├── stderr.log
  │   └── stdout.log
  ├── topology.map
  └── topology.py

There are several advantages to distinguishing between server and client configuration:

  • Sensitive information in the server-side configuration, such as the password for the Hive metastore RDBMS, is not exposed to the clients.
  • A service that depends on another service may deploy with customized configuration. For example, to get good HDFS read performance, Cloudera Impala needs a specialized version of the HDFS client configuration, which may be harmful to a generic client. This is achieved by separating the HDFS configuration for the Impala daemons (stored in the per-process directory mentioned above) from that of the generic client (/etc/hadoop/conf).
  • Client configuration files are much smaller and more readable. This also avoids confusing non-administrator Hadoop users with irrelevant server-side properties.

Deploying Client Configurations and Gateways

A client configuration is a ZIP file that contains the relevant configuration files with the settings for a service. Each ZIP file contains the set of configuration files needed by the service. For example, the MapReduce client configuration ZIP file contains copies of core-site.xml, hadoop-env.sh, hdfs-site.xml, log4j.properties, and mapred-site.xml. Cloudera Manager supports a "Client Configuration URLs" action to distribute the client configuration file to users outside the cluster.
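
The same ZIP file is also retrievable programmatically. The sketch below downloads and unpacks a service's client configuration; the host, credentials, service name, and API version are assumptions.

  # Sketch: download and unpack a service's client configuration ZIP.
  # Host, credentials, names, and API version are assumptions.
  import base64
  import io
  import urllib.request
  import zipfile

  URL = "http://cm-host:7180/api/v6/clusters/Cluster1/services/mapreduce1/clientConfig"
  token = base64.b64encode(b"admin:admin").decode()

  req = urllib.request.Request(URL, headers={"Authorization": "Basic " + token})
  with urllib.request.urlopen(req) as resp:
      archive = zipfile.ZipFile(io.BytesIO(resp.read()))

  archive.extractall("mapreduce-client-conf")
  print(archive.namelist())   # e.g. mapred-site.xml, core-site.xml, ...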

Cloudera Manager can deploy client configurations within the cluster; each applicable service has a "Deploy Client Configuration" action. This action does not necessarily deploy the client configuration to the entire cluster; it only deploys it to the hosts that the service has been assigned to. For example, suppose a cluster has 10 hosts, and a MapReduce service is running on hosts 1-9. When you use Cloudera Manager to deploy the MapReduce client configuration, host 10 will not get a client configuration, because no MapReduce role is assigned to it. This design is intentional, to avoid deploying conflicting client configurations from multiple services.

To deploy a client configuration to a host that does not have a role assigned to it you use a gateway. A gateway is a marker to convey that a service includes a particular host. Unlike all other roles it has no associated process. In the preceding example, to deploy the MapReduce client configuration to host 10, you assign a MapReduce gateway role to that host.

Process Management

In a non-Cloudera Manager managed cluster, you most likely start a role instance using an init script, for example, service hadoop-hdfs-datanode start. Cloudera Manager does not use init scripts for the daemons it manages; in a Cloudera Manager managed cluster, starting and stopping services using init scripts will not work.

In a Cloudera Manager managed cluster, you can only start or stop services via Cloudera Manager. Cloudera Manager uses an open source supervisor called supervisord that takes care of redirecting log files, notifying of process failure, setting the effective user ID of the calling process to the right user, and so forth. Cloudera Manager supports automatically restarting a crashed process. It will also flag a role instance with a bad state if it crashes repeatedly right after startup.

It is worth noting that stopping Cloudera Manager and the Cloudera Manager Agents will not bring down your cluster; any running instances will keep running.

One of the Agent's main responsibilities is to start and stop processes. When the Agent detects a new process from the heartbeat, the Agent creates a directory for it in /var/run/cloudera-scm-agent and unpacks the configuration. These actions reflect an important point: a Cloudera Manager process never travels alone. In other words, a process is more than just the arguments to exec()—it also includes configuration files, directories that need to be created, and other information.

The Agent itself is started by init.d at start-up. It, in turn, contacts the server and figures out what processes should be running. The Agent is monitored as part of Cloudera Manager's host monitoring: if the Agent stops heartbeating, the host will be marked as having bad health.

Software Distribution Management

A major function of Cloudera Manager is to distribute and activate software in your cluster. Cloudera Manager supports two software distribution formats: packages and parcels.

A package is a distribution format that contains compiled code and meta-information such as a package description, version, and dependencies. Package management systems evaluate this meta-information to allow package searches, perform upgrades to a newer version, and ensure that all dependencies of a package are fulfilled. Cloudera Manager uses the native "system package manager" for each supported OS.

A parcel is a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager. There are a few notable differences between parcels and packages:

  • Parcels are self-contained and installed in a versioned directory. This means that multiple versions of a given parcel can be installed "side-by-side". You can then designate one of these installed versions as the "active" one. With traditional packages, only one package can be installed at a time so there's no distinction between what's "installed" and what's "active".
  • Parcels can be installed at any location in the filesystem.

Advantages of Parcels

As a consequence of their unique properties, parcels offer a number of advantages over packages:

  • CDH is distributed as a single object - In contrast to having a separate package for each part of CDH, when using parcels there is just a single object to install. This is especially useful when managing a cluster that isn't connected to the Internet.
  • Internal consistency - All CDH components are matched so there isn't a danger of different parts coming from different versions of CDH.
  • Installation outside of /usr - In some environments, Hadoop administrators do not have privileges to install system packages. In the past, these administrators had to fall back to CDH tarballs, which deprived them of a lot of infrastructure that packages provide. With parcels, administrators can install to /opt or anywhere else without having to step through all the additional manual steps of regular tarballs.
      Note: With parcel software distribution, the path to the CDH libraries is /opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib.
  • Installation of CDH without sudo - Parcel installation is handled by the Cloudera Manager Agent running as root so it's possible to install CDH without needing sudo.
  • Decouples distribution from activation - Due to side-by-side install capabilities, it is possible to stage a new version of CDH across the cluster in advance of switching over to it. This allows the longest running part of an upgrade to be done ahead of time without affecting cluster operations, consequently reducing the downtime associated with upgrade.
  • Rolling upgrades - These are only possible with parcels, due to their side-by-side nature. Traditional packages require shutting down the old process, upgrading the package, and then starting the new process. This can be hard to recover from in the event of errors and requires extensive integration with the package management system to function seamlessly. With the new version staged side-by-side, switching to a new minor version is simply a matter of changing which version of CDH is used when restarting each process. It then becomes practical to do upgrades with rolling restarts, where service roles are restarted in the right order to switch over to the new version with minimal service interruption. Note that major version upgrades (for example, CDH3 to CDH4) require full service restarts due to the substantial changes between the versions.
  • Easy downgrades - Reverting to an older minor version can be as simple as upgrading. Note that some CDH components may require explicit additional steps due to schema upgrades.

Parcel Life Cycle

The life cycle of a parcel consists of the following phases:

  • Download - When a parcel is available in the Cloudera repository, Cloudera Manager can download it to the Cloudera Manager Server machine.
  • Distribution - Once the Cloudera Manager Server has downloaded the parcel, it can distribute it to all the hosts in the cluster. This process can be tuned in terms of how many hosts receive the parcel at the same time and the total aggregate bandwidth used for the process.
  • Activation - Once a parcel is distributed, you can activate it. Once activated, the parcel will be used for any processes that are subsequently started or restarted.
  • Deactivation - Similarly, a parcel can be deactivated (and will automatically be deactivated if another one is activated).
  • Removal - This is the reverse of distribution. A parcel that has been deactivated and is not serving any current processes is eligible for removal from the hosts in the cluster.
  • Deletion - Once removed from the cluster, the parcel can be deleted from the Cloudera Manager Server, which completes the life cycle of the parcel.
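
The phases above map onto API commands. The sketch below walks one parcel through download, distribution, and activation; the parcel coordinates and API version are assumptions, and because each command returns immediately, a real script would poll the parcel's state before issuing the next command.

  # Sketch: drive a parcel through its life cycle via the REST API.
  # Host, credentials, parcel coordinates, and API version are assumptions.
  # Each command is asynchronous; poll the parcel's state between steps.
  import base64
  import urllib.request

  BASE = ("http://cm-host:7180/api/v6/clusters/Cluster1"
          "/parcels/products/CDH/versions/4.8.0-1.cdh4.8.0.p0.95")
  token = base64.b64encode(b"admin:admin").decode()

  def command(name):
      req = urllib.request.Request(BASE + "/commands/" + name, data=b"",
                                   headers={"Authorization": "Basic " + token})
      return urllib.request.urlopen(req).read()

  command("startDownload")      # Download the parcel to the Server
  command("startDistribution")  # Distribute it to every host in the cluster
  command("activate")           # Use it for subsequently (re)started processes
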
For example, a snapshot of the Parcels page might show:
  • One activated CDH parcel
  • One SOLR parcel distributed and ready to activate
  • One Impala parcel being downloaded
  • One CDH parcel being distributed

Cloudera Manager Capabilities Enabled by Parcels

In addition to the advantages of parcels described in the preceding section, parcels enable capabilities delivered by Cloudera Manager:

  • Upgrade management - Cloudera Manager can fully manage all the steps involved in a CDH version upgrade. In contrast, with packages, Cloudera Manager can only help with initial installation.
  • Distributing additional components - Parcels are not limited to CDH. Cloudera Impala, Cloudera Search, and LZO are also available.
  • Compatibility with other distribution tools - If there are specific reasons to use other tools for download and/or distribution, you can do so, and Cloudera Manager will work alongside your other tools. For example, you can handle distribution with something like Puppet. Or, if you want to download the parcel to Cloudera Manager Server manually (perhaps because your cluster has no Internet connectivity) and then have Cloudera Manager distribute the parcel to the cluster, you can do that too.

Management Services

Cloudera Manager relies on a set of helper management services: Service Monitor, Activity Monitor, Host Monitor, Event Server, Reports Manager, and Alert Publisher. Cloudera Manager manages each separately (as opposed to rolling them all into the Cloudera Manager Server) for scalability (for example, on large deployments it's useful to put the monitors on their own hosts) and isolation.

Metric Collection and Display

To perform its monitoring, Cloudera Manager collects metrics. A metric is a numeric value, associated with a name (for example, "CPU seconds"), an entity it applies to ("host17"), and a timestamp. Most metric collection is performed by the Agent. The Agent communicates with a supervised process, requests the metrics, and forwards them to the Service Monitor. In most cases, this is done once per minute.

A few special metrics are collected by the Service Monitor itself. For example, the Service Monitor hosts an HDFS canary, which tries to write, read, and delete a file from HDFS at regular intervals and measures whether it succeeded and how long it took. Once metrics are received, they're aggregated and stored.

Using the Charts page in Cloudera Manager, you can query and explore the metrics being collected. Charts display time series, which are streams of metric data points for a specific entity. Each metric data point contains a timestamp and the value of that metric at that timestamp.

Some metrics (for example, total_cpu_seconds) are counters, and the appropriate way to query them is to take their rate over time, which is why many metric queries contain the dt0 function. For example, dt0(total_cpu_seconds). (The dt0 syntax is intended to remind you of derivatives. The 0 indicates that the rate of a monotonically increasing counter should never be negative.)
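
The same queries can be issued programmatically through the timeseries endpoint (available in newer API versions; check /api/version). A sketch, with a hypothetical hostname:

  # Sketch: run a tsquery against the timeseries endpoint.
  # Host, credentials, API version, and entity name are assumptions;
  # the response structure varies by API version, so it is printed raw.
  import base64
  import json
  import urllib.parse
  import urllib.request

  query = 'select dt0(total_cpu_seconds) where hostname = "host17"'
  URL = "http://cm-host:7180/api/v6/timeseries?query=" + urllib.parse.quote(query)
  token = base64.b64encode(b"admin:admin").decode()

  req = urllib.request.Request(URL, headers={"Authorization": "Basic " + token})
  with urllib.request.urlopen(req) as resp:
      print(json.dumps(json.load(resp), indent=2))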

Health Tests

The Service Monitor continually evaluates "health tests" for every entity in the system. For example, a simple health test is whether there's enough disk space in every NameNode data directory. A more complicated health test may compare how long ago the last HDFS checkpoint was taken against a threshold, or check whether a DataNode is connected to a NameNode. Some health tests also aggregate other health tests: in a distributed system like HDFS, it's normal to have a few DataNodes down (assuming you've got dozens of hosts), so thresholds can be set on what percentage of DataNodes may be down before the entire service is marked unhealthy.

Health tests can assume three values: Good, Concerning, and Bad. A test returns Concerning health if the test falls below a warning threshold. A test returns Bad if the test falls below a critical threshold. The overall health of a service or role instance is a roll-up of its health checks. If any health test is Concerning (but none are Bad) the role's or service's health is Concerning; if any health test is Bad, the service's or role's health is Bad.
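
These roll-ups are visible through the API as well. A sketch that reads a service's overall health and its individual checks (names and API version are assumptions):

  # Sketch: read a service's health summary and individual health checks.
  # Host, credentials, names, and API version are assumptions.
  import base64
  import json
  import urllib.request

  URL = "http://cm-host:7180/api/v6/clusters/Cluster1/services/hdfs1"
  token = base64.b64encode(b"admin:admin").decode()

  req = urllib.request.Request(URL, headers={"Authorization": "Basic " + token})
  with urllib.request.urlopen(req) as resp:
      service = json.load(resp)

  print(service.get("healthSummary"))            # e.g. GOOD, CONCERNING, or BAD
  for check in service.get("healthChecks", []):
      print(check["name"], check["summary"])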

One common question is whether monitoring can be separated from configuration. One of the goals for monitoring is to enable it without requiring additional configuration or the installation of additional tools (for example, Nagios). By having a deep model of the configuration, Cloudera Manager is able to know which directories to monitor, which ports to use, and what credentials to use for those ports. This tight coupling means that, when you install Cloudera Standard (the free version of the Cloudera platform), all the monitoring is enabled.

Events and Alerts

An event is a record that something of interest has occurred: a service's health has changed state, a log message (of the appropriate severity) has been logged, and so on. For example, when a health test is Bad, an event is usually created.

An alert is an event that is considered especially noteworthy and is triggered by a selected event. Alerts appear with a red badge when you view events in a list. You can configure the Alert Publisher to send alert notifications by email or via SNMP trap to a trap receiver such as HP OpenView.
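
Events can also be retrieved through the API. A sketch that lists recent events (host, credentials, and API version are assumptions; the endpoint accepts a query parameter for filtering, whose syntax is described in the API documentation):

  # Sketch: list recent events via the REST API.
  # Host, credentials, and API version are assumptions.
  import base64
  import json
  import urllib.request

  URL = "http://cm-host:7180/api/v6/events"
  token = base64.b64encode(b"admin:admin").decode()

  req = urllib.request.Request(URL, headers={"Authorization": "Basic " + token})
  with urllib.request.urlopen(req) as resp:
      events = json.load(resp)

  for event in events.get("items", []):
      print(event.get("timeOccurred"), event.get("severity"), event.get("content"))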