Appendix A - Understanding Custom Installation Solutions
Cloudera hosts two types of software repositories that you can use to install products such as Cloudera Manager or CDH: package repositories (RPM packages for RHEL and SLES, and Debian/Ubuntu packages) and parcel repositories, newly available with Cloudera Manager 4.5.
With parcels, you can download, distribute, and activate a new CDH version all from within Cloudera Manager. Further, only the Cloudera Manager server needs Internet access, to download the desired parcel to a local repository on the Cloudera Manager server. Distribution of parcels to the remaining cluster members does not require Internet access. Parcels are available for CDH 4.1.3 and onwards. Cloudera Manager continues to work with RPM (RHEL and SLES) and Debian/Ubuntu packages.
These repositories are effective solutions in most cases, but custom installation solutions are sometimes required. Using the software repositories requires client access over the Internet and results in the installation of the latest version of products.
An alternate solution is required if:
- You need to install older product versions. For example, in a CDH cluster, all hosts must run the same CDH version. After completing an initial installation, you may want to add nodes, either to increase the size of your cluster to handle larger tasks or to replace older hardware. Because the Cloudera repositories install the latest version by default, the new nodes could otherwise end up with a newer CDH version than the rest of the cluster.
- The hosts on which you want to install Cloudera products are not connected to the Internet, so they are unable to reach the Cloudera repository. (Note that for a parcel installation, only the Cloudera Manager server needs Internet access, but for a package installation, all cluster members need access to the Cloudera repository.) Some organizations choose to partition parts of their network from outside access. Isolating segments of a network can provide greater assurance that valuable data is not compromised by individuals acting out of malice or for personal gain. In such a case, the isolated computers are unable to access Cloudera's software repositories for new installations or upgrades.
In both of these cases, using a custom repository solution allows you to meet the needs of your organization, whether that means installing older versions of Cloudera software or installing any version of Cloudera software on machines that are disconnected from the Internet.
Parcel is a new packaging format that facilitates upgrading CDH from within the Cloudera Manager Admin console. You download, distribute, and activate a parcel from within the Parcels page, found under the Hosts tab in the Admin console.
Cloudera Manager downloads a parcel to a local repository, by default at /opt/cloudera/parcel-repo. (The location is configurable; see "Parcel Configuration Settings" in Managing Parcels.) Once the parcel is downloaded to the Cloudera Manager server, an Internet connection is no longer needed to deploy the parcel. When you click "Distribute", every Cloudera Manager agent starts to download the parcel from the Cloudera Manager server.
If your Cloudera Manager server does not have Internet access, you can obtain the required parcel files by other means and put them into the local repository. Once you copy the .parcel file into that directory and create the associated .sha file, Cloudera Manager automatically picks it up and shows it on the Parcels page.
See Creating a Local Parcel Repository for instructions.
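As a minimal sketch of the offline step described above, the following shell function writes a .sha file next to a parcel. It assumes the .sha file holds only the parcel's SHA-1 digest as a hex string; verify that assumption against the instructions in Creating a Local Parcel Repository before relying on it. The function name and the parcel filename in the usage comment are placeholders, not names from Cloudera's documentation.

```shell
#!/bin/sh
# Write <parcel>.sha next to a parcel file.
# Assumption: the .sha file contains only the parcel's SHA-1 hex digest.
make_parcel_sha() {
    parcel="$1"
    if [ ! -f "$parcel" ]; then
        echo "no such parcel: $parcel" >&2
        return 1
    fi
    # sha1sum prints "<digest>  <filename>"; keep only the digest.
    sha1sum "$parcel" | cut -d' ' -f1 > "$parcel.sha"
}

# Hypothetical usage on the Cloudera Manager server (default repository path):
#   cp /path/to/downloaded.parcel /opt/cloudera/parcel-repo/
#   make_parcel_sha /opt/cloudera/parcel-repo/downloaded.parcel
```

If the repository location has been changed from the default, substitute the configured directory for /opt/cloudera/parcel-repo.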
Understanding Package Management
Before getting into the details of how to configure a custom package management solution in your environment, it can be useful to have more information about:
- How package management tools work
- Which tools come with which operating systems
- Each tool's configuration files
How Do Packaging and Package Management Tools Interact?
Packages (rpm or deb files) help ensure that installations complete successfully by encoding each package's dependencies. That means that if you request the installation of a solution, all required elements can be installed at the same time. For example, hadoop-0.20-hive depends on hadoop-0.20. Package management tools, such as yum (RedHat), zypper (SUSE), and apt-get (Debian/Ubuntu), can find and install any required packages. For example, on RedHat, you might enter yum install hadoop-0.20-hive. Yum would inform you that the hive package requires hadoop-0.20 and offer to complete that installation for you. Zypper and apt-get provide similar functionality.
How Do Package Management Tools Find all Available Packages?
Package management tools rely on a list of repositories. Information about each tool's repositories is stored in configuration files, the location of which varies according to the particular package management tool.
- Yum on RedHat/CentOS: /etc/yum.repos.d
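- Zypper on SUSE: /etc/zypp/zypper.conf (Repositories are defined in *.repo files in the /etc/zypp/repos.d/ directory.)
- Apt-get on Debian/Ubuntu: /etc/apt/apt.conf (Repositories are listed in /etc/apt/sources.list; additional repositories are specified using *.list files in the /etc/apt/sources.list.d/ directory.)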
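To check which of these tools a given host uses, a small shell sketch can probe for each command and print where that tool keeps its repository definitions. The function name repo_config_location is made up for illustration. The zypper and APT paths it prints (/etc/zypp/repos.d, /etc/apt/sources.list, and /etc/apt/sources.list.d) are the standard locations of the repository definitions themselves on those systems.

```shell
#!/bin/sh
# Print the package management tool found on this host, followed by the
# location(s) of its repository definitions. Returns 1 if none is found.
repo_config_location() {
    if command -v yum >/dev/null 2>&1; then
        echo "yum /etc/yum.repos.d"
    elif command -v zypper >/dev/null 2>&1; then
        echo "zypper /etc/zypp/repos.d"
    elif command -v apt-get >/dev/null 2>&1; then
        echo "apt-get /etc/apt/sources.list /etc/apt/sources.list.d"
    else
        echo "no supported package management tool found" >&2
        return 1
    fi
}

# Example: repo_config_location
```

The probe order is arbitrary; on a host with more than one of these tools installed, only the first match is reported.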
For example, on a typical CentOS system, you might find:
    [user@localhost ~]$ ls -l /etc/yum.repos.d/
    total 24
    -rw-r--r-- 1 root root 2245 Apr 25 2010 CentOS-Base.repo
    -rw-r--r-- 1 root root  626 Apr 25 2010 CentOS-Media.repo
Inside those .repo files are pointers to one or many repositories. There are similar pointers inside configuration files for zypper and apt-get. In the following snippet from CentOS-Base.repo, there are two repositories defined: one with the ID base and one with the ID updates. The mirrorlist parameter points to a URL that returns a list of mirrors from which this repository can be downloaded.
    # ...
    [base]
    name=CentOS-$releasever - Base
    mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
    #baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5

    #released updates
    [updates]
    name=CentOS-$releasever - Updates
    mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
    #baseurl=http://mirror.centos.org/centos/$releasever/updates/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5
    # ...
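As a rough sketch, the repository IDs defined in a .repo file can be pulled out by printing its bracketed section headers. The helper name list_repo_ids is hypothetical, and the approach assumes one [id] header line per repository, as in the CentOS-Base.repo snippet.

```shell
#!/bin/sh
# Print the repository IDs (the names inside [brackets]) defined in a
# yum .repo file, one per line. Comment and key=value lines are skipped
# because they do not start with "[".
list_repo_ids() {
    sed -n 's/^\[\([^]]*\)\].*$/\1/p' "$1"
}

# Example: list_repo_ids /etc/yum.repos.d/CentOS-Base.repo
```

For the portion of CentOS-Base.repo shown in the snippet, this would print base and updates on separate lines.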
You can list the repositories you have enabled. The command varies according to operating system:
- RedHat/CentOS: yum repolist
- SUSE: zypper repos
- Debian/Ubuntu: Apt-get does not include a command to display sources, but you can determine sources by reviewing the contents of /etc/apt/sources.list and any files contained in /etc/apt/sources.list.d/.
The following shows an example of what yum repolist might report on a CentOS system:
    [root@localhost yum.repos.d]$ yum repolist
    Loaded plugins: fastestmirror
    Loading mirror speeds from cached hostfile
     * addons: mirror.san.fastserv.com
     * base: centos.eecs.wsu.edu
     * extras: mirrors.ecvps.com
     * updates: mirror.5ninesolutions.com
    repo id      repo name             status
    addons       CentOS-5 - Addons     enabled:     0
    base         CentOS-5 - Base       enabled: 3,434
    extras       CentOS-5 - Extras     enabled:   296
    updates      CentOS-5 - Updates    enabled: 1,137
    repolist: 4,867
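On Debian/Ubuntu, where reviewing the source files takes the place of a listing command, a short sketch like the following can print the active entries. It assumes the traditional one-line "deb ..." entry format; the function name list_apt_sources is illustrative only.

```shell
#!/bin/sh
# Print active (non-commented) "deb" and "deb-src" entries.
# Arguments: files to scan; defaults to the standard APT locations.
list_apt_sources() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/apt/sources.list /etc/apt/sources.list.d/*.list
    fi
    # -h suppresses filename prefixes; errors for missing files are hidden.
    grep -hE '^[[:space:]]*deb(-src)?[[:space:]]' "$@" 2>/dev/null
}

# Example: list_apt_sources
```

Commented-out entries are skipped because the pattern requires the line to begin with deb or deb-src, optionally preceded by whitespace.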