Managing Non-CDH Resources

From an operations perspective, CDH hosts may also run other processes, such as antivirus software or operating system backups. This topic presents information to help you plan for those processes.

Antivirus Software

If you use antivirus software on your servers, consider configuring it to skip scans of certain types of Hadoop-specific resources. Scanning large files, or directories that contain a large number of files, can take a long time. In addition, if your antivirus software locks files or directories while it scans them, those resources are unavailable to your Hadoop processes during the scan, which can cause latency or unavailability in your cluster. Consider skipping scans of the following types of resources:
  • Scratch directories used by services such as Impala
  • Log directories used by various Hadoop services
  • Data directories, which can grow to petabytes in size
The specific directory names and locations depend on the services your cluster uses and on your configuration. In general, avoid scanning very large directories and filesystems. Instead, limit write access to these locations using security mechanisms such as access controls at the operating system, HDFS, or service level.
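
To decide which directories are candidates for exclusion, it can help to measure them first. The following Python sketch is illustrative only: the function names and the default thresholds (100,000 files or 50 GiB) are assumptions chosen for the example, not Cloudera recommendations. It counts the regular files under each directory and flags the directories that exceed either threshold.

```python
import os
import stat


def summarize_tree(root):
    """Return (file_count, total_bytes) for regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat does not follow symbolic links.
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # The file vanished or is unreadable; skip it.
            if stat.S_ISREG(info.st_mode):
                file_count += 1
                total_bytes += info.st_size
    return file_count, total_bytes


def exclusion_candidates(roots, max_files=100_000, max_bytes=50 * 1024**3):
    """Return the roots that exceed either (hypothetical) threshold."""
    candidates = []
    for root in roots:
        count, size = summarize_tree(root)
        if count > max_files or size > max_bytes:
            candidates.append(root)
    return candidates
```

Pass the function the scratch, log, and data directories that your own services are configured to use. Walking a very large directory tree is itself slow, so run a check like this outside peak hours.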

Operating System Backups

Many of the considerations outlined in Antivirus Software apply to operating system backups as well. Backing up scratch directories, log directories, and large amounts of data using standard operating system utilities may not make sense. In addition, many Hadoop resources cannot be backed up by conventional means due to their size and mutability. Consider excluding these types of resources from operating system backups, and using the techniques outlined in Backup and Disaster Recovery instead.
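
Many backup utilities can read the paths to skip from a file; GNU tar, for example, accepts one through its --exclude-from option. The following Python sketch writes such a file from a list of directories. The function name and the example paths are hypothetical, and the way patterns are matched differs between tools, so check the documentation for your backup utility before relying on the result.

```python
from pathlib import Path, PurePosixPath


def write_exclude_file(directories, exclude_file):
    """Write one normalized, de-duplicated path per line; return the lines."""
    # PurePosixPath drops trailing slashes, so "/a/b/" and "/a/b" collapse.
    lines = sorted({str(PurePosixPath(d)) for d in directories})
    Path(exclude_file).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Hypothetical locations; substitute the directories your services use.
    write_exclude_file(
        ["/example/impala-scratch", "/example/hadoop-logs/", "/example/dfs-data"],
        "backup-excludes.txt",
    )
```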

If you use Cloudera Manager, it stores its configuration in a database. You should regularly perform backups of this database, using the mechanisms provided by the database vendor. See Backing Up Databases.
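
As an illustration of a vendor-provided mechanism, the following Python sketch assembles a pg_dump command line with a dated output file. It assumes that the Cloudera Manager database runs on PostgreSQL; the function name, database name, host, user, and output directory are hypothetical placeholders. Other database systems provide their own dump utilities.

```python
from datetime import date


def build_pg_dump_command(db_name, host, user, out_dir, today=None):
    """Return an argument list for pg_dump writing a dated custom-format dump."""
    today = today or date.today()
    out_file = f"{out_dir}/{db_name}-{today.isoformat()}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format", "custom",  # Compressed archive readable by pg_restore.
        "--file", out_file,
        db_name,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not Cloudera Manager defaults.
    print(" ".join(build_pg_dump_command(
        "example_cm_db", "db.example.com", "example_user", "/example/backups")))
```

The sketch only builds the command; to run it, pass the list to subprocess.run with check=True from a scheduled job, and supply credentials in a way that suits your environment.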