Hardware Requirements Guide

To assess the hardware and resource allocations for your cluster, you need to analyze the types of workloads you want to run on your cluster, and the CDH components you will be using to run these workloads. You should also consider the size of data to be stored and processed, the frequency of workloads, the number of concurrent jobs that need to run, and the speed required for your applications.

As you create the architecture of your cluster, you will need to allocate Cloudera Manager and CDH roles among the hosts in the cluster to maximize your use of resources. Cloudera provides some guidelines about how to assign roles to cluster hosts. See Cluster Hosts and Role Assignments. When multiple roles are assigned to hosts, add together the total resource requirements (memory, CPUs, disk) for each role on a host to determine the required hardware.

For information about how workloads affect sizing decisions, see the following blog post: How-to: Select the Right Hardware for Your New Hadoop Cluster.

Cloudera Manager

Cloudera Manager Server Storage Requirements

Component Storage Notes
Partition hosting /usr 1 GB  
Cloudera Manager Database 5 GB If the Cloudera Manager Database shares a host with the Service Monitor and Host Monitor, more storage space is required to meet the requirements for those components.

Host Based Cloudera Manager Server Requirements

Number of Cluster Hosts Database Host Configuration Heap Size Logical Processors Cloudera Manager Server Storage Local Directory
Very small (≤10) Shared 2 GB 4 5 GB
Small (≤20) Shared 4 GB 6  
Medium (≤200) Dedicated 8 GB 6  
Large (≤500) Dedicated 10 GB 8  
Extra Large (>500) Dedicated 16 GB 16  

Service Monitor Requirements

The requirements for the Service Monitor are based on the number of monitored entities. To see the number of monitored entities, perform the following steps:

  1. Open the Cloudera Manager Admin Console and click Clusters > Cloudera Management Service.
  2. Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.
For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.

Clusters with HDFS, YARN, or Impala

Use the recommendations in this table for clusters where the only services with worker roles are HDFS, YARN, or Impala.

Number of Monitored Entities Number of Hosts Required Java Heap Size Recommended Non-Java Heap Size
0-2,000 0-100 1 GB 6 GB
2,000-4,000 100-200 1.5 GB 6 GB
4,000-8,000 200-400 1.5 GB 12 GB
8,000-16,000 400-800 2.5 GB 12 GB
16,000-20,000 800-1,000 3.5 GB 12 GB

Clusters with HBase, Solr, Kafka, or Kudu

Use these recommendations when services such as HBase, Solr, Kafka, or Kudu are deployed in the cluster. These services typically have larger quantities of monitored entities.

Number of Monitored Entities Number of Hosts Required Java Heap Size Recommended Non-Java Heap Size
0-30,000 0-100 2 GB 12 GB
30,000-60,000 100-200 3 GB 12 GB
60,000-120,000 200-400 3.5 GB 12 GB
120,000-240,000 400-800 8 GB 20 GB

Host Monitor

The requirements for the Host Monitor are based on the number of monitored entities.

To see the number of monitored entities, perform the following steps:

  1. Open the Cloudera Manager Admin Console and click Clusters > Cloudera Management Service.
  2. Find the Cloudera Management Service Monitored Entities chart. If the chart does not exist, add it from the Chart Library.
For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.

For more information about Cloudera Manager entities, see Cloudera Manager Entity Types.

Number of Hosts Number of Monitored Entities Heap Size Non-Java Heap Size
0-200 <6k 1 GB 2 GB
200-800 6k-24k 2 GB 6 GB
800-1000 24k-30k 3 GB 6 GB

Ensure that you have at least 25 GB of disk space available for the Host Monitor, Service Monitor, Reports Manager, and Events Server databases.

For more information refer to Host Monitor and Service Monitor Memory Configuration.

Reports Manager

The Reports Manager fetches the fsimage from the NameNode at regular intervals. It reads the fsimage and creates a Lucene index for it. To improve the indexing performance, Cloudera recommends provisioning a host as powerful as possible and dedicating an SSD disk to the Reports Manager.
Reports Manager
Component Java Heap CPU Disk
Reports Manager 3-4 times the size of the fsimage.
  • Minimum: 8 cores
  • Recommended: 16 cores (32 cores, with hyperthreading enabled.)
1 dedicated disk that is at least 20 times the size of the fsimage. Cloudera strongly recommends using SSD disks.

Agent Hosts

An unpacked parcel requires approximately three times the space of the packed parcel that is stored on the Cloudera Manager Server.

Event Server

Reserve 20 GB of disk space for the Event Server Index Directory. (The location of this directory is set by the Event Server Index Directory Event Server configuration property.)

Cloudera Navigator

The sizing of Navigator components varies heavily depending on the size of the cluster and the number of audit events generated. Refer to Minimum Recommended Memory and Disk Space for more information.

Cloudera Navigator
Component Java Heap / Memory CPU Disk
Navigator Audit Server Minimum: 2-3 GB of Java Heap

Configure this value using the Java Heap Size of Auditing Server in Bytes configuration property.

Minimum: 1 core The database used by the Navigator Audit Server must be able to accommodate hundreds of gigabytes (or tens of millions of rows per day). The database size may reach a terabyte.

Ideally, the database should not be shared with other services because the audit insertion rate can overwhelm the database server making other services using same database less responsive.

See Storage Space Planning for Cloudera Navigator.

Navigator Metadata Server
  • Minimum: 10 GB of Java Heap
  • Recommended: 20 GB Java Heap

Add 20 GB for operating system buffer cache, however memory requirements can be much higher on a busy cluster and could require provisioning a dedicated host. Navigator logs include estimates based on the number of objects it is tracking.

Configure this value using the Java Heap Size of Navigator Metadata Server in Bytes configuration property.

Minimum: 1 core
  • Minimum: 50 GB
  • Recommended: 100-200 GB
Navigator HSM KMS 16 GB RAM Minimum: 2 GHz 64-bit quad core 40 GB, using moderate to high-performance drives.

Cloudera Data Science Workbench

Hardware Component Requirement Notes
CPU 16+ CPU (vCPU) cores Allocate at least 1 CPU core per session. 1 CPU core is often adequate for light workloads.
Memory 32 GB RAM
  • As a general guideline, Cloudera recommends nodes with RAM between 60GB and 256GB
  • Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.

Disk
  • Root Volume: 100 GB
  • Application Block Device or Mount Point (Master Host Only): 500 GB
  • Docker Image Block Device: 500 GB

SSDs are strongly recommended for application data storage.

CDH

Accumulo

Component Java Heap CPU Disk
Master Minimum: 1 GiB
  • 100-1000 tablets: 4 GB
  • 10,000 or more tablets with 200 or more Tablet Servers: 8 GB
  • 10,000 or more tablets with 300 or more Tablet Servers: 12 GB

Set this value using the Tablet Server Max Heapsize Accumulo configuration property.

See Flume Memory Consumption

2 Cores. Add more for large clusters or bulk load. 1 disk for local logs
Tablet Server
  • Minimum: 512 MB Java Heap
  • Recommended: 8 GB
  • Medium scale production (200 or more tablets per Tablet Server): 16 GB

Set this value using the Tablet Server Max Heapsize Accumulo configuration property.

Minimum 4 dedicated cores. You can add more cores for larger clusters, when using replication, or for bulk loads.
  • 4 or more spindles for each HDFS DataNode
  • 1 disk for local logs (this disk can be shared with the operating system and/or other Hadoop logs
Tracer 1 - 2 GB, depending on cluster workloads

Set this value using the Tracer Max Heapsize Accumulo configuration property.

2 or more dedicated cores, depending on cluster size and workloads 1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs
GC Role 1-2 GB

Set this value using the Garbage Collector Max Heapsize Accumulo configuration property.

2 or more dedicated cores, depending on cluster size 1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs
Monitor Role 1-2 GB

Set this value using the Monitor Max Heapsize Accumulo configuration property.

2 or more dedicated cores, depending on cluster size and workloads 1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs

For additional information, see Apache Accumulo on CDH Installation Guide.

Flume

Component Java Heap CPU Disk
Flume
  • Minimum: 1 GB
  • Maximum 4 GB
  • Java Heap size should be greater than the maximum channel capacity

Set this value using the Java Heap Size of Agent in Bytes Flume configuration property.

See Flume Memory Consumption

Calculate the number of cores using the following formula:

(Number of sources + Number of sinks ) / 2

Multiple disks are recommended for file channels, either a JBOD setup or RAID10 (preferred due to increased reliability).

HDFS

HDFS
Component Memory CPU Disk
JournalNode 1 GB (default)

Set this value using the Java Heap Size of JournalNode in Bytes HDFS configuration property.

1 core minimum 1 dedicated disk
NameNode
  • Minimum: 1 GB (for proof-of-concept deployments)
  • Add an additional 1 GB for each additional 1,000,000 blocks

    Snapshots and encryption can increase the required heap memory.

See Sizing NameNode Heap Memory

Set this value using the Java Heap Size of NameNode in Bytes HDFS configuration property.

Minimum of 4 dedicated cores; more may be required for larger clusters
  • Minimum of 2 dedicated disks for metadata
  • 1 dedicated disk for log files (This disk may be shared with the operating system.)
  • Maximum disks: 4
DataNode Minimum: 1 GB

Increase the memory for higher replica counts or a higher number of blocks per DataNode.

Set this value using the Java Heap Size of DataNode in Bytes HDFS configuration property.

Minimum: 4 cores. Add more cores for highly active clusters.

Minimum: 4

Maximum: 24

The maximum acceptable size will vary depending upon how large average block size is. The DN’s scalability limits are mostly a function of the number of replicas per DN, not the overall number of bytes stored. That said, having ultra-dense DNs will affect recovery times in the event of machine or rack failure -- we recommend no more than 100 TB raw storage per DN. The impact of this is largely a function of the available network capacity in the cluster.

HBase

Component Java Heap CPU Disk
Master
  • 100-10,000 regions: 4 GB
  • 10,000 or more regions with 200 or more Region Servers: 8 GB
  • 10,000 or more regions with 300 or more Region Servers: 12 GB

Set this value using the Java Heap Size of HBase Master in Bytes HBase configuration property.

Minimum 4 dedicated cores. You can add more cores for larger clusters, when using replication, or for bulk loads. 1 disk for local logs, which can be shared with the operating system and/or other Hadoop logs
Region Server

Set this value using the Java Heap Size of HBase RegionServer in Bytes HBase configuration property.

Minimum: 4 dedicated cores 4 disks for each DataNode

Hive

Component Java Heap CPU Disk
HiveServer 2 Single Connection 4 GB
2-10 connections 4-10 GB
11-20 connections 6-12 GB
21-40 connections 12-16 GB
41 to 80 connections 16-24 GB

Cloudera recommends splitting HiveServer2 into multiple instances and load balancing them once you start allocating more than 12 GB to HiveServer2. The objective is to adjust the size to reduce the impact of Java garbage collection on active processing by the service.

Set this value using the Java Heap Size of HiveServer2 in Bytes Hive configuration property.

Hive Metastore Single Connection 4 GB
2-10 connections 4-10 GB
11-20 connections 12-12 GB
21-40 connections 12-16 GB
41 to 80 connections 16-24 GB

Set this value using the Java Heap Size of Hive Metastore Server in Bytes Hive configuration property.

Beeline CLI Minimum: 2 GB

Hive on Spark

Component Java Heap CPU Disk
Hive-on-Spark
  • Minimum: 16 GB
  • Recommended: 32 GB for larger data sizes

Individual executor heaps should be no larger than 16 GB so machines with more RAM can use multiple executors.

  • Minimum: 4 cores
  • Recommended: 8 cores for larger data sizes
For more information on how to reserve YARN cores and memory that will be used by Spark executors, refer to Tuning Apache Hive on Spark in CDH.

Hue

Component Memory CPU Disk
Hue Server
  • Minimum: 4 GB
  • Maximum 10 GB
  • If the cluster uses the Hue load balancer, add additional memory
Minimum: 1 Core to run Django

When Hue is configured for high availability, add additional cores

Minimum: 10 GB for the database, which grows proportionally according to the cluster size and workloads.

When Hue is configured for high availability, add temp space is required

For more information about Hue high availability, see How to Add a Hue Load Balancer.

Impala

Component Memory CPU Disk
Impalad Processes, Catalog Server, StateStore
  • Minimum: 32 GB
  • Recommended: 128 GB

Set the heap size for Catalog Server using the Java Heap Size of Catalog Server in Bytes configuration property.

  • Minimum: 4
  • Recommended: 16 or more

CPU instruction set support: SSSE3

  • Minimum: 1 disk
  • Recommended: 8 or more

Sizing requirements for Impala can vary significantly depending on the size and types of workloads using Impala. For a complete discussion of how to calculate Impala sizing, see Cluster Sizing Guidelines for Impala.

Networking Topology for multi-rack cluster: Leaf-Spine.

Kafka

Kafka requires a fairly small amount of resources, especially with some configuration tuning. By default, Kafka, can run on little as 1 core and 1GB memory with storage scaled based on requirements for data retention.

CPU is rarely a bottleneck because Kafka is I/O heavy, but a moderately-sized CPU with enough threads is still important to handle concurrent connections and background tasks.

Kafka brokers tend to have a similar hardware profile to HDFS data nodes. How you build them depends on what is important for your Kafka use cases. Use the following guidelines:
To affect performance of these features: Adjust these parameters:
Message Retention Disk size
Client Throughput (Producer & Consumer) Network capacity
Producer throughput Disk I/O
Consumer throughput Memory

A common choice for a Kafka node is as follows:

Component Memory/Java Heap CPU Disk
Broker
  • RAM: 64 GB
  • Recommended Java heap: 4 GB

Set this value using the Java Heap Size of Broker Kafka configuration property.

See

12- 24 cores
  • 1 HDD For operating system
  • 1 HDD for Zookeeper dataLogDir (using SSDs may provide additional performance)
  • 10- HDDs, using Raid 10, for Kafka data
Mirror Maker 1 GB heap

Set this value using the Java Heap Size of MirrorMaker Kafka configuration property.

   

Networking requirements: Gigabit Ethernet or 10 Gigabit Ethernet

Key Trustee

Component Memory CPU Disk
Key Trustee Server 8 GB 1 GHz 64-bit quad core 20 GB, using moderate to high-performance drives
Key Trustee KMS 16 GB 2 GHz 64-bit quad core 40 GB, using moderate to high-performance drives

Kudu

Component Memory CPU Disk
Tablet Server
  • Minimum: 4 GB
  • Recommended: 10 GB

Additional hardware may be required, depending on the workloads running in the cluster. If you are using Impala, see the Impala sizing guidelines.

  1 disk for write-ahead log (WAL). Using an SSD drive may improve performance.
Master
  • Minimum: 256 MB
  • Recommended: 1 GB
  1 disk

For more information, see Kudu Server Management.

Oozie

Component Java Heap CPU Disk
Oozie
  • Minimum: 1 GB (this is the default set by Cloudera Manager). This is sufficient for less than 10 simultaneous workflows, without forking.
  • If you notice excessive garbage collection, or out-of-memory errors, increase the heap size to 4 GB for medium-size production clusters or to 8 GB for large-size production clusters.

Set this value using the Java Heap Size of Oozie Server in Bytes Oozie configuration property.

n/r n/r

Additional tuning:

For workloads with many coordinators that run with complex workflows (a max concurrency reached! warning appears in the log and the Oozie admin -queuedump command shows a large queue):
  • Increase the value of the oozie.service.CallableQueueService.callable.concurrency property to 50.
  • Increase the value of the oozie.service.CallableQueueService.threads property to 200.

Do not use a Derby database as a backend database for Oozie.

Search

Component Java Heap CPU Disk
Solr
  • Small workloads, or evaluations: 16 GB
  • Smaller production environments: 32 GB
  • Larger production environments: 96 GB is sufficient for most clusters.

Set this value using the Java Heap Size of Solr Server in Bytes Solr configuration property.

See

  • Minimum: 4
  • Recommended: 16 for production workloads
No requirement. Solr uses HDFS for storage.
Note the following considerations for determining the optimal amount of heap memory:
  • Size of searchable material: The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable data requires more memory than 1 TB of searchable data.
  • Content indexed in the searchable material: Indexing all fields in a collection of logs, email messages, or Wikipedia entries requires more memory than indexing only the Date Created field.
  • The level of performance required: If the system must be stable and respond quickly, more memory may help. If slow responses are acceptable, you may be able to use less memory.

For more information refer to Deployment Planning for Cloudera Search.

Sentry

Component Java Heap CPU Disk
Sentry Server
  • Minimum: 576 MB
  • Recommended: 2.5 GB per million objects in the Hive database (servers, databases, tables, partitions, columns, URIs, and views)

Set this value with the Java Heap Size of Sentry Server in Bytes Sentry configuration property.

Minimum: 4  

For more information about Sentry requirements, see Before You Install Sentry.

Spark

Component Java Heap CPU Disk
Spark History Server Minimum: 512 GB

Set this value using the Java Heap Size of History Server in Bytes Spark configuration property.

   

YARN

Component Java Heap CPU Other Recommendations
Job History Server
  • Minimum: 1 GB
  • Increase memory by 1 GB for each 100,000 tasks kept in memory. For example:

    5 jobs @ 100, 000 mappers + 30,000 reducers = 650,000 total tasks requiring 6.5 GB of heap.

    See the Other Recommendations column for additional tuning suggestions.

Set this value using the Java Heap Size of JobHistory Server in Bytes YARN configuration property.

Minimum: 1 core
  • Set the mapreduce.jobhistory.jhist.format property to binary (history files will load about 2x-3x faster with this setting). Available in CDH 5.5.0 or higher only.

  • Set the mapreduce.jobhistory.loadedtasks.cache.size property to a total loaded task count. Using the example in the Java Heap column to the left, of 650,000 total tasks, you can set it to 700,000 to allow for some safety margin. This should also prevent the JobHistoryServer from hanging during garbage collection, since the job count limit does not have a task limit.

For 4 GB container sizes:
  • Minimum: 64 GB
  • Recommended: 256 GB

Set this value using the Java Heap Size of NodeManager in Bytes YARN configuration property.

  • Minimum: 8-16 cores
  • Recommended: 32-64 cores
Disks:
  • Minimum: 8 disks
  • Recommended: 12 or more disks
Networking:
  • Minimum: Dual 1Gbps or faster
  • Recommended: Single/Dual 10 Gbps or faster
ResourceManager Minimum: 1 GB
Configure additional heap memory for the following conditions:
  • More jobs
  • Larger cluster size
  • Number of retained finished applications (configured with the yarn.resourcemanager.max-completed-applications property.
  • Scheduler configuration

Set this value using the Java Heap Size of ResourceManager in Bytes YARN configuration property.

Minimum: 1 core  
Other Settings
  • Set the ApplicationMaster Memory YARN configuration property to 512 MB
  • Set the Container Memory Minimum YARN configuration property to 1 GB; Increase this value for jobs that require more than 10,000 tasks.
   

For more information, see Tuning YARN.

ZooKeeper

Component Java Heap CPU Disk
 ZooKeeper Server 
  • Minimum: 1 GB
  • Increase heap size when watching 10,000 - 100,000 ephemeral znodes and are using 1,000 or more clients.

Set this value using the Java Heap Size of ZooKeeper Server in Bytes ZooKeeper configuration property.

 

ZooKeeper was not designed to be a low-latency service and does not benefit from the use of SSD drives. The ZooKeeper access patterns – append-only writes and sequential reads – were designed with spinning disks in mind. Therefore Cloudera recommends using HDD drives.