Cluster Sizing Guidelines for Impala

This document provides a very rough guideline to estimate the size of a cluster needed for a specific customer application. You can use this information when planning how much and what type of hardware to acquire for a new cluster, or when adding Impala workloads to an existing cluster.

Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently, Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration. Because work can be distributed in different ways for different queries, if some hosts are overloaded compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance and overall slowness

For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM, 12 2 TB hard drives, 2x E5-2630L 12 cores total, 10 GB network), the following table estimates the number of DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads similar to TPC-DS benchmark queries:

Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time
Data Size 1 query 10 queries 100 queries 1000 queries 2000 queries
250 GB 2 2 5 35 70
500 GB 2 2 10 70 135
1 TB 2 2 15 135 270
15 TB 2 20 200 N/A N/A
30 TB 4 40 400 N/A N/A
60 TB 8 80 800 N/A N/A

Factors Affecting Scalability

A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with cluster size. However, for some workloads, the scalability might be bounded by the network, or even by memory.

If the workload is already network bound (on a 10 GB network), increasing the cluster size won’t reduce the network load; in fact, a larger cluster could increase network traffic because some queries involve "broadcast" operations to all DataNodes. Therefore, boosting the cluster size does not improve query throughput in a network-constrained environment.

Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core per node to process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system cannot run more than 2 such queries at the same time.

Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound workload. It’s just that the CPU will not be saturated. Per-node throughput will be lower than 1.6 GB/sec. Consider increasing the memory per node.

As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the throughput estimate.

A More Precise Approach

A more precise sizing estimate would require not only queries per minute (QPM), but also an average data size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:

Eq 1: N > QPM * D / 100 GB

Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can estimate the number of node using our equation above.

N > QPM * D / 100GB
N > 400 * 50GB / 100GB
N > 200

Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.

Depending on the complexity of the query, the processing rate of query might change. If the query has more joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the process rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scan and filtering on numbers, the processing rate can be higher.

Estimating Memory Requirements

Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the joined tables, using the COMPUTE STATS statement. However, joining big tables does consume more memory. Follow the steps below to calculate the minimum memory requirement.

Suppose you are running the following join:

select a.*, b.col_1, b.col_2, … b.col_n
from a, b
where a.key = b.key
and b.col_1 in (1,2,4...)
and b.col_4 in (....);

And suppose table B is smaller than table A (but still a large table).

The memory requirement for the query is the right-hand table (B), after decompression, filtering (b.col_n in ...) and after projection (only using certain columns) must be less than the total memory of the entire cluster.

Cluster Total Memory Requirement  = Size of the smaller table *
  selectivity factor from the predicate *
  projection factor * compression ratio

In this case, assume that table B is 100 TB in Parquet format with 200 columns. The predicate on B (b.col_1 in ...and b.col_4 in ...) will select only 10% of the rows from B and for projection, we are only projecting 5 columns out of 200 columns. Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor.

Cluster Total Memory Requirement  = Size of the smaller table *
  selectivity factor from the predicate *
  projection factor * compression ratio
  = 100TB * 10% * 5/200 * 3
  = 0.75TB
  = 750GB

So, if you have a 10-node cluster, each node has 128 GB of RAM and you give 80% to Impala, then you have 1 TB of usable memory for Impala, which is more than 750GB. Therefore, your cluster can handle join queries of this magnitude.