CHAR Data Type (<keyword keyref="impala20"/> or higher only)

+ +

+ cluster sizing + This document provides a very rough guideline to estimate the size of a cluster needed for a specific + customer application. You can use this information when planning how much and what type of hardware to + acquire for a new cluster, or when adding Impala workloads to an existing cluster. +

+ + + Before making purchase or deployment decisions, consult your Cloudera representative to verify the + conclusions about hardware requirements based on your data volume and workload. + + + + +

+ Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently, + Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration. + Because work can be distributed in different ways for different queries, if some hosts are overloaded + compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance + and overall slowness +

+ +

+ For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM, + 12 2 TB hard drives, 2x E5-2630L 12 cores total, 10 GB network), the following table estimates the number of + DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads + similar to TPC-DS benchmark queries: +

+ + + Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time + + + + + + + + + + + Data Size + + + 1 query + + + 10 queries + + + 100 queries + + + 1000 queries + + + 2000 queries + + + + + + + 250 GB + + + 2 + + + 2 + + + 5 + + + 35 + + + 70 + + + + + 500 GB + + + 2 + + + 2 + + + 10 + + + 70 + + + 135 + + + + + 1 TB + + + 2 + + + 2 + + + 15 + + + 135 + + + 270 + + + + + 15 TB + + + 2 + + + 20 + + + 200 + + + N/A + + + N/A + + + + + 30 TB + + + 4 + + + 40 + + + 400 + + + N/A + + + N/A + + + + + 60 TB + + + 8 + + + 80 + + + 800 + + + N/A + + + N/A + + + + +

+ +

+ + Factors Affecting Scalability + +

+ A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each + node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with + cluster size. However, for some workloads, the scalability might be bounded by the network, or even by + memory. +

+ +

+ If the workload is already network bound (on a 10 GB network), increasing the cluster size won’t reduce + the network load; in fact, a larger cluster could increase network traffic because some queries involve + broadcast operations to all DataNodes. Therefore, boosting the cluster size does not improve query + throughput in a network-constrained environment. +

+ +

+ Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional + concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor + network is saturated yet. This can happen because currently Impala uses only a single core per node to + process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system + cannot run more than 2 such queries at the same time. +

+ +

+ Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound + workload. It’s just that the CPU will not be saturated. Per-node throughput will be lower than 1.6 + GB/sec. Consider increasing the memory per node. +

+ +

+ As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the + throughput estimate. +

+ +

+ + A More Precise Approach + +

+ A more precise sizing estimate would require not only queries per minute (QPM), but also an average data + size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total + data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed: +

+ +Eq 1: N > QPM * D / 100 GB + + +

+ Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is + required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can + estimate the number of node using our equation above. +

+ +N > QPM * D / 100GB +N > 400 * 50GB / 100GB +N > 200 + + +

+ Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. +

+ +

+ Depending on the complexity of the query, the processing rate of query might change. If the query has more + joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the + process rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scan and + filtering on numbers, the processing rate can be higher. +

+ +

+ + Estimating Memory Requirements + + +

+ Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the + joined tables, using the COMPUTE + STATS statement. However, joining big tables does consume more memory. Follow the steps + below to calculate the minimum memory requirement. +

+ +

+ Suppose you are running the following join: +

+ +select a.*, b.col_1, b.col_2, … b.col_n +from a, b +where a.key = b.key +and b.col_1 in (1,2,4...) +and b.col_4 in (....); + + +

+ And suppose table B is smaller than table A (but still a large table). +

+ +

+ The memory requirement for the query is the right-hand table (B), after decompression, + filtering (b.col_n in ...) and after projection (only using certain columns) must be less + than the total memory of the entire cluster. +

+ +Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio + + +

+ In this case, assume that table B is 100 TB in Parquet format with 200 columns. The + predicate on B (b.col_1 in ...and b.col_4 in ...) will select only 10% of + the rows from B and for projection, we are only projecting 5 columns out of 200 columns. + Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor. +

+ +Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio + = 100TB * 10% * 5/200 * 3 + = 0.75TB + = 750GB + + +

+ So, if you have a 10-node cluster, each node has 128 GB of RAM and you give 80% to Impala, then you have 1 + TB of usable memory for Impala, which is more than 750GB. Therefore, your cluster can handle join queries + of this magnitude. +