impala-reviews mailing list archives

From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] [DOCS] Tighten up advice about first COMPUTE INCREMENTAL STATS
Date Fri, 06 Oct 2017 04:20:20 GMT
Alex Behm has posted comments on this change.

Change subject: [DOCS] Tighten up advice about first COMPUTE INCREMENTAL STATS

Patch Set 1:

File docs/shared/impala_common.xml:
PS1, Line 1226:         and the statistics are computed again from the beginning. Therefore,
expect a one-time
Suggest "from scratch" instead of "from the beginning".
PS1, Line 1241: -- by -1 under #Rows and false under Incremental stats.
I suggest you leave out the -1 under #Rows part since that may be confusing. The reason is
that DROP INCREMENTAL STATS will *not* modify the #Rows.

Here's how you can think about incremental stats:
COMPUTE INCREMENTAL STATS populates the same "regular" stats that COMPUTE STATS does, such
as #Rows and column NDVs, and in addition stores "incremental stats" to speed up the next
COMPUTE INCREMENTAL STATS. So the "incremental" part is really this extra information,
which you can drop separately from the "regular" stats.

One nice thing is that you can safely DROP INCREMENTAL STATS everywhere to reduce the size
of table metadata without impacting query plans because the "regular" stats are preserved.
File docs/topics/impala_partitioning.xml:
PS1, Line 611:         Because the <codeph>COMPUTE STATS</codeph> statement can
be resource-intensive to run frequently
This advice isn't prescriptive enough for my taste. We should state very clearly that you
should use either COMPUTE STATS xor COMPUTE INCREMENTAL STATS but never both. Switching during
the lifetime of a table is *not* recommended, but if you really must do so then we recommend
you first drop all stats before the switch (using DROP STATS and DROP INCREMENTAL STATS).
PS1, Line 613:         that is optimized for processing partitioned tables.
I wouldn't say that incremental stats is "optimized" for partitioned tables. Foremost, incremental
stats allow you to compute stats in a partition-by-partition fashion which might be a better
fit for a user's data ingestion pattern. However, we should be very clear about the cost of
incremental stats. Incremental stats need ~400 bytes per column per partition in the table
metadata (which gets disseminated and cached everywhere), so incremental stats are not a good
fit for tables with a huge number of columns and partitions. If you have a partitioned table
and only a few of the partitions are "active" then you can compute incremental stats for new
partitions coming in and drop incremental stats for those partitions "phased" out to limit
your exposure to the metadata size problems.

You can even state that the huge table metadata can crash the catalog and/or impalads due
to the Java 2GB array size limit. (We're working on fixing that.)
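
To make the cost concrete, here is a rough back-of-the-envelope sketch in Python. It only multiplies out the ~400 bytes per column per partition figure quoted above and compares the result with the largest possible Java array length (2^31 - 1). The function names are made up for illustration, and treating all incremental stats as if they landed in a single byte array is a simplifying assumption, not a description of how the catalog actually serializes metadata.

```python
# Rough estimate only. The 400-byte figure is the approximate per-column,
# per-partition cost quoted in the review comment, not a measured value.
BYTES_PER_COLUMN_PER_PARTITION = 400

# Java arrays are indexed by int, so an array holds at most 2**31 - 1 elements.
JAVA_MAX_ARRAY_BYTES = 2**31 - 1


def incremental_stats_bytes(num_columns, num_partitions):
    """Approximate metadata size of incremental stats for one table."""
    return BYTES_PER_COLUMN_PER_PARTITION * num_columns * num_partitions


def fraction_of_array_limit(num_columns, num_partitions):
    """Estimated size as a fraction of the Java array size limit.

    Assumes, for illustration only, that the stats end up in one byte array.
    """
    return incremental_stats_bytes(num_columns, num_partitions) / JAVA_MAX_ARRAY_BYTES


if __name__ == "__main__":
    # Hypothetical table shapes, chosen only to show the scale of the problem.
    for cols, parts in [(100, 1_000), (500, 10_000)]:
        size = incremental_stats_bytes(cols, parts)
        frac = fraction_of_array_limit(cols, parts)
        print(f"{cols} columns x {parts} partitions: ~{size:,} bytes "
              f"({frac:.1%} of the array limit)")
```

Under these assumptions, a table with 100 columns and 1,000 partitions needs about 40 MB for incremental stats, while 500 columns and 10,000 partitions needs about 2 GB for a single table, which is already close to the limit.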

Basically I want to be sure that users understand the cost of incremental stats and the impact
(a crash) when they go overboard with incremental stats. There is no graceful degradation.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia53a6518ce5541e5c9a2cd896856ce042a599b03
Gerrit-Change-Number: 7999
Gerrit-PatchSet: 1
Gerrit-Owner: John Russell <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Greg Rahn <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Silvius Rus <>
Gerrit-Comment-Date: Fri, 06 Oct 2017 04:20:20 +0000
Gerrit-HasComments: Yes
