I want to bring this discussion to the dev list to gather broader feedback,
as there have been some discussions that happened over multiple JIRA
tickets (SPARK16026 <https://issues.apache.org/jira/browse/SPARK16026>,
etc) and GitHub pull requests about what statistics to collect and how to
use them.
There are some basic statistics on columns that are obvious to use and we
don't need to debate these: estimated size (in bytes), row count, min, max,
number of nulls, number of distinct values, average column length, max
column length.
In addition, we want to be able to estimate selectivity for equality and
range predicates better, especially taking into account skewed values and
outliers.
Before I dive into the different options, let me first explain countmin
sketch: Countmin sketch is a common sketch algorithm that tracks frequency
counts. It has the following nice properties:
 sublinear space
 can be generated in onepass in a streaming fashion
 can be incrementally maintained (i.e. for appending new data)
 it's already implemented in Spark
 more accurate for frequent values, and less accurate for lessfrequent
values, i.e. it tracks skewed values well.
 easy to compute inner product, i.e. trivial to compute the countmin
sketch of a join given two countmin sketches of the join tables
Proposal 1 is is to use a combination of countmin sketch and equiheight
histograms. In this case, countmin sketch will be used for selectivity
estimation on equality predicates, and histogram will be used on range
predicates.
Proposal 2 is to just use countmin sketch on equality predicates, and then
simple selected_range / (max  min) will be used for range predicates. This
will be less accurate than using histogram, but simpler because we don't
need to collect histograms.
Proposal 3 is a variant of proposal 2, and takes into account that skewed
values can impact selectivity heavily. In 3, we track the list of heavy
hitters (HH, most frequent items) along with countmin sketch on the
column. Then:
 use countmin sketch on equality predicates
 for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
(max  min)
Proposal 4 is to not use any sketch, and use histogram for high cardinality
columns, and exact (value, frequency) pairs for low cardinality columns
(e.g. num distinct value <= 255).
Proposal 5 is a variant of proposal 4, and adapts it to track exact (value,
frequency) pairs for the most frequent values only, so we can still have
that for high cardinality columns. This is actually very similar to
countmin sketch, but might use less space, although requiring two passes
to compute the initial value, and more difficult to compute the inner
product for joins.
