impala-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alexander Behm (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IMPALA-5360) Prefer exact partition row counts in scan cardinality estimation with row count extrapolation enabled.
Date Wed, 24 May 2017 20:00:07 GMT
Alexander Behm created IMPALA-5360:
--------------------------------------

             Summary: Prefer exact partition row counts in scan cardinality estimation with
row count extrapolation enabled.
                 Key: IMPALA-5360
                 URL: https://issues.apache.org/jira/browse/IMPALA-5360
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.9.0
            Reporter: Alexander Behm


With IMPALA-2373 we added row count extrapolation for scan-cardinality estimation. The simple
approach taken in IMPALA-2373 has the downside that per-partition row counts are ignored in
favor of always extrapolating based on the number of bytes scanned.

That choice the following undesirable side-effect: Immediately after running COMPUTE STATS,
the estimated row count of a subset of partitions may be slightly inaccurate, even if the
precise row count could have been determined from the per-partition stats. The problem is
that we do not know if the partitions have changed since the last COMPUTE STATS, and so we
always prefer to extrapolate to accommodate potential changes.

A possible improvement is to implement per-partition change detection, e.g., by computing
and storing a hash of file names and sizes at the time of COMPUTE STATS. Later, during cardinality
estimation we can use the hash to determine whether partitions have changed since the last
stats collection and decide whether to use the exact per-partition row count or extrapolate.

It is unclear how exactly this change detection should interact with COMPUTE STATS TABLESAMPLE
(IMPALA-5310).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message