phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ethan Wang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PHOENIX-4160) research for a proper hash size set for APPROX_COUNT_DISTINCT
Date Tue, 05 Sep 2017 20:40:03 GMT
Ethan Wang created PHOENIX-4160:
-----------------------------------

             Summary: research for a proper hash size set for APPROX_COUNT_DISTINCT
                 Key: PHOENIX-4160
                 URL: https://issues.apache.org/jira/browse/PHOENIX-4160
             Project: Phoenix
          Issue Type: Improvement
         Environment: [link title|http://example.com]
            Reporter: Ethan Wang


Now after -PHOENIX-418- finishes, currently the hash size is hard coded as 25/16 bits by default
(a design we follow Apache Druid. discussion see CALCITE-1588). And now we want to study to
find a proper size. 

Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,  sqrt(3\*ln(2)-1)/sqrt(2^precision).
see the page 129 of this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
the performance of hll under different bucket/hash size has been studied. 
See details here: https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html

2, When the estimate cardinalities is large enough, this performance of hll will become problematic
because the hash collisions (saturation). For detail see the study done by Flajolet et al.
In fact, Timok proposed that any number larger than 2^{32}/30 should consider "to large" for
a 32 bit hash. See study [Google’s Take On Engineering HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
and the Figure 8 of [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message