phoenix-dev mailing list archives

From "Ethan Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PHOENIX-4160) research for a proper hash size set for APPROX_COUNT_DISTINCT
Date Tue, 05 Sep 2017 21:01:00 GMT

     [ https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Wang updated PHOENIX-4160:
--------------------------------
    Description: 
Now that -PHOENIX-418- is finished, we want to study and find a proper default size for the HLL hash.
Currently the hash size is hard-coded to 25/16 bits by default (a design we followed from Apache
Druid; for the discussion see CALCITE-1588).
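
A minimal sketch of what that default would look like, assuming the underlying sketch is ClearSpring stream-lib's HyperLogLogPlus (an assumption for illustration, not confirmed by this issue; "25/16" would then map to sparse precision sp=25 and normal precision p=16):
{code:java}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class HllDefaultSizeSketch {
    public static void main(String[] args) {
        // Assumption: "25/16 bits" = sparse precision sp=25, normal precision p=16,
        // i.e. 2^16 registers once the sketch switches out of sparse mode.
        HyperLogLogPlus hll = new HyperLogLogPlus(16, 25);
        for (int i = 0; i < 1_000_000; i++) {
            hll.offer("value-" + i);
        }
        // Expect roughly 1,000,000 +/- ~0.4% at p=16.
        System.out.println("estimate = " + hll.cardinality());
    }
}
{code}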

Note:
1. The standard error of HyperLogLog scales as 1/sqrt(m), where m = 2^precision is the number
of buckets; specifically it is {code:java}sqrt(3*ln(2) - 1) / sqrt(2^precision){code}, roughly 1.04/sqrt(m).
For details, see page 129 of this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
The performance of HLL under different bucket/hash sizes, including larger ones, has been
studied here: https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
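
As a quick sanity check of the bound in note 1 (standalone code, not from Phoenix), the snippet below tabulates the theoretical standard error for a range of precisions; at the current default of p=16 it works out to roughly 0.41%:
{code:java}
public class HllStdErrorTable {
    public static void main(String[] args) {
        // std error ~= sqrt(3*ln(2) - 1) / sqrt(2^precision) ~= 1.04 / sqrt(m)
        for (int precision = 10; precision <= 20; precision++) {
            double m = Math.pow(2, precision);  // number of buckets
            double stdError = Math.sqrt(3 * Math.log(2) - 1) / Math.sqrt(m);
            System.out.printf("p=%2d  m=%,10.0f  std error=%.4f%%%n",
                    precision, m, stdError * 100);
        }
    }
}
{code}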

2. When the estimated cardinality gets large enough, as Timok and Flajolet et al. observed,
the accuracy of HLL becomes problematic because of hash collisions (saturation).
In fact, Timok proposed that any cardinality larger than {code}2^32 / 30{code} should be considered
"too large" for a 32-bit hash. See [Google’s Take On Engineering HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
and Figure 8 of this [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf].
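
To make the threshold in note 2 concrete, here is a standalone sketch (not Phoenix code) that computes it together with the standard large-range correction from the Flajolet et al. paper, which is applied once the raw estimate of a 32-bit implementation passes that point:
{code:java}
public class HllSaturationThreshold {
    public static void main(String[] args) {
        double hashSpace = Math.pow(2, 32);
        double threshold = hashSpace / 30;  // ~143 million distinct values
        System.out.printf("32-bit saturation threshold: %,.0f%n", threshold);

        // Large-range correction: E* = -2^32 * ln(1 - E / 2^32),
        // applied when the raw estimate E exceeds the threshold above.
        double rawEstimate = 2e9;
        double corrected = -hashSpace * Math.log(1 - rawEstimate / hashSpace);
        System.out.printf("raw=%,.0f -> corrected=%,.0f%n", rawEstimate, corrected);
    }
}
{code}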

Alternatively, we can instruct users not to use it in exceedingly large-cardinality scenarios.



> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4160
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4160
>             Project: Phoenix
>          Issue Type: Improvement
>         Environment: 
>            Reporter: Ethan Wang
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
