[ https://issues.apache.org/jira/browse/PHOENIX4160?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Ethan Wang updated PHOENIX4160:

Description:
Now after PHOENIX418 finishes, currently the hash size is hard coded as 25/16 bits by default
(a design we follow Apache Druid. discussion see CALCITE1588). And now we want to study to
find a proper size.
Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e., sqrt(3\*ln(2)1)/sqrt(2^precision).
see the page 129 of this [paperhttp://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
the performance of hll under different bucket/hash size has been studied.
See details here: https://metron.apache.org/currentbook/metronanalytics/metronstatistics/HLLP.html
2, When the estimate cardinalities is large enough, this performance of hll will become problematic
because the hash collisions (saturation). For detail see the study done by Flajolet et al.
In fact, Timok proposed that any number larger than 2^{32}/30 should consider "to large" for
a 32 bit hash. See study [Google’s Take On Engineering HLLhttps://research.neustar.biz/2013/01/24/hyperlogloggooglestakeonengineeringhll/]
and the Figure 8 of [paperhttps://stefanheule.com/papers/edbt13hyperloglog.pdf]
Alternatively we can instruct user to not only use it in exceedingly huge cardinally scenario.
was:
Now after PHOENIX418 finishes, currently the hash size is hard coded as 25/16 bits by default
(a design we followed Apache Druid. discussion see CALCITE1588). And now we want to study
to find a proper size.
Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e., sqrt(3\*ln(2)1)/sqrt(2^precision).
see the page 129 of this [paperhttp://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
the performance of hll under different bucket/hash size has been studied.
See details here: https://metron.apache.org/currentbook/metronanalytics/metronstatistics/HLLP.html
2, When the estimate cardinalities is large enough, this performance of hll will become problematic
because the hash collisions (saturation). For detail see the study done by Flajolet et al.
In fact, Timok proposed that any number larger than 2^{32}/30 should consider "to large" for
a 32 bit hash. See study [Google’s Take On Engineering HLLhttps://research.neustar.biz/2013/01/24/hyperlogloggooglestakeonengineeringhll/]
and the Figure 8 of [paperhttps://stefanheule.com/papers/edbt13hyperloglog.pdf]
Alternatively we can instruct user to not only use it in exceedingly huge cardinally scenario.
> research for a proper hash size set for APPROX_COUNT_DISTINCT
> 
>
> Key: PHOENIX4160
> URL: https://issues.apache.org/jira/browse/PHOENIX4160
> Project: Phoenix
> Issue Type: Improvement
> Environment: [link titlehttp://example.com]
> Reporter: Ethan Wang
>
> Now after PHOENIX418 finishes, currently the hash size is hard coded as 25/16 bits
by default (a design we follow Apache Druid. discussion see CALCITE1588). And now we want
to study to find a proper size.
> Note:
> 1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e., sqrt(3\*ln(2)1)/sqrt(2^precision).
> see the page 129 of this [paperhttp://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
> the performance of hll under different bucket/hash size has been studied.
> See details here: https://metron.apache.org/currentbook/metronanalytics/metronstatistics/HLLP.html
> 2, When the estimate cardinalities is large enough, this performance of hll will become
problematic because the hash collisions (saturation). For detail see the study done by Flajolet
et al. In fact, Timok proposed that any number larger than 2^{32}/30 should consider "to large"
for a 32 bit hash. See study [Google’s Take On Engineering HLLhttps://research.neustar.biz/2013/01/24/hyperlogloggooglestakeonengineeringhll/]
and the Figure 8 of [paperhttps://stefanheule.com/papers/edbt13hyperloglog.pdf]
> Alternatively we can instruct user to not only use it in exceedingly huge cardinally
scenario.

This message was sent by Atlassian JIRA
(v6.4.14#64029)
