phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ethan Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-3390) Custom UDAF for HyperLogLogPlus
Date Tue, 20 Jun 2017 23:17:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056597#comment-16056597
] 

Ethan Wang edited comment on PHOENIX-3390 at 6/20/17 11:16 PM:
---------------------------------------------------------------

Had a discussion with [~talktoswapna] today. Thanks to her time we clarified some items regarding
this feature. Now I'd like to add some more detail to the description of this ticket.

First, HyperLogLog is an algorithm that is only good at one particular task: Count of distinct
items in a multi-set. It does not help on generic user-defined aggregation function, such
as APPROX_SUM, APPROX_MAX etc. (The APPROX_SUM example Swapna provided above is actually the
count of the union of two multi-sets, but it's still count)

Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do the count on
a huge data set (before it is impossible do so, because you have to remember each distinct
item in memory). So HLL is a "can't-do" to "can-do" improvement, not a "slow" to "faster"
improvement: It still has to full scan through every item, there is no "sampling" going on
in the process!

Besides HLL, there do exist other algorithms and their implementations that will achieve other
types of user defined approximate aggregation. Those works such as SuperLogLog, KLL(sketching
algorithm by Zohar et al), the work by Manku, Rajagopalan et al (1), and the work by Munro
and Paterson et al (2).  Among which, Yahoo's library DataSketches(https://datasketches.github.io.
thanks for recommending @Rahul G) provides the following Approximate aggregations that may
be particularly interesting for Phoenix. Besides HLL based APPROX_COUNT, there are :
    APPROX_MEDIAN
    APPROX_PERCENTILE (like95 Percentile)
    APPROX_MIN/MAX
    APPROX_KMV (the famous Kth Minimum Value)


=================================================================================
Now, everything above are space-optimized approaches. We could, have a different set of user
defined functions for approximate aggregation that is time-optimized based.  i.e., run faster.
Such as a use case below: "I'm able to do select count( * ) from A, but I don't want to, because
full scan of A take too long. Can I just count on 10% sample and get a approximate that way?"

Along this line, the similar approaches would be BlinkDB(https://sameeragarwal.github.io/blinkdb_eurosys13.pdf),
PHOENIX-153 tablesample.

Phoenix community please advice the direction we are going. 

[~jamestaylor][~gjacoby]





(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling tech-niques
for space efficient online computation of order statistics of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.


was (Author: aertoria):
Had a discussion with [~talktoswapna] today. Thanks to her time we clarified some items regarding
this feature. Now I'd like to add some more detail to the description of this ticket.

First, HyperLogLog is an algorithm that is only good at one particular task: Count of distinct
items in a multi-set. It does not help on generic user-defined aggregation function, such
as APPROX_SUM, APPROX_MAX etc. (The APPROX_SUM example Swapna provided above is actually the
count of the union of two multi-sets, but it's still count)

Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do the count on
a huge data set (before it is impossible do so, because you have to remember each distinct
item in memory). So HLL is a "can't-do" to "can-do" improvement, not a "slow" to "faster"
improvement: It still has to full scan through every item, there is no "sampling" going on
in the process!

Besides HLL, there do exist other algorithms and their implementations that will achieve other
types of user defined approximate aggregation. Those works such as SuperLogLog, KLL(sketching
algorithm by Zohar et al), the work by Manku, Rajagopalan et al (1), and the work by Munro
and Paterson et al (2).  Among which, Yahoo's library DataSketches(https://datasketches.github.io.
thanks for recommending @Rahul G) provides the following Approximate aggregations that may
be particularly interesting for Phoenix. Besides HLL based APPROX_COUNT, there are :
    APPROX_MEDIAN
    APPROX_PERCENTILE (like95 Percentile)
    APPROX_MIN/MAX
    APPROX_KMV (the famous Kth Minimum Value)


=================================================================================
Now, everything above are space-optimized approaches. We could, have a different set of user
defined functions for approximate aggregation that is time-optimized based.  i.e., run faster.

Such as a use case below: "I'm able to do select count( * ) from A, but I don't want to, because
full scan of A take too long. Can I just count on 10% sample and get a approximate that way?"

Along this line, the similar approaches would be BlinkDB, PHOENIX-153 tablesample.

Phoenix community please advice the direction we are going. 

[~jamestaylor][~gjacoby]





(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling tech-niques
for space efficient online computation of order statistics of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.

> Custom UDAF for HyperLogLogPlus
> -------------------------------
>
>                 Key: PHOENIX-3390
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3390
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Swapna Kasula
>            Priority: Minor
>
> With ref # PHOENIX-2069
> Custome UDAF to aggregate/union of Hyperloglog's of a column and returns a Hyperloglog.
> select hllUnion(col1) from table;  //returns a Hyperloglog, which is the union of all
hyperloglog's from all rows for column 'col1'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message