flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7465) Add build-in BloomFilterCount on TableAPI&SQL
Date Thu, 17 Aug 2017 21:04:01 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131296#comment-16131296
] 

Fabian Hueske commented on FLINK-7465:
--------------------------------------

I think an approximate count distinct would be a very nice feature.
AFAIK, count-min sketches are more commonly used and probably better suited for this use case.
Similar to bloom filters, they are based on bitmaps and hashing and can be tuned for space
and accuracy. 

[~sunjincheng121] How would you like to configure the accuracy, i.e., before the function
is registered or when the function is used in a query?

> Add build-in BloomFilterCount on TableAPI&SQL
> ---------------------------------------------
>
>                 Key: FLINK-7465
>                 URL: https://issues.apache.org/jira/browse/FLINK-7465
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: sunjincheng
>            Assignee: sunjincheng
>         Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different
hash functions defined, each of which maps or hashes some set element to one of the m array
positions, generating a uniform random distribution. Typically, k is a constant, much smaller
than m, which is proportional to the number of elements to be added; the precise choice of
k and the constant of proportionality of m are determined by the intended false positive rate
of the filter.
> To add an element, feed it to each of the k hash functions to get k array positions.
Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of the k hash
functions to get k array positions. If any of the bits at these positions is 0, the element
is definitely not in the set – if it were, then all the bits would have been set to 1 when
it was inserted. If all are 1, then either the element is in the set, or the bits have by
chance been set to 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show
the positions in the bit array that each set element is mapped to. The element w is not in
the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure,
m = 18 and k = 3. The sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message