flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jacob Park (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7465) Add build-in BloomFilterCount on TableAPI&SQL
Date Mon, 28 Aug 2017 14:47:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143844#comment-16143844

Jacob Park commented on FLINK-7465:

[~fhueske] [~sunjincheng121] Thanks for the context. :)

If HyperLogLogs are out, then how about Cuckoo Filters? They are similar to Bloom Filters,
but they are designed differently as inspired by cuckoo hashing, supports deletion, and takes
approximately the same space. https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf See
https://bdupras.github.io/filter-tutorial/ for an interactive summary. You can also estimate
a count with Cuckoo Filters unlike standard Bloom Filters.

...for reasonably large sized sets, for the same false positive rate as a corresponding Bloom
filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower
on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters
cannot do). -Michael Mitzenmacher (2014)

> Add build-in BloomFilterCount on TableAPI&SQL
> ---------------------------------------------
>                 Key: FLINK-7465
>                 URL: https://issues.apache.org/jira/browse/FLINK-7465
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: sunjincheng
>            Assignee: sunjincheng
>         Attachments: bloomfilter.png
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different
hash functions defined, each of which maps or hashes some set element to one of the m array
positions, generating a uniform random distribution. Typically, k is a constant, much smaller
than m, which is proportional to the number of elements to be added; the precise choice of
k and the constant of proportionality of m are determined by the intended false positive rate
of the filter.
> To add an element, feed it to each of the k hash functions to get k array positions.
Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of the k hash
functions to get k array positions. If any of the bits at these positions is 0, the element
is definitely not in the set – if it were, then all the bits would have been set to 1 when
it was inserted. If all are 1, then either the element is in the set, or the bits have by
chance been set to 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show
the positions in the bit array that each set element is mapped to. The element w is not in
the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure,
m = 18 and k = 3. The sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)

This message was sent by Atlassian JIRA

View raw message