[ https://issues.apache.org/jira/browse/FLINK7465?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=16137425#comment16137425
]
Fabian Hueske edited comment on FLINK7465 at 8/28/17 10:20 AM:

I'm sorry, I confused count-min sketches (for approximate group counts) and HyperLogLog (for
approximate distinct counts).
I assume the goal of the BloomFilterCount function is to (approximately) count the number
of distinct values. In contrast to HyperLogLog, Bloom filters are not specifically designed
for approximate distinct counting but for approximate membership testing. AFAIK, Bloom filters
should be more precise for low distinct cardinalities but HyperLogLog should provide much
better results for larger cardinalities.
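
To make the trade-off concrete, here is a rough sketch of one common way to read a distinct count out of a Bloom filter: n ≈ -(m/k) * ln(1 - X/m), where m is the number of bits, k the number of hash functions and X the number of bits currently set. The ticket does not say which estimator BloomFilterCount would use, so this is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Flink: estimates the number of distinct
// inserted elements from how full the Bloom filter's bit array is.
final class BloomCardinality {

    // m = number of bits, k = number of hash functions, setBits = bits equal to 1.
    static double estimate(long m, int k, long setBits) {
        if (setBits >= m) {
            // A saturated filter carries no information about the count any more.
            return Double.POSITIVE_INFINITY;
        }
        return -((double) m / k) * Math.log(1.0 - (double) setBits / m);
    }

    public static void main(String[] args) {
        long m = 1_000_000L;
        int k = 4;
        // Sparse filter: the estimate stays close to setBits / k.
        System.out.println(estimate(m, k, 400));
        // Nearly full filter: a few extra set bits move the estimate a lot.
        System.out.println(estimate(m, k, 999_000));
        System.out.println(estimate(m, k, 999_900));
    }
}
```

For a fixed m the estimate grows without bound as the bit array fills up, which matches the point above that a Bloom filter works well for low cardinalities and degrades for larger ones.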
IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty nice. OTOH, multiple
RocksDB point lookups might also be more expensive than a single lookup with larger serialization
payload (the deserialization logic for byte arrays shouldn't be very costly).
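
A minimal sketch of what the split could look like, assuming the bitmask is cut into 64-bit words keyed by word index. A plain HashMap stands in for the state backend here, and the class name and lookup counter are hypothetical; the counter is only there to show that every bit access becomes one point lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a bitmask stored as separate long values,
// one map entry per 64-bit word (the map stands in for keyed state).
final class ChunkedBitmask {
    private final Map<Integer, Long> words = new HashMap<>();
    private int lookups = 0;

    // Sets one bit; touches exactly one word. Assumes bitIndex >= 0.
    void set(int bitIndex) {
        int word = bitIndex >>> 6;           // bitIndex / 64
        long mask = 1L << (bitIndex & 63);   // bitIndex % 64
        lookups++;
        words.merge(word, mask, (a, b) -> a | b);
    }

    // Reads one bit; a missing word is treated as all zeros.
    boolean get(int bitIndex) {
        lookups++;
        long w = words.getOrDefault(bitIndex >>> 6, 0L);
        return (w & (1L << (bitIndex & 63))) != 0;
    }

    // Number of word accesses so far (one per set/get call).
    int lookups() {
        return lookups;
    }
}
```

With k hash functions, adding or testing one element costs up to k such lookups, versus one lookup plus deserialization of the whole byte array in the single-value layout.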
> Add buildin BloomFilterCount on TableAPI&SQL
> 
>
> Key: FLINK7465
> URL: https://issues.apache.org/jira/browse/FLINK7465
> Project: Flink
> Issue Type: Subtask
> Components: Table API & SQL
> Reporter: sunjincheng
> Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA, use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different
hash functions defined, each of which maps or hashes some set element to one of the m array
positions, generating a uniform random distribution. Typically, k is a constant, much smaller
than m, which is proportional to the number of elements to be added; the precise choice of
k and the constant of proportionality of m are determined by the intended false positive rate
of the filter.
> To add an element, feed it to each of the k hash functions to get k array positions.
Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of the k hash
functions to get k array positions. If any of the bits at these positions is 0, the element
is definitely not in the set – if it were, then all the bits would have been set to 1 when
it was inserted. If all are 1, then either the element is in the set, or the bits have by
chance been set to 1 during the insertion of other elements, resulting in a false positive.
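
The add and query steps described above can be sketched as follows. This is only an illustration, not the Hive or Flink implementation: the class name is hypothetical, and the k positions are derived from two halves of one mixed hash (double hashing), which is an assumption made here to keep the example short.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter: m bits, k derived hash positions.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // 64-bit mixing step to spread Object.hashCode() over all bits.
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Array position of the i-th hash function for this element, in [0, m).
    private int position(Object element, int i) {
        long h = mix(element.hashCode());
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, m);
    }

    // Add: set the bits at all k positions to 1.
    void add(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    // Query: any 0 bit means "definitely not in the set";
    // all 1 bits means "in the set, or a false positive".
    boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Number of bits currently set to 1.
    int setBits() {
        return bits.cardinality();
    }

    public static void main(String[] args) {
        SimpleBloomFilter f = new SimpleBloomFilter(18, 3); // m = 18, k = 3
        f.add("x");
        f.add("y");
        f.add("z");
        System.out.println(f.mightContain("x")); // always true for inserted elements
        System.out.println(f.mightContain("w")); // false, unless w is a false positive
    }
}
```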
> An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show
the positions in the bit array that each set element is mapped to. The element w is not in
the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure,
m = 18 and k = 3. The sketch is as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. https://github.com/apache/hive/blob/master/storageapi/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr], I would appreciate it if you could give me some advice. :)

