This is an automated email from the ASF dualhosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquetformat.git
The following commit(s) were added to refs/heads/master by this push:
new fa53342 PARQUET1630: Clarify the Bloom filter algorithm (#147)
fa53342 is described below
commit fa53342948f79a71d040402a9b44b5413e7bfa6a
Author: Jim Apple <jbapple@apache.org>
AuthorDate: Thu Aug 8 17:18:56 2019 0700
PARQUET1630: Clarify the Bloom filter algorithm (#147)
The specific questions answered from
https://lists.apache.org/thread.html/82d2e50f8c1007720564c5dc64aeae7947e949f3954a83436dc36760@%3Cdev.parquet.apache.org%3E
are
How is the bloom filter block selected from the 32 mostsignificant
bits from of the hash function? These details must be in the spec and
not in papers linked from the spec.
How is the number of blocks determined? From the overall filter size?
I think that the exact procedure for a lookup in each block should be
covered in a section, followed by a section for how to perform a look
up in the multiblock filter. The wording also needs to be cleaned up
so that it is always clear whether the filter that is referenced is a
block or the multiblock filter.
The spec should give more detail on how to choose the number of blocks
and on false positive rates. The sentence with “11.54 bits for each
distinct value inserted into the filter” is vague: is this the
multiblock filter? Why is a 1% falsepositive rate “recommended”?
I think it is okay to use 0.5% as each block’s falsepositive rate,
but then this should state how to achieve an overall falsepositive
rate as a function of the number of distinct values.

BloomFilter.md  251 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 177 insertions(+), 74 deletions()
diff git a/BloomFilter.md b/BloomFilter.md
index e51b412..8ce22ae 100644
 a/BloomFilter.md
+++ b/BloomFilter.md
@@ 33,11 +33,6 @@ overapproximates a set. It can respond to membership queries with either
"defini
"probably yes", where the probability of false positives is configured when the filter is
initialized. Bloom filters do not have false negatives.
A Bloom filter typically contains a bit array of `m` bits, `k` different hash functions,
and `n` elements inserted. Each of hash functions maps or hashes some set element to one
of the
`m` array positions, generating a uniform random distribution. To add an element, feed it
to each
of the `k` hash functions to get `k` array positions. Set the bits at all these positions
to 1.

Because Bloom filters are small compared to dictionaries, they can be used for predicate
pushdown even in columns with high cardinality and when space is at a premium.
@@ 49,87 +44,195 @@ pushdown even in columns with high cardinality and when space is at a
premium.
filters attached or when executing nonselective queries.
### Technical Approach
The initial Bloom filter algorithm in Parquet is implemented using a combination of two
Bloom filter techniques.

First, the block Bloom filter algorithm from Putze et al.'s [Cache, Hash and SpaceEfficient
Bloom filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfiltersjea.pdf) is
used.
The block Bloom filter consists of a sequence of small Bloom filters, each of which can fit
into one cacheline. For best performance, those small Bloom filters are loaded into memory
cachelinealigned. For each potential element, the first hash value selects the Bloom filter
block
to be used. Additional hash values are then used to set or test bits as usual, but only inside
this one block. As for Parquet's initial implementation, each block is 256 bits. When inserting
or
finding value, the first hash of that value is used to index into the array of blocks and
pick a
single one. This single block is then used for the remaining part of the operation.

Second, the remaining part of the operation within the single block uses the folklore split
Bloom
filter technique, as described in section 2.1 of [Network Applications of Bloom Filters:
A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf). Instead of having
one
array of size `m` shared among all the hash functions, each hash function has a range of
`m/k`
consecutive bit locations disjoint from all the others. The total number of bits is still
`m`, but the bits are divided equally among the `k` hash functions. This approach
can be useful for implementation reasons, for example, dividing the bits among the hash functions
may make parallelization of array accesses easier and take utilization of SSE. As for Parquet's
implementation, it divides the 256 bits in each block up into eight contiguous 32bit lanes
and
sets or checks one bit in each lane.

#### Algorithm
In the initial algorithm, the most significant 32 bits from the hash value are used as the
index to select a block from bitset. The lower 32 bits of the hash value, along with eight
constant salt values, are used to compute the bit to set in each lane of the block.
The salt values are used to construct following different hash functions as described in
[Multiplicative hashing](https://en.wikipedia.org/wiki/Hash_function#Multiplicative_hashing):

hash<sub>i</sub>(x) = salt<sub>i</sub> * x >> y

Since the target hash value is `[0, 31]`, so we right shift `y = 27` bits. As a result, eight
hash values, which are indexes of the tiny bloom filter, are generated.

```c
// 8 SALT values used to compute bit pattern
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// key: the lower 32 bits of hash result
// mask: the output bit pattern for a tiny Bloom filter
void Mask(uint32_t key, uint32_t mask[8]) {
 for (int i = 0; i < 8; ++i) {
 mask[i] = key * SALT[i];
+
+The section describes split block Bloom filters, which is the first
+(and, at time of writing, only) Bloom filter representation supported
+in Parquet.
+
+First we will describe a "block". This is the main component split
+block Bloom filters are composed of.
+
+Each block is 256 bits, broken up into eight contiguous "words", each
+consisting of 32 bits. Each word is thought of as an array of bits;
+each bit is either "set" or "not set".
+
+When initialized, a block is "empty", which means each of the eight
+component words has no bits set. In addition to initialization, a
+block supports two other operations: `block_insert` and
+`block_check`. Both take a single unsigned 32bit integer as input;
+`block_insert` returns no value, but modifies the block, while
+`block_check` returns a boolean. The semantics of `block_check` are
+that it must return `true` if `block_insert` was previously called on
+the block with the same argument, and otherwise it returns `false`
+with high probability. For more details of the probability, see below.
+
+The operations `block_insert` and `block_check` depend on some
+auxiliary artifacts. First, there is a sequence of eight odd unsigned
+32bit integer constants called the `salt`. Second, there is a method
+called `mask` that takes as its argument a single unsigned 32bit
+integer and returns a block in which each word has exactly one bit
+set.
+
+```
+unsigned int32 salt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
+ 0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
+ 0x9efc4947U, 0x5c6bfb31U}
+
+block mask(unsigned int32 x) {
+ block result
+ for i in [0..7] {
+ unsigned int32 y = x * salt[i]
+ result.getWord(i).setBit(y >> 27)
}
 for (int i = 0; i < 8; ++i) {
 mask[i] = mask[i] >> 27;
+ return result
+}
+```
+
+Since there are eight words in the block and eight integers in the
+salt, there is a correspondence between them. To set a bit in the nth
+word of the block, `mask` first multiplies its argument by the nth
+integer in the `salt`, keeping only the least significant 32 bits of
+the 64bit product, then divides that 32bit unsigned integer by 2 to
+the 27th power, denoted above using the C language's right shift
+operator "`>>`". The resulting integer is between 0 and 31,
+inclusive. That integer is the bit that gets set in the word in the
+block.
+
+From the `mask` operation, `block_insert` is defined as setting every
+bit in the block that was also set in the result from mask. Similarly,
+`block_check` returns `true` when every bit that is set in the result
+of `mask` is also set in the block.
+
+```
+void block_insert(block b, unsigned int32 x) {
+ block masked = mask(x)
+ for i in [0..7] {
+ for j in [0..31] {
+ if (masked.getWord(i).isSet(j)) {
+ b.getWord(i).setBit(j)
+ }
+ }
}
 for (int i = 0; i < 8; ++i) {
 mask[i] = UINT32_C(1) << mask[i];
+}
+```
+
+```
+boolean block_check(block b, unsigned int32 x) {
+ block masked = mask(x)
+ for i in [0..7] {
+ for j in [0..31] {
+ if (masked.getWord(i).isSet(j)) {
+ if (not b.getWord(i).setBit(j)) {
+ return false
+ }
+ }
+ }
}
+ return true
+}
+```
+
+The reader will note that a block, as defined here, is actually a
+special kind of Bloom filter. Specifically it is a "split" Bloom
+filter, as described in section 2.1 of [Network Applications of Bloom
+Filters: A
+Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf). The
+use of multiplication by an odd constant and then shifting right is a
+method of hashing integers as described in section 2.2 of
+Dietzfelbinger et al.'s [A reliable randomized algorithm for the
+closestpair
+problem](http://hjemmesider.diku.dk/~jyrki/Paper/CP11.4.1997.pdf).
+
+This closes the definition of a block and the operations on it.
+
+Now that a block is defined, we can describe Parquet's split block
+Bloom filters. A split block Bloom filter (henceforth "SBBF") is
+composed of `z` blocks, where `z` is a power of two greater than or
+equal to one and less than 2 to the 27th power. When an SBBF is
+initialized, each block in it is initialized, which means each bit in
+each word in each block in the SBBF is unset.
+
+In addition to initialization, an SBBF supports an operation called
+`filter_insert` and one called `filter_check`. Each takes as an
+argument a 64bit unsigned integer; `filter_check` returns a boolean
+and `filter_insert` does not return a value, but does modify the SBBF.
+
+The `filter_insert` operation first uses the most significant 32 bits
+of its argument, modulo the number of blocks, to select a block to
+operate on. It then uses the least significant 32 bits of the argument
+to `filter_insert` as an argument to `block_insert` called on that
+block.
+
+The `filter_check` operation uses the same method as `filter_insert`
+to select a block to operate on, then uses the least significant 32
+bits of its argument as an argument to `block_check` called on that
+block, returning the result.
+
+In the pseudocode below, the modulus operator is represented with the C
+language's "`%`" operator. The "`>>`" operator is used to denote the
+conversion of an unsigned 64bit integer to an unsigned 32bit integer
+containing only the most significant 32 bits, and C's cast operator
+"`(unsigned int32)`" is used to denote the conversion of an unsigned
+64bit integer to an unsigned 32bit integer containing only the least
+significant 32 bits.
+
+```
+void filter_insert(SBBF filter, unsigned int64 x) {
+ block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+ block_insert(b, (unsigned int32)x)
}
+```
```
+boolean filter_check(SBBF filter, unsigned int64 x) {
+ block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+ return block_check(b, (unsigned int32)x)
+}
+```
+
+The use of blocks is from Putze et al.'s [Cache, Hash and
+SpaceEfficient Bloom
+filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfiltersjea.pdf)
#### Hash Function
The function used to hash values in the initial implementation is
[xxHash](https://cyan4973.github.io/xxHash/), using the function XXH64 with a
seed of 0 and [following the specification version
+To use an SBBF for values of arbitrary Parquet types, we apply a hash
+function to that value  at the time of writing,
+[xxHash](https://cyan4973.github.io/xxHash/), using the function XXH64
+with a seed of 0 and [following the specification version
0.1.1](https://github.com/Cyan4973/xxHash/blob/v0.7.0/doc/xxhash_spec.md).
#### Build a Bloom filter
The fact that exactly eight bits are checked during each lookup means that these filters
are most space efficient when used with an expected false positive rate of about
0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
into the filter.
+#### Sizing an SBBF
To calculate the size the filter should be for another false positive rate `p`, use the
following formula. The output is in bits per distinct element:
+The `check` operation in SBBFs can return `true` for an argument that
+was never inserted into the SBBF. These are called "false
+positives". The "false positive probabilty" is the probability that
+any given hash value that was never `insert`ed into the SBBF will
+cause `check` to return `true` (a false positive). There is not a
+simple closedform calculation of this probability, but here is an
+example:
```c
c = 8 / log(1  pow(p, 1.0 / 8))
```
+A filter that uses 1024 blocks and has had 26,214 hash values
+`insert`ed will have a false positive probabilty of around 1.26%. Each
+of those 1024 blocks occupies 256 bits of space, so the total space
+usage is 262,144. That means that the ratio of bits of space to hash
+values is 10to1. Adding more hash values increases the denominator
+and lowers the ratio, which increases the false positive
+probability. For instance, inserting twice as many hash values
+(52,428) decreases the ratio of bits of space per hash value inserted
+to 5to1 and increases the false positive probability to
+18%. Inserting half as many hash values (13,107) increases the ratio
+of bits of space per hash value inserted to 20to1 and decreases the
+false positive probability to 0.04%.
+
+Here are some sample values of the ratios needed to achieve certain
+false positive rates:
In the real scenario, the size of the Bloom filter and the false positive rate may vary from
different implementations. It is recommended to set false positive to 1% so that a Bloom
filter
with 1.2MB size can contain one million distinct values, which should satisfy most cases
according
to default row group size. It is also recommended to expose the ability for setting these
parameters to end users.
+ Bits of space per `insert`  False positive probability 
+    
+ 6.0  10 % 
+ 10.5  1 % 
+ 16.9  0.1 % 
+ 26.4  0.01 % 
+ 41  0.001 % 
#### File Format
The Bloom filter data of a column chunk, which contains the size of the filter in bytes,
the
