This is an automated email from the ASF dual-hosted git repository.
mdeepak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 54839ad PARQUET-41: Add Bloom filter (#112)
54839ad is described below
commit 54839ad5e04314c944fed8aa4bc6cf15e4a58698
Author: Chen, Junjie <cjjnjust@hotmail.com>
AuthorDate: Fri Oct 12 08:55:08 2018 +0800
PARQUET-41: Add Bloom filter (#112)
* PARQUET-41: Add Bloom filter
* Grammar and structure tweaking for Bloom filter prose.

BloomFilter.md                 | 120 +++++++++++++++++++++++++++++++++++++++++
src/main/thrift/parquet.thrift |  37 +++++++++++++
2 files changed, 157 insertions(+)
diff --git a/BloomFilter.md b/BloomFilter.md
new file mode 100644
index 0000000..be27aef
--- /dev/null
+++ b/BloomFilter.md
@@ -0,0 +1,120 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+Parquet Bloom Filter
+===
+### Problem statement
+In their current format, column statistics and dictionaries can be used for predicate
+pushdown. Statistics include minimum and maximum values, which can be used to filter out
+values not in the range. Dictionaries are more specific, and readers can filter out values
+that are between min and max but not in the dictionary. However, when there are too many
+distinct values, writers sometimes choose not to add dictionaries because of the extra
+space they occupy. This leaves columns with large cardinalities and widely separated min
+and max without support for predicate pushdown.
+
+A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond
+to membership queries with either "definitely no" or "probably yes", where the probability
+of false positives is configured when the filter is initialized. Bloom filters do not have
+false negatives.
+
+Because Bloom filters are small compared to dictionaries, they can be used for predicate
+pushdown even in columns with high cardinality and when space is at a premium.
+
+### Goal
+* Enable predicate pushdown for high-cardinality columns while using less space than
+ dictionaries.
+
+* Induce no additional I/O overhead when executing queries on columns without Bloom
+ filters attached or when executing non-selective queries.
+
+### Technical Approach
+The initial Bloom filter algorithm in Parquet is implemented using a combination of two
+Bloom filter techniques.
+
+First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
+Space-Efficient Bloom filters"[2] is used. This divides a filter into many tiny Bloom
+filters, each one of which is called a "block". In Parquet's initial implementation, each
+block is 256 bits. When inserting or finding a value, part of the hash of that value is
+used to index into the array of blocks and pick a single one. This single block is then
+used for the remaining part of the operation.
+
+Second, within each block, this implementation uses the folklore split Bloom filter
+technique, as described in section 2.1 of "Network Applications of Bloom Filters: A
+Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes
+and sets or checks one bit in each lane.
+
+#### Algorithm
+In the initial algorithm, the most significant 32 bits from the hash value are used as the
+index to select a block from the bitset. The lower 32 bits of the hash value, along with
+eight constant salt values, are used to compute the bit to set in each lane of the block.
+The salt and lower 32 bits are combined using the multiply-shift[3] hash function:
+
+```c
+// 8 SALT values used to compute bit pattern
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
+                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+
+// key: the lower 32 bits of hash result
+// mask: the output bit pattern for a tiny Bloom filter
+void Mask(uint32_t key, uint32_t mask[8]) {
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = key * SALT[i];
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = mask[i] >> 27;
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = UINT32_C(1) << mask[i];
+  }
+}
+
+```
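Taken together, the block selection and the per-lane mask amount to the following sketch. `Block`, `BlockInsert`, and `BlockCheck` are illustrative names, not part of the format, and mapping the upper 32 hash bits to a block index with a modulo is just one possible realization of the selection described above:

```c
#include <stdbool.h>
#include <stdint.h>

// SALT constants as above; Mask() condensed into a single loop,
// equivalent to the three-loop version in the specification text.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

static void Mask(uint32_t key, uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    mask[i] = UINT32_C(1) << ((key * SALT[i]) >> 27);
  }
}

// One 256-bit block: eight 32-bit lanes.
typedef struct {
  uint32_t lanes[8];
} Block;

// Insert: the upper 32 bits of the 64-bit hash pick a block, the lower
// 32 bits drive Mask(); one bit is set in each of the eight lanes.
void BlockInsert(Block *blocks, uint32_t num_blocks, uint64_t hash) {
  Block *b = &blocks[(uint32_t)(hash >> 32) % num_blocks];
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);
  for (int i = 0; i < 8; ++i) {
    b->lanes[i] |= mask[i];
  }
}

// Check: "false" means definitely absent; "true" means probably present.
bool BlockCheck(const Block *blocks, uint32_t num_blocks, uint64_t hash) {
  const Block *b = &blocks[(uint32_t)(hash >> 32) % num_blocks];
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);
  for (int i = 0; i < 8; ++i) {
    if ((b->lanes[i] & mask[i]) == 0) {
      return false;
    }
  }
  return true;
}
```

Note that each lookup touches only a single 256-bit block, which is the cache-friendliness the block-based design is after.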
+
+#### Hash Function
+The function used to hash values in the initial implementation is MurmurHash3[4], using
+the least-significant 64 bits of the 128-bit version of the function on the x86-64
+platform. Note that the function produces different values on different architectures, so
+implementors must be careful to use the version specific to x86-64. That function can be
+emulated on different platforms without difficulty.
+
+#### Build a Bloom filter
+The fact that exactly eight bits are checked during each lookup means that these filters
+are most space efficient when used with an expected false positive rate of about
+0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
+into the filter.
+
+To calculate the size the filter should be for another false positive rate `p`, use the
+following formula. The output is in bits per distinct element:
+
+```c
+-8 / log(1 - pow(p, 1.0 / 8));
+```
+
+#### File Format
+The Bloom filter data of a column is stored at the beginning of its column chunk in the
+row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter is
+stored with a header containing the size of the filter in bytes, the algorithm, and the
+hash function.
+
+### Reference
+1. [Bloom filter (Wikipedia)](https://en.wikipedia.org/wiki/Bloom_filter)
+2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfiltersjea.pdf)
+3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP11.4.1997.ps)
+4. [MurmurHash (Wikipedia)](https://en.wikipedia.org/wiki/MurmurHash)
+5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 6c9011b..378aa47 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -475,6 +475,7 @@ enum PageType {
INDEX_PAGE = 1;
DICTIONARY_PAGE = 2;
DATA_PAGE_V2 = 3;
+ BLOOM_FILTER_PAGE = 4;
}
/**
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
8: optional Statistics statistics;
}
+/** Block-based algorithm type annotation. **/
+struct SplitBlockAlgorithm {}
+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+ /** Block-based Bloom filter. **/
+ 1: SplitBlockAlgorithm BLOCK;
+}
+/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
+ * repo by Austin Appleby.
+ **/
+struct Murmur3 {}
+/**
+ * The hash function used in Bloom filter. This function takes the hash of a column value
+ * using plain encoding.
+ **/
+union BloomFilterHash {
+ /** Murmur3 Hash Strategy. **/
+ 1: Murmur3 MURMUR3;
+}
+/**
+ * The Bloom filter header is stored at the beginning of the Bloom filter data of each
+ * column and is followed by its bitset.
+ **/
+struct BloomFilterPageHeader {
+ /** The size of bitset in bytes **/
+ 1: required i32 numBytes;
+ /** The algorithm for setting bits. **/
+ 2: required BloomFilterAlgorithm algorithm;
+ /** The hash function used for Bloom filter. **/
+ 3: required BloomFilterHash hash;
+}
+
struct PageHeader {
/** the type of the page: indicates which of the *_header fields is set **/
1: required PageType type
@@ -574,6 +607,7 @@ struct PageHeader {
6: optional IndexPageHeader index_page_header;
7: optional DictionaryPageHeader dictionary_page_header;
8: optional DataPageHeaderV2 data_page_header_v2;
+ 9: optional BloomFilterPageHeader bloom_filter_page_header;
}
/**
@@ -660,6 +694,9 @@ struct ColumnMetaData {
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;
+
+ /** Byte offset from beginning of file to Bloom filter data. **/
+ 14: optional i64 bloom_filter_offset;
}
struct ColumnChunk {
