parquet-commits mailing list archives

Subject [parquet-format] branch master updated: PARQUET-1630: Update Bloom filter format (#146)
Date Mon, 26 Aug 2019 23:27:37 GMT
This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository

The following commit(s) were added to refs/heads/master by this push:
     new 3fb10e0  PARQUET-1630: Update Bloom filter format (#146)
3fb10e0 is described below

commit 3fb10e00c2204bf1c6cc91e094c59e84cefcee33
Author: Chen, Junjie <>
AuthorDate: Tue Aug 27 07:27:32 2019 +0800

    PARQUET-1630: Update Bloom filter format (#146)
---                        |  18 ++++++++++++++----
 doc/images/FileLayoutBloomFilter1.png | Bin 0 -> 44025 bytes
 doc/images/FileLayoutBloomFilter2.png | Bin 0 -> 34018 bytes
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/ b/
index b8208c8..2fa24e9 100644
--- a/
+++ b/
@@ -264,10 +264,13 @@ false positive rates:
 |                       41   |  0.001 %                   |
 #### File Format
-The Bloom filter data of a column chunk, which contains the size of the filter in bytes,
-algorithm, the hash function and the Bloom filter bitset, is stored near the footer. The
-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions
+Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
+bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
+header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+Here are the Bloom filter definitions in thrift:
 /** Block-based algorithm type annotation. **/
@@ -323,6 +326,13 @@ struct ColumnMetaData {
+The Bloom filters are grouped by row group and with data for each column in the same order as the file schema.
+The Bloom filter data can be stored before the page indexes after all row groups. The file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter2.png)
+Or it can be stored between row groups, the file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter1.png)
 #### Encryption
 In the case of columns with sensitive data, the Bloom filter exposes a subset of sensitive
 information such as the presence of value. Therefore the Bloom filter of columns with sensitive
diff --git a/doc/images/FileLayoutBloomFilter1.png b/doc/images/FileLayoutBloomFilter1.png
new file mode 100644
index 0000000..3b21738
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter1.png differ
diff --git a/doc/images/FileLayoutBloomFilter2.png b/doc/images/FileLayoutBloomFilter2.png
new file mode 100755
index 0000000..6bbf770
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter2.png differ
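
The layout this commit describes (a header that carries the bitset size in bytes, followed by the bitset, with filters grouped by row group and columns in schema order) can be sketched roughly as below. This is only an illustration under stated assumptions: the real header is defined in Thrift in the spec, whereas this sketch substitutes a hypothetical 4-byte little-endian length prefix, and the names `write_filter`, `read_filter`, and `write_all` are invented for the example.

```python
import io
import struct

# ASSUMPTION: stand-in for the Thrift-defined Bloom filter header.
# A 4-byte little-endian unsigned int holding the bitset size in bytes.
# This is NOT the actual Parquet encoding.
HEADER = struct.Struct("<I")


def write_filter(out, bitset: bytes) -> int:
    """Write header + bitset at the current position; return the start offset."""
    offset = out.tell()
    out.write(HEADER.pack(len(bitset)))
    out.write(bitset)
    return offset


def read_filter(src, offset: int) -> bytes:
    """Seek to a filter's offset, read the header, then read exactly that many bitset bytes."""
    src.seek(offset)
    raw = src.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("truncated Bloom filter header")
    (num_bytes,) = HEADER.unpack(raw)
    bitset = src.read(num_bytes)
    if len(bitset) != num_bytes:
        raise ValueError("truncated Bloom filter bitset")
    return bitset


def write_all(out, filters_by_row_group):
    """Write filters grouped by row group, columns in schema order.

    `filters_by_row_group` is a list of row groups, each a list of bitsets
    (one per column chunk). Returns the offsets in the same nested shape.
    """
    return [[write_filter(out, bitset) for bitset in row_group]
            for row_group in filters_by_row_group]
```

A reader would take each filter's offset from the column chunk metadata and call something like `read_filter(f, offset)`; because the header states the bitset length, no other size information is needed to read the bitset.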
