parquet-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ziva...@apache.org
Subject [parquet-format] branch master updated: PARUQET-1609: Update to use xxHash as hash strategy (#139)
Date Thu, 04 Jul 2019 15:34:07 GMT
This is an automated email from the ASF dual-hosted git repository.

zivanfi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 8f1783e  PARUQET-1609: Update to use xxHash as hash strategy (#139)
8f1783e is described below

commit 8f1783ec0b273e89c884b46c0f527d0a48321826
Author: Chen, Junjie <jimmyjchen@tencent.com>
AuthorDate: Thu Jul 4 23:34:02 2019 +0800

    PARUQET-1609: Update to use xxHash as hash strategy (#139)
---
 BloomFilter.md                 | 12 ++++++------
 src/main/thrift/parquet.thrift | 12 +++++++-----
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index be27aef..1be752a 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -87,11 +87,11 @@ void Mask(uint32_t key, uint32_t mask[8]) {
 ```
 
 #### Hash Function
-The function used to hash values in the initial implementation is MurmurHash3[4], using
-the least-significant 64 bits of the 128-bit version of the function on the x86-64
-platform. Note that the function produces different values on different architectures, so
-implementors must be careful to use the version specific to x86-64. That function can be
-emulated on different platforms without difficulty.
+The function used to hash values in the initial implementation is xxHash[4], using
+the least-significant 64 bits version of the function on the x86-64 platform. Note that
+the function produces different values on different architectures, so implementors must
+be careful to use the version specific to x86-64. That function can be emulated on
+different platforms without difficulty.
 
 #### Build a Bloom filter
 The fact that exactly eight bits are checked during each lookup means that these filters
@@ -116,5 +116,5 @@ hash function.
 1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
 2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
 3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
-4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
+4. [xxHash](https://cyan4973.github.io/xxHash/)
 5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index d09bb76..231705f 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -569,17 +569,19 @@ union BloomFilterAlgorithm {
   /** Block-based Bloom filter. **/
   1: SplitBlockAlgorithm BLOCK;
 }
-/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
- * repo by Austin Appleby.
+
+/** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
+ * algorithm. It uses 64 bits version of xxHash. 
  **/
-struct Murmur3Hash {}
+struct XxHash {}
+
 /** 
  * The hash function used in Bloom filter. This function takes the hash of a column value
  * using plain encoding.
  **/
 union BloomFilterHash {
-  /** Murmur3 Hash Strategy. **/
-  1: Murmur3Hash MURMUR3;
+  /** xxHash Strategy. **/
+  1: XxHash XXHASH;
 }
 /**
   * Bloom filter header is stored at beginning of Bloom filter data of each column


Mime
View raw message