From ziva...@apache.org
Subject [parquet-format] branch encryption updated: PARQUET-1618: Update encryption spec for bloom filter encryption (#141)
Date Mon, 15 Jul 2019 07:09:04 GMT
This is an automated email from the ASF dual-hosted git repository.

zivanfi pushed a commit to branch encryption
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/encryption by this push:
     new 028b12a  PARQUET-1618: Update encryption spec for bloom filter encryption (#141)
028b12a is described below

commit 028b12a83ee434a7cd3c443b42d35c328ea8c708
Author: ggershinsky <ggershinsky@users.noreply.github.com>
AuthorDate: Mon Jul 15 10:08:59 2019 +0300

    PARQUET-1618: Update encryption spec for bloom filter encryption (#141)
---
 Encryption.md | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/Encryption.md b/Encryption.md
index 63dfadf..a9c54c0 100644
--- a/Encryption.md
+++ b/Encryption.md
@@ -53,7 +53,8 @@ of write/read operations.
 
 ## 3 Technical Approach
 Parquet files are comprised of separately serialized components: pages, page headers, column
-indexes, offset indexes, a footer. Parquet encryption mechanism denotes them as “modules”
+indexes, offset indexes, bloom filter headers and bitsets, the footer. Parquet encryption
+mechanism denotes them as “modules” 
 and encrypts each module separately – making it possible to fetch and decrypt the footer,
 find the offset of required pages, fetch the pages and decrypt the data. In this document,
 the term “footer” always refers to the regular Parquet footer - the `FileMetaData` structure,
@@ -78,15 +79,12 @@ in order to verify its integrity. New footer fields keep an
 information about the file encryption algorithm and the footer signing key.
 
 For encrypted columns, the following modules are always encrypted, with the same column key:
-pages and  page headers (both dictionary and data), column indexes, offset indexes.  If the
+pages and  page headers (both dictionary and data), column indexes, offset indexes, bloom filter 
+headers and bitsets.  If the 
 column key is different from the footer encryption key, the column metadata is serialized
 separately and encrypted with the column key. In this case, the column metadata is also 
 considered to be a module.  
 
-There are two module types: data modules (pages) and Thrift modules (all Thrift structures that 
-are serialized separately).
-
-
 ## 4 Encryption Algorithms and Keys
 Parquet encryption algorithms are based on the standard AES ciphers for symmetric encryption.
 AES is supported in Intel and other CPUs with hardware acceleration of crypto operations
@@ -142,8 +140,8 @@ tag used to verify the ciphertext and AAD integrity.
 
 
 #### 4.2.2 AES_GCM_CTR_V1
-In this Parquet algorithm, all Thrift modules are encrypted with the GCM cipher, as described
-above, but the pages are encrypted by the CTR cipher without padding. This allows to encrypt/decrypt
+In this Parquet algorithm, all modules except pages are encrypted with the GCM cipher, as described 
+above. The pages are encrypted by the CTR cipher without padding. This allows to encrypt/decrypt
 the bulk of the data faster, while still verifying the metadata integrity and making 
 sure the file has not been replaced with a wrong version. However, tampering with the 
 page data might go unnoticed. The AES CTR cipher
@@ -247,6 +245,8 @@ The following module types are defined:
    * Dictionary PageHeader (5)
    * ColumnIndex (6)
    * OffsetIndex (7)
+   * BloomFilter Header (8)
+   * BloomFilter Bitset (9)
 
 
 |                      | Internal File ID | Module type | Row group ordinal | Column ordinal| Page ordinal|
@@ -259,14 +259,16 @@ The following module types are defined:
 | Dictionary PageHeader|       yes        |   yes (5)   |        yes        |      yes      |     no      |
 | ColumnIndex          |       yes        |   yes (6)   |        yes        |      yes      |     no      |
 | OffsetIndex          |       yes        |   yes (7)   |        yes        |      yes      |     no      |
+| BloomFilter Header   |       yes        |   yes (8)   |        yes        |      yes      |     no      |
+| BloomFilter Bitset   |       yes        |   yes (9)   |        yes        |      yes      |     no      |
 
 
 
 ## 5 File Format
 
 ### 5.1 Encrypted module serialization
-The Thrift modules are encrypted with the GCM cipher. In the AES_GCM_V1 algorithm, 
-the column pages (data modules) are also encrypted with AES GCM. For each module, the GCM encryption 
+All modules, except column pages, are encrypted with the GCM cipher. In the AES_GCM_V1 algorithm,
+the column pages are also encrypted with AES GCM. For each module, the GCM encryption 
 buffer is comprised of a nonce, ciphertext and tag, described in the Algorithms section. The length of 
 the encryption buffer (a 4-byte little endian) is written to the output stream, followed by the buffer itself.
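
The module framing described in section 5.1 of the diff (a 4-byte little-endian length, followed by a buffer made of nonce, ciphertext and tag) can be sketched roughly as below. This is only an illustration, not the Parquet implementation: the function names are hypothetical, the 12-byte nonce and 16-byte tag lengths are assumptions (the diff defers them to the Algorithms section), the length field is assumed to be unsigned, and no actual encryption is performed.

```python
import struct

# Assumed sizes; the diff only says they are "described in the Algorithms section".
NONCE_LEN = 12
TAG_LEN = 16


def frame_module(nonce: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    """Hypothetical helper: prefix the GCM buffer with its 4-byte little-endian length."""
    buffer = nonce + ciphertext + tag
    # "<I" = little-endian unsigned 32-bit integer (unsigned is an assumption).
    return struct.pack("<I", len(buffer)) + buffer


def unframe_module(stream: bytes, offset: int = 0):
    """Hypothetical helper: read one framed module starting at `offset`.

    Returns (nonce, ciphertext, tag, next_offset).
    """
    (length,) = struct.unpack_from("<I", stream, offset)
    start = offset + 4
    buffer = stream[start:start + length]
    if len(buffer) != length or length < NONCE_LEN + TAG_LEN:
        raise ValueError("truncated or malformed module buffer")
    nonce = buffer[:NONCE_LEN]
    ciphertext = buffer[NONCE_LEN:length - TAG_LEN]
    tag = buffer[length - TAG_LEN:]
    return nonce, ciphertext, tag, start + length


if __name__ == "__main__":
    framed = frame_module(b"\x01" * NONCE_LEN, b"abc", b"\x02" * TAG_LEN)
    print(framed[:4].hex())           # length prefix
    print(unframe_module(framed)[1])  # recovered ciphertext bytes
```

Because each module carries its own length prefix, a reader can step from one module to the next using the returned `next_offset` without decrypting anything.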
 

