parquet-commits mailing list archives

Subject [parquet-format] branch encryption updated: PARQUET-1232: Encryption docs (#110)
Date Mon, 15 Oct 2018 06:23:43 GMT
This is an automated email from the ASF dual-hosted git repository.

zivanfi pushed a commit to branch encryption
in repository

The following commit(s) were added to refs/heads/encryption by this push:
     new 4bd026c  PARQUET-1232: Encryption docs (#110)
4bd026c is described below

commit 4bd026caff8816c4d2a4f0f7d6c75818896579d9
Author: ggershinsky <>
AuthorDate: Mon Oct 15 09:23:39 2018 +0300

    PARQUET-1232: Encryption docs (#110)
---
                         | 234 ++++++++++++++++++++++++++++++++++
                         |  16 ++-
 doc/images/FileLayoutEncryptionEF.jpg | Bin 0 -> 123128 bytes
 doc/images/FileLayoutEncryptionPF.jpg | Bin 0 -> 117345 bytes
 4 files changed, 248 insertions(+), 2 deletions(-)

diff --git a/ b/
new file mode 100644
index 0000000..156d9ce
--- /dev/null
+++ b/
@@ -0,0 +1,234 @@
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+# Parquet Modular Encryption
+Parquet files containing sensitive information can be protected by the modular
+encryption mechanism, which encrypts and authenticates the file data and metadata
+while preserving regular Parquet functionality (columnar projection,
+predicate pushdown, encoding and compression). The mechanism also enables column access
+control, by supporting encryption of different columns with different keys.
+## Problem Statement
+Existing data protection solutions (such as flat encryption of files, in-storage
+encryption, or use of an encrypting storage client) can be applied to Parquet files,
+but have various security or performance issues. An encryption mechanism integrated in
+the Parquet format allows for an optimal combination of data security, processing
+speed and access control granularity.
+## Goals
+1. Protect Parquet data and metadata by encryption, while enabling selective reads 
+(columnar projection, predicate push-down).
+2. Implement "client-side" encryption/decryption (storage client). The storage server 
+must not see plaintext data, metadata or encryption keys.
+3. Leverage authenticated encryption that allows clients to check the integrity of the
+retrieved data - making sure the file (or file parts) has not been replaced with a
+wrong version, or otherwise tampered with.
+4. Support column access control - by enabling different encryption keys for different 
+columns, and for the footer.
+5. Allow for partial encryption - encrypt only column(s) with sensitive data.
+6. Work with all compression and encoding mechanisms supported in Parquet.
+7. Support multiple encryption algorithms, to account for different security and 
+performance requirements.
+8. Enable two modes for metadata protection:
+   * full protection of file metadata
+   * partial protection of file metadata, that allows legacy readers to access unencrypted
+ columns in an encrypted file.
+9. Minimize overhead of encryption: in terms of size of encrypted files, and throughput
+of write/read operations.
+## Technical Approach
+Each Parquet module (footer, page headers, pages, column indexes, column metadata) is 
+encrypted separately. Then it is possible to fetch and decrypt the footer, find the 
+offset of a required page, fetch it and decrypt the data. In this document, the term
+“footer” always refers to the regular Parquet footer - the `FileMetaData` structure
+and its nested fields (row groups / column chunks).
+The results of compression of column pages are encrypted before being written to the
+output stream. A new Thrift structure, holding column crypto metadata, is added to
+the column chunks of encrypted columns. This metadata provides information about the
+column encryption keys.
+The results of Thrift serialization of metadata structures are encrypted before being
+written to the output stream.
+## Encryption algorithms
+Parquet encryption algorithms are based on the standard AES ciphers for symmetric 
+encryption. AES is supported in Intel and other CPUs with hardware acceleration of 
+crypto operations (“AES-NI”) - that can be leveraged by e.g. Java programs 
+(automatically via HotSpot), or C++ programs (via EVP-* functions in OpenSSL).
+Initially, two algorithms are implemented, one based on the GCM mode of AES, and the other
+on a combination of the GCM and CTR modes.
+AES-GCM is an authenticated encryption mode. Besides data confidentiality (encryption), it
+supports two levels of integrity verification / authentication: of the data (default), and
+of the data combined with an optional AAD (“additional authenticated data”). The default
+authentication makes it possible to verify that the data has not been tampered with. An AAD
+is free text that is authenticated together with the data. The user can, for example, pass
+the file name with its version (or creation timestamp) as the AAD, to verify that the file
+has not been replaced with an older version.
+Sometimes, hardware acceleration of AES is unavailable (e.g. in Java 8). Then AES crypto
+operations are implemented in software, and can be somewhat slow, becoming a performance
+bottleneck in certain workloads. AES-CTR is a regular (not authenticated) cipher.
+It is faster than AES-GCM, since it doesn’t perform integrity verification and doesn’t
+calculate the authentication tag. For applications running without AES acceleration and
+willing to compromise on content verification, AES-CTR can provide a boost in Parquet
+write/read throughput. The second Parquet algorithm encrypts the data content (pages)
+with AES-CTR, and the metadata (Thrift structures) with AES-GCM. This makes it possible to
+encrypt/decrypt the bulk of the data faster, while still verifying the metadata integrity
+and making sure the file has not been replaced with a wrong version. However, tampering
+with the page data might go unnoticed.
+The `AesGcmV1` and `AesGcmCtrV1` structures contain an optional `aad_metadata` field that
+can be used by a reader to retrieve the AAD string used for file encryption. The maximal
+allowed length of `aad_metadata` is 512 bytes.
+Parquet-mr/-cpp implementations use the RBG-based IV construction as defined in the NIST
+SP 800-38D document for the GCM ciphers (section 8.2.2).
+### AES_GCM_V1
+All modules are encrypted with the AES-GCM cipher. The authentication tags (16 bytes) are
+written after each ciphertext. The IVs (12 bytes) are written before each ciphertext.
+### AES_GCM_CTR_V1
+Thrift modules are encrypted with the AES-GCM cipher, as described above.
+The pages are encrypted with AES-CTR, where the IVs (16 bytes) are written before each
+ciphertext.
+## File Format
+The encrypted Parquet files have a different extension - “.parquet.encrypted”.
+The encryption is flexible - each column and the footer can be encrypted with the same key,
+with a different key, or not encrypted at all.
+The metadata structures (`PageHeader`, `ColumnIndex`, `OffsetIndex`; and sometimes
+`FileMetaData` and `ColumnMetaData`, see below) are encrypted after Thrift serialization.
+For each structure, the encryption buffer consists of an IV, ciphertext and tag, as
+described in the Algorithms section. The length of the encryption buffer (a 4-byte
+little-endian integer) is written to the output stream, followed by the buffer itself.
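The length-prefixed layout described above can be sketched as follows. This is a minimal illustration, not the parquet-mr/-cpp code: the names `frame_module` and `parse_module` are hypothetical, the IV and tag sizes are taken from the AES_GCM_V1 description (12 and 16 bytes), and reading the length as an unsigned integer is an assumption. The encryption itself is out of scope here; the functions only handle the byte framing.

```python
import struct

GCM_IV_BYTES = 12   # IV size stated for AES_GCM_V1
GCM_TAG_BYTES = 16  # authentication tag size stated for AES_GCM_V1


def frame_module(iv: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    """Return length (4-byte little endian) followed by IV | ciphertext | tag."""
    if len(iv) != GCM_IV_BYTES or len(tag) != GCM_TAG_BYTES:
        raise ValueError("unexpected IV or tag size")
    buffer = iv + ciphertext + tag
    # Assumption: the length is treated as unsigned ("<I").
    return struct.pack("<I", len(buffer)) + buffer


def parse_module(stream: bytes, offset: int = 0):
    """Split one framed module into (iv, ciphertext, tag, next_offset)."""
    (length,) = struct.unpack_from("<I", stream, offset)
    start = offset + 4
    buffer = stream[start:start + length]
    if len(buffer) != length or length < GCM_IV_BYTES + GCM_TAG_BYTES:
        raise ValueError("truncated or malformed encryption buffer")
    iv = buffer[:GCM_IV_BYTES]
    tag = buffer[-GCM_TAG_BYTES:]
    ciphertext = buffer[GCM_IV_BYTES:-GCM_TAG_BYTES]
    return iv, ciphertext, tag, start + length
```

A reader would pass the recovered IV, ciphertext and tag to its AES-GCM implementation, then continue parsing at `next_offset`.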
+The column pages (data and dictionary) are encrypted after compression. For each page,
+the encryption buffer consists of an IV, ciphertext and (in case of AES_GCM_V1) a
+tag, as described in the Algorithms section. Only the buffer is written to the output
+stream - there is no need to write the length of the buffer, since the length (size of the
+page after compression and encryption) is kept in the page headers.
+A `crypto_meta_data` field is set in each `ColumnChunk` in the encrypted columns.
+`ColumnCryptoMetaData` is a union - the actual structure is chosen depending on whether the
+column is encrypted with the footer key, or with a column-specific key. For the latter,
+a key metadata can be specified, with a maximal length of 512 bytes. Key metadata is a
+free-form byte array that can be used by a reader to retrieve the column encryption key.
+The Parquet file footer, and its nested structures, contain sensitive information - ranging
+from secret data (column statistics) to other information that can be exploited by an
+attacker (e.g. schema, num_values, key_value_metadata, column data offset and size,
+encoding and crypto_meta_data).
+This information is automatically protected when the footer and secret columns are encrypted
+with the same key. In other cases - when column(s) and the footer are encrypted with
+different keys; or column(s) are encrypted and the footer is not - an extra measure is
+required to protect the column-specific information in the file footer. In these cases,
+the column-specific information (kept in `ColumnMetaData` structures) is split from the
+footer, by utilizing the `required i64 file_offset` parameter in the `ColumnChunk`
+structure. This makes it possible to serialize each `ColumnMetaData` structure separately,
+and encrypt it with a column-specific key, thus protecting the column stats and other
+metadata.
+### Encrypted footer mode
+In files with sensitive column data, a good security practice is to encrypt not only the
+secret columns, but also the file footer metadata, with a separate footer key. This hides
+the file schema / column names, number of rows, key-value properties, column sort order,
+column data offset and size, list of encrypted columns and metadata of the column encryption
+keys. It also makes the footer tamper-proof.
+The columns encrypted with the same key as the footer don't split the `ColumnMetaData` from
+the `ColumnChunk`s, leaving it at the original location, `optional ColumnMetaData meta_data`.
+This field is not set for columns encrypted with a column-specific key.
+A Thrift-serialized `FileCryptoMetaData` structure is written after the footer. It contains
+information on the file encryption algorithm and on the footer (offset in
+the file, and an optional key metadata, with a maximal length of 512). Then
+the length of this structure is written, as a 4-byte little-endian integer, followed by a
+final magic string, "PARE".
+Only the `FileCryptoMetaData` is written as plaintext; all other file parts are protected
+(as needed) with appropriate keys.
+ ![File Layout - Encrypted footer](doc/images/FileLayoutEncryptionEF.jpg)
+### Plaintext footer mode
+This mode allows legacy Parquet versions (released before the encryption support) to access
+unencrypted columns in encrypted files - at a price of leaving certain metadata fields
+unprotected in these files (not encrypted or tamper-proofed). The plaintext footer mode can
+be useful during a transitional period in organizations where some frameworks can't be
+upgraded to a new Parquet library for a while. Data writers will upgrade and run with a new
+Parquet version, producing encrypted files in this mode. Data readers working with
+sensitive data will also upgrade to a new Parquet library. But other readers, that don't
+need the sensitive columns, can continue working with an older Parquet version. They will
+be able to access plaintext columns in encrypted files. A legacy reader, trying to access
+encrypted column data in a ".parquet.encrypted" file with a plaintext footer, will get an
+exception: specifically, a Thrift parsing exception on an encrypted `PageHeader` structure.
+Again, using legacy Parquet readers for encrypted files is a temporary solution.
+In the plaintext footer mode, the `optional ColumnMetaData meta_data` is set in the
+`ColumnChunk` structure for all columns, but is stripped of the statistics for the
+sensitive (encrypted) columns. These statistics are available for new readers with the
+column key - they fetch the split metadata and decrypt it to get all metadata values. The
+legacy readers are not aware of the split metadata; they parse the embedded field as usual.
+While they can't read the data of the encrypted columns, they read the metadata to extract
+the offset and size of the column data - required for input vectorization
+(see the next section).
+An `encryption_algorithm` field is set in the `FileMetaData` structure. Then the footer is
+written as usual, followed by its length (4-byte little-endian integer) and a final magic
+string, "PAR1".
+ ![File Layout: Plaintext footer](doc/images/FileLayoutEncryptionPF.jpg)
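The two file endings described above (a length plus "PARE" after `FileCryptoMetaData`, or a length plus "PAR1" after a regular footer) can be told apart by looking at the last 8 bytes of a file. The sketch below assumes the whole file is available as a `bytes` object and that the length is unsigned; `read_file_tail` and the mode labels are hypothetical names used only for illustration.

```python
import struct


def read_file_tail(data: bytes):
    """Return (mode, serialized_structure) based on the trailing magic string.

    For "PARE" the returned bytes are the serialized FileCryptoMetaData;
    for "PAR1" they are the serialized footer (FileMetaData).
    """
    if len(data) < 8:
        raise ValueError("file too short to contain a length and a magic string")
    magic = data[-4:]
    (length,) = struct.unpack("<I", data[-8:-4])  # 4-byte little-endian length
    end = len(data) - 8
    start = end - length
    if start < 0:
        raise ValueError("declared length exceeds file size")
    if magic == b"PARE":
        mode = "encrypted_footer"
    elif magic == b"PAR1":
        # Also matches ordinary unencrypted files; see the note after the code.
        mode = "par1_footer"
    else:
        raise ValueError("unknown magic string: %r" % magic)
    return mode, data[start:end]
```

A "PAR1" ending alone does not show whether any column is encrypted. According to the text above, that is indicated by the `encryption_algorithm` field inside the footer, which a reader would find only after Thrift-deserializing the returned bytes.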
+### New fields for vectorized readers
+Apache Spark and other vectorized readers slice a file by using the information on offset
+and size of each row group. In the legacy readers, this is done by running over a list of
+all column chunks in a row group, reading the relevant information from the column
+metadata, adding up the size values and picking the offset of the first column as the row
+group offset. However, vectorization needs only row group metadata, not metadata of
+individual columns. Also, in files written in an encrypted footer mode, the column metadata
+is not available to readers without the column key. Therefore, two new fields are added to
+the `RowGroup` structure - `file_offset` and `total_compressed_size` - that are set upon
+file writing, and allow vectorized readers to query a file even if keys to certain columns
+are not available ('hidden columns'). Naturally, the query itself should not try to access
+the hidden column data.
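As an illustration of how the two `RowGroup` fields remove the need for column metadata, the sketch below assigns row groups to an input split using only `file_offset` and `total_compressed_size`. The `RowGroupInfo` class and `row_groups_for_split` function are hypothetical, and the midpoint rule is one common assignment policy, assumed here for illustration rather than taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupInfo:
    # Mirrors the two new RowGroup fields named in the text.
    file_offset: int
    total_compressed_size: int


def row_groups_for_split(row_groups: List[RowGroupInfo],
                         split_start: int, split_end: int) -> List[int]:
    """Return indices of row groups whose midpoint lies in [split_start, split_end).

    Assumption: each row group belongs to the split containing its midpoint,
    so every row group is read by exactly one split.
    """
    selected = []
    for index, rg in enumerate(row_groups):
        midpoint = rg.file_offset + rg.total_compressed_size // 2
        if split_start <= midpoint < split_end:
            selected.append(index)
    return selected
```

No column chunk is inspected, so the assignment works even when some columns are hidden from the reader.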
+## Encryption overhead
+The size overhead of Parquet modular encryption is negligible, since most of the encryption
+operations are performed on pages (the minimal unit of Parquet data storage and compression).
+The overhead order of magnitude is ~1 byte added per 10,000 bytes of original data.
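The order of magnitude quoted above can be sanity-checked with a rough back-of-the-envelope calculation. The sketch assumes the AES_GCM_V1 sizes given earlier (12-byte IV, 16-byte tag), a 4-byte length prefix on each encrypted page header, exactly one page header per page, and a page of about 1 MB; it ignores the footer, the indexes and the compression ratio. The function name is hypothetical.

```python
GCM_IV_BYTES = 12
GCM_TAG_BYTES = 16
LENGTH_PREFIX_BYTES = 4


def gcm_overhead_per_page(page_bytes: int):
    """Return (added_bytes, data_bytes_per_added_byte) for one page plus its header."""
    # The page itself: IV before and tag after the ciphertext, no length prefix.
    page_overhead = GCM_IV_BYTES + GCM_TAG_BYTES
    # The page header: length prefix, then IV | ciphertext | tag.
    header_overhead = LENGTH_PREFIX_BYTES + GCM_IV_BYTES + GCM_TAG_BYTES
    added = page_overhead + header_overhead
    return added, page_bytes / added


added, ratio = gcm_overhead_per_page(1_000_000)
print(added, round(ratio))
```

Under these assumptions the encryption adds 60 bytes to each 1 MB page, which is in the same order of magnitude as the figure stated in the text.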
+The throughput overhead of Parquet modular encryption depends on whether AES enciphering
+is done in software or hardware. In both cases, performing encryption on full pages (~1MB
+buffers) instead of on much smaller individual data values causes AES to work at its
+maximal speed. Preliminary tests show Parquet modular encryption throughput overhead to be
+up to a few percent in Java 9
diff --git a/ b/
index c759be9..b048c77 100644
--- a/
+++ b/
@@ -114,8 +114,8 @@ chunks they are interested in.  The columns chunks should then be read
  ![File Layout](
 ## Metadata
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata.  All thrift structures are serialized using the TCompactProtocol.
+There are four types of metadata: file metadata, column (chunk) metadata, page
+header metadata and crypto metadata. All thrift structures are serialized using the TCompactProtocol.
  ![Metadata diagram](
@@ -217,6 +217,18 @@ The format is explicitly designed to separate the metadata from the data.
 allows splitting columns into multiple files, as well as having a single metadata
 file reference multiple parquet files.
+## Encryption
+Parquet files containing sensitive information can be protected by the modular
+encryption mechanism, which encrypts and authenticates the file data and metadata
+while preserving regular Parquet functionality (columnar projection,
+predicate pushdown, encoding and compression). The mechanism also enables column access
+control, by supporting encryption of different columns with different keys.
+Each Parquet module (footer, page headers, pages, column indexes, column metadata) is 
+encrypted separately. Then it is possible to fetch and decrypt the footer, find the 
+offset of a required page, fetch it and decrypt the data. 
+See []( for details.
 ## Configurations
 - Row group size: Larger row groups allow for larger column chunks which makes it
 possible to do larger sequential IO.  Larger groups also require more buffering in
diff --git a/doc/images/FileLayoutEncryptionEF.jpg b/doc/images/FileLayoutEncryptionEF.jpg
new file mode 100644
index 0000000..c589d9c
Binary files /dev/null and b/doc/images/FileLayoutEncryptionEF.jpg differ
diff --git a/doc/images/FileLayoutEncryptionPF.jpg b/doc/images/FileLayoutEncryptionPF.jpg
new file mode 100644
index 0000000..c5300c7
Binary files /dev/null and b/doc/images/FileLayoutEncryptionPF.jpg differ
