parquet-commits mailing list archives

From b...@apache.org
Subject [parquet-format] branch master updated: PARQUET-1290: clarify run lengths for RLE encoding (#96)
Date Mon, 07 May 2018 16:51:08 GMT
This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 709e25e  PARQUET-1290: clarify run lengths for RLE encoding (#96)
709e25e is described below

commit 709e25e9abe05f648dd96d99518141791ba94101
Author: Tim Armstrong <tim.g.armstrong@gmail.com>
AuthorDate: Mon May 7 09:51:05 2018 -0700

    PARQUET-1290: clarify run lengths for RLE encoding (#96)
---
 Encodings.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/Encodings.md b/Encodings.md
index f3b8d50..9358b13 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -72,12 +72,14 @@ length := length of the <encoded-data> in bytes stored as 4 bytes little endian
 encoded-data := <run>*
 run := <bit-packed-run> | <rle-run>
 bit-packed-run := <bit-packed-header> <bit-packed-values>
-bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)
+bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)
 // we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
-bit-pack-count := (number of values in this run) / 8
+bit-pack-scaled-run-len := (bit-packed-run-len) / 8
+bit-packed-run-len := *see 3 below*
 bit-packed-values := *see 1 below*
 rle-run := <rle-header> <repeated-value>
-rle-header := varint-encode( (number of times repeated) << 1)
+rle-header := varint-encode( (rle-run-len) << 1)
+rle-run-len := *see 3 below*
 repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
 ```
 
@@ -107,6 +109,13 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
 
 2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128
 
+3. bit-packed-run-len and rle-run-len must be in the range \[1, 2<sup>31</sup> - 1\].
+   This means that a Parquet implementation can always store the run length in a signed
+   32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier
+   specifications, but longer runs were not readable by the most common Parquet
+   implementations so, in practice, were not safe for Parquet writers to emit.
+
+
 Note that the RLE encoding method is only supported for the following types of
 data:
 

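The run-header grammar and the new length restriction in the diff above can be illustrated with a minimal Python sketch. It is not taken from any Parquet implementation: the names `varint_encode`, `varint_decode`, `check_run_len`, `rle_header`, `bit_packed_header`, and `decode_header` are hypothetical helpers chosen for this example, and rejecting out-of-range lengths with `ValueError` is an assumption about how a reader or writer might enforce note 3.

```python
MAX_RUN_LEN = 2**31 - 1  # upper bound on run lengths stated in note 3 of the diff


def varint_encode(value):
    """ULEB-128 encode a non-negative integer (note 2 of Encodings.md)."""
    if value < 0:
        raise ValueError("varint_encode expects a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(buf, pos=0):
    """Decode one ULEB-128 integer from buf at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def check_run_len(run_len):
    """Enforce the [1, 2^31 - 1] range (hypothetical enforcement choice)."""
    if not 1 <= run_len <= MAX_RUN_LEN:
        raise ValueError("run length %d outside [1, 2^31 - 1]" % run_len)


def rle_header(rle_run_len):
    """rle-header := varint-encode((rle-run-len) << 1)"""
    check_run_len(rle_run_len)
    return varint_encode(rle_run_len << 1)


def bit_packed_header(bit_packed_run_len):
    """bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1)"""
    check_run_len(bit_packed_run_len)
    if bit_packed_run_len % 8:
        raise ValueError("bit-packed runs hold a multiple of 8 values")
    return varint_encode((bit_packed_run_len // 8) << 1 | 1)


def decode_header(buf, pos=0):
    """Return (kind, run_len, next_pos) for the run header starting at pos."""
    header, pos = varint_decode(buf, pos)
    if header & 1:
        kind, run_len = "bit-packed", (header >> 1) * 8
    else:
        kind, run_len = "rle", header >> 1
    check_run_len(run_len)
    return kind, run_len, pos


print(rle_header(10).hex())                  # header for an RLE run of 10 values
print(decode_header(bit_packed_header(16)))  # round trip for a bit-packed run
```

Because every accepted run length fits in a signed 32-bit integer, the longest RLE header in this sketch encodes `(2**31 - 1) << 1`, which occupies five varint bytes.
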
-- 
To stop receiving notification emails like this one, please contact
blue@apache.org.
