parquet-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ziva...@apache.org
Subject [parquet-format] branch master updated: PARQUET-1561: Inconsistencies in the Delta Encoding specification (#128)
Date Wed, 24 Apr 2019 15:53:44 GMT
This is an automated email from the ASF dual-hosted git repository.

zivanfi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 0622c46  PARQUET-1561: Inconsistencies in the Delta Encoding specification (#128)
0622c46 is described below

commit 0622c460f82ddb41f5d38992fd26a20de10744e1
Author: d-becker <beckerdaniel.dani@gmail.com>
AuthorDate: Wed Apr 24 17:53:39 2019 +0200

    PARQUET-1561: Inconsistencies in the Delta Encoding specification (#128)
    
    * Addressed most points in PARQUET-1561. We should add examples containing
    multiple miniblocks and blocks.
    
    * Defined ULEB128 and zigzag encoding.
    
    * Padding bits and bytes should be zero, but readers should not rely on it.
---
 Encodings.md | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/Encodings.md b/Encodings.md
index 9358b13..c9d21f7 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -153,41 +153,44 @@ repetition and definition levels.
 ### <a name="DELTAENC"></a>Delta Encoding (DELTA_BINARY_PACKED = 5)
 Supported Types: INT32, INT64
 
-This encoding is adapted from the Binary packing described in ["Decoding billions of integers
per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and
L. Boytsov
+This encoding is adapted from the Binary packing described in ["Decoding billions of integers
per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and
L. Boytsov.
+
+In delta encoding we make use of variable length integers for storing various numbers (not
the deltas themselves). For unsigned values, we use ULEB128, which is the unsigned version
of LEB128 ([https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128]). For signed values, we
use zigzag encoding ([https://developers.google.com/protocol-buffers/docs/encoding#signed-integers])
to map negative values to positive ones and apply ULEB128 on the result.
+
+Delta encoding consists of a header followed by blocks of delta encoded values binary packed.
Each block is made of miniblocks, each of them binary packed with its own bit width.
 
-Delta encoding consists of a header followed by blocks of delta encoded values binary packed.
Each block is made of miniblocks, each of them binary packed with its own bit width. When
there are not enough values to encode a full block we pad with zeros (added to the frame of
reference).
 The header is defined as follows:
 ```
 <block size in values> <number of miniblocks in a block> <total value count>
<first value>
 ```
- * the block size is a multiple of 128 stored as VLQ int
- * the miniblock count per block is a diviser of the block size stored as VLQ int the number
of values in the miniblock is a multiple of 32.
- * the total value count is stored as a VLQ int
- * the first value is stored as a zigzag VLQ int
+ * the block size is a multiple of 128; it is stored as a ULEB128 int
+ * the miniblock count per block is a divisor of the block size such that their quotient,
the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int
+ * the total value count is stored as a ULEB128 int
+ * the first value is stored as a zigzag ULEB128 int
 
 Each block contains
 ```
 <min delta> <list of bitwidths of miniblocks> <miniblocks>
 ```
- * the min delta is a VLQ int (we compute a minimum as we need positive integers for bit
packing)
+ * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers
for bit packing)
  * the bitwidth of each block is stored as a byte
  * each miniblock is a list of bit packed ints according to the bit width stored at the begining
of the block
 
-Having multiple blocks allows us to escape values and restart from a new base value.
+To encode a block, we will:
 
-To encode each delta block, we will:
+1. Compute the differences between consecutive elements. For the first element in the block,
use the last element in the previous block or, in the case of the first block, use the first
value of the whole sequence, stored in the header.
 
-1. Compute the deltas
+2. Compute the frame of reference (the minimum of the deltas in the block). Subtract this
min delta from all deltas in the block. This guarantees that all values are non-negative.
 
-2. Encode the first value as zigzag VLQ int
+3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed by the bit
widths of the miniblocks and the delta values (minus the min delta) bit packed per miniblock.
 
-3. For each block, compute the frame of reference(minimum of the deltas) for the deltas.
This guarantees
-all deltas are positive.
+Having multiple blocks allows us to adapt to changes in the data by changing the frame of
reference (the min delta) which can result in smaller values after the subtraction which,
again, means we can store them with a lower bit width.
 
-4. encode the frame of reference delta as VLQ int followed by the delta values (minus the
minimum) encoded as bit packed per miniblock.
+If there are not enough values to fill the last miniblock, we pad the miniblock so that its
length is always the number of values in a full miniblock multiplied by the bit width. The
values of the padding bits should be zero, but readers must accept paddings consisting of
arbitrary bits as well.
 
-Steps 2 and 3 are skipped if the number of values in the block is 1.
+If, in the last block, less than ```<number of miniblocks in a block>``` miniblocks
are needed to store the values, the bytes storing the bit widths of the unneeded miniblocks
are still present, their value should be zero, but readers must accept arbitrary values as
well. There are no additional padding bytes for the miniblock bodies though, as if their bit
widths were 0 (regardless of the actual byte values). The reader knows when to stop reading
by keeping track of the number of values read.
 
+The following examples use 8 as the block size to keep the examples short, but in real cases
it would be invalid.
 #### Example 1
 1, 2, 3, 4, 5
 
@@ -226,7 +229,7 @@ The encoded data is
 
 #### Characteristics
 This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE)
encoding is specifically used when the range of ints is small over the entire page, as is
true of repetition and definition levels. It uses a single bit width for the whole page.
-The delta encoding algorithm described above stores a bit width per mini block and is less
sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding
as a block containing all the same values will be bit packed to a zero bit width thus being
only a header.
+The delta encoding algorithm described above stores a bit width per miniblock and is less
sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding
as a block containing all the same values will be bit packed to a zero bit width thus being
only a header.
 
 ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
 


Mime
View raw message