parquet-commits mailing list archives

From b...@apache.org
Subject parquet-format git commit: PARQUET-1031: Fix spelling errors, whitespace, GitHub urls
Date Wed, 11 Oct 2017 15:53:25 GMT
Repository: parquet-format
Updated Branches:
  refs/heads/master 84460c5a1 -> 3b04d86a2


PARQUET-1031: Fix spelling errors, whitespace, GitHub urls

Rebased pull request https://github.com/apache/parquet-format/pull/39, fixed a minor spelling
mistake and the travis-ci URLs (which also pointed to the Parquet/parquet-format one).

@Mistobaan, please let me know if you would like to reclaim the original pull request.

Author: Fabrizio (Misto) Milo <mistobaan@gmail.com>
Author: Anna Szonyi <szonyi@cloudera.com>

Closes #59 from commanderofthegrey/parquet-1031 and squashes the following commits:

e61c3b5 [Anna Szonyi] add back uncompressed_page_size
1cb8163 [Anna Szonyi] PARQUET-1031: Fix spelling errors, whitespace, GitHub urls
67f0064 [Fabrizio (Misto) Milo] explicit that the length has no sign
e901ded [Fabrizio (Misto) Milo] fix misspells
ceda268 [Fabrizio (Misto) Milo] remove spaces
e1f9479 [Fabrizio (Misto) Milo] fix mispell


Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/3b04d86a
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/3b04d86a
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/3b04d86a

Branch: refs/heads/master
Commit: 3b04d86a251fc9409b65df80c79c24de24298e1a
Parents: 84460c5
Author: Fabrizio (Misto) Milo <mistobaan@gmail.com>
Authored: Wed Oct 11 08:53:21 2017 -0700
Committer: Ryan Blue <blue@apache.org>
Committed: Wed Oct 11 08:53:21 2017 -0700

----------------------------------------------------------------------
 Encodings.md                   | 42 ++++++++--------
 README.md                      | 99 +++++++++++++++++++------------------
 src/main/thrift/parquet.thrift | 40 +++++++--------
 3 files changed, 91 insertions(+), 90 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/Encodings.md
----------------------------------------------------------------------
diff --git a/Encodings.md b/Encodings.md
index 7cf880e..0450588 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -27,9 +27,9 @@ This file contains the specification of all supported encodings.
 Supported Types: all
 
 This is the plain encoding that must be supported for types.  It is
-intended to be the simplest encoding.  Values are encoded back to back. 
+intended to be the simplest encoding.  Values are encoded back to back.
 
-The plain encoding is used whenever a more efficient encoding can not be used. It 
+The plain encoding is used whenever a more efficient encoding can not be used. It
 stores the data in the following format:
  - BOOLEAN: [Bit Packed](#RLE), LSB first
  - INT32: 4 bytes little endian
@@ -41,16 +41,16 @@ stores the data in the following format:
  - FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
 
 For native types, this outputs the data as little endian. Floating
-    point types are encoded in IEEE.  
+    point types are encoded in IEEE.
 
 For the byte array type, it encodes the length as a 4 byte little
 endian, followed by the bytes.
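
As a rough illustration of the PLAIN layout described above (a sketch only, not part of this patch; the helper names are made up):

```
import struct

def plain_encode_int32(values):
    # PLAIN INT32: each value as 4 bytes, little endian.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_byte_array(values):
    # PLAIN BYTE_ARRAY: 4-byte little-endian length prefix, then the bytes.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

# plain_encode_int32([1, 2]) == b"\x01\x00\x00\x00\x02\x00\x00\x00"
```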
 
 ### Dictionary Encoding (PLAIN_DICTIONARY = 2)
-The dictionary encoding builds a dictionary of values encountered in a given column. The
+The dictionary encoding builds a dictionary of values encountered in a given column. The
 dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
 using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size
-or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is
+or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is
 written first, before the data pages of the column chunk.
 
 Dictionary page format: the entries in the dictionary - in dictionary order - using the [plain](#PLAIN) encoding.
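
For intuition only, a toy sketch of the dictionary-building step (in the real format the dictionary page is PLAIN-encoded and the indices use the RLE/bit-packing hybrid; the function name is hypothetical):

```
def dictionary_encode(column):
    # Build the dictionary in first-seen order; the data pages then store
    # integer indices into this dictionary instead of the values themselves.
    dictionary, indices, seen = [], [], {}
    for v in column:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

# dictionary_encode(["a", "b", "a", "a"]) == (["a", "b"], [0, 1, 0, 0])
```
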
@@ -64,16 +64,16 @@ This encoding uses a combination of bit-packing and run length encoding to more
 The grammar for this encoding looks like this, given a fixed bit-width known in advance:
 ```
 rle-bit-packed-hybrid: <length> <encoded-data>
-length := length of the <encoded-data> in bytes stored as 4 bytes little endian
+length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)
 encoded-data := <run>*
-run := <bit-packed-run> | <rle-run>  
-bit-packed-run := <bit-packed-header> <bit-packed-values>  
-bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)  
-// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
-bit-pack-count := (number of values in this run) / 8  
-bit-packed-values := *see 1 below*  
-rle-run := <rle-header> <repeated-value>  
-rle-header := varint-encode( (number of times repeated) << 1)  
+run := <bit-packed-run> | <rle-run>
+bit-packed-run := <bit-packed-header> <bit-packed-values>
+bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1)
+// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8
+bit-pack-count := (number of values in this run) / 8
+bit-packed-values := *see 1 below*
+rle-run := <rle-header> <repeated-value>
+rle-header := varint-encode( (number of times repeated) << 1)
 repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)
 ```
 
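To make the header arithmetic above concrete, a small sketch assuming the varint is unsigned LEB128:

```
def varint(n):
    # Unsigned LEB128 encoding used for the run headers.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def rle_header(repeat_count):
    # rle-header := varint-encode((number of times repeated) << 1)
    return varint(repeat_count << 1)

def bit_packed_header(num_values):
    # bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1),
    # where bit-pack-count = num_values / 8 (num_values is a multiple of 8).
    return varint((num_values // 8) << 1 | 1)

# rle_header(1000) == b"\xd0\x0f"; bit_packed_header(8) == b"\x03"
```
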
@@ -82,14 +82,14 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
    though the order of the bits in each value remains in the usual order of most significant to least
    significant. For example, to pack the same values as the example in the deprecated encoding above:
 
-   The numbers 1 through 7 using bit width 3:  
+   The numbers 1 through 7 using bit width 3:
    ```
    dec value: 0   1   2   3   4   5   6   7
    bit value: 000 001 010 011 100 101 110 111
    bit label: ABC DEF GHI JKL MNO PQR STU VWX
    ```
-   
-   would be encoded like this where spaces mark byte boundaries (3 bytes):  
+
+   would be encoded like this where spaces mark byte boundaries (3 bytes):
    ```
    bit value: 10001000 11000110 11111010
    bit label: HIDEFABC RMNOJKLG VWXSTUPQ
@@ -114,13 +114,13 @@ This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is
 For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit,
 which is not the same as the [RLE/bit-packing](#RLE) hybrid.
 
-For example, the numbers 1 through 7 using bit width 3:  
+For example, the numbers 1 through 7 using bit width 3:
 ```
 dec value: 0   1   2   3   4   5   6   7
 bit value: 000 001 010 011 100 101 110 111
 bit label: ABC DEF GHI JKL MNO PQR STU VWX
 ```
-would be encoded like this where spaces mark byte boundaries (3 bytes):  
+would be encoded like this where spaces mark byte boundaries (3 bytes):
 ```
 bit value: 00000101 00111001 01110111
 bit label: ABCDEFGH IJKLMNOP QRSTUVWX
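
A quick Python check of the two packing orders against the 3-bit examples above (a sketch, only exercised for inputs whose total bit count is a whole number of bytes):

```
def pack_lsb_first(values, bit_width):
    # RLE/bit-packing hybrid order: values fill the output starting
    # from the least significant bit.
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    return acc.to_bytes(len(values) * bit_width // 8, "little")

def pack_msb_first(values, bit_width):
    # Deprecated BIT_PACKED order: values fill the output starting
    # from the most significant bit.
    acc = 0
    for v in values:
        acc = (acc << bit_width) | v
    return acc.to_bytes(len(values) * bit_width // 8, "big")

vals = list(range(8))  # the dec values 0..7 shown above, bit width 3
assert pack_lsb_first(vals, 3) == bytes([0b10001000, 0b11000110, 0b11111010])
assert pack_msb_first(vals, 3) == bytes([0b00000101, 0b00111001, 0b01110111])
```
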
@@ -141,7 +141,7 @@ The header is defined as follows:
  * the total value count is stored as a VLQ int
  * the first value is stored as a zigzag VLQ int
 
-Each block contains 
+Each block contains
 ```
 <min delta> <list of bitwidths of miniblocks> <miniblocks>
 ```
@@ -233,4 +233,4 @@ sequence of strings, store the prefix length of the previous entry plus the suff
 For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding.
 
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
-the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). 
+the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
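
For intuition, a rough sketch of the incremental encoding idea (in the format itself the prefix lengths then go through DELTA_BINARY_PACKED and the suffixes through DELTA_LENGTH_BYTE_ARRAY):

```
def incremental_encode(strings):
    # Store, for each entry, the length of the prefix shared with the
    # previous entry plus the remaining suffix.
    prev, out = "", []
    for s in strings:
        common = 0
        while common < min(len(prev), len(s)) and prev[common] == s[common]:
            common += 1
        out.append((common, s[common:]))
        prev = s
    return out

# incremental_encode(["apple", "apply", "banana"])
#   == [(0, "apple"), (4, "y"), (0, "banana")]
```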

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 786cddc..c01aec3 100644
--- a/README.md
+++ b/README.md
@@ -50,11 +50,11 @@ Java resources can be build using `mvn package`. The current stable version shou
 
 C++ thrift resources can be generated via make.
 
-Thrift can be also code-genned into any other thrift-supported language.
+Thrift can be also code-generated into any other thrift-supported language.
 
 ## Glossary
-  - Block (HDFS block): This means a block in HDFS and the meaning is 
-    unchanged for describing this file format.  The file format is 
+  - Block (HDFS block): This means a block in HDFS and the meaning is
+    unchanged for describing this file format.  The file format is
     designed to work well on top of HDFS.
 
   - File: A HDFS file that must include the metadata for the file.
@@ -73,7 +73,7 @@ Thrift can be also code-genned into any other thrift-supported language.
 
 Hierarchically, a file consists of one or more row groups.  A row group
 contains exactly one column chunk per column.  Column chunks contain one or
-more pages. 
+more pages.
 
 ## Unit of parallelization
   - MapReduce - File/Row Group
@@ -101,14 +101,14 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be
     4-byte length in bytes of file metadata (little endian)
     4-byte magic number "PAR1"
 
-In the above example, there are N columns in this table, split into M row 
-groups.  The file metadata contains the locations of all the column metadata 
-start locations.  More details on what is contained in the metadata can be found 
+In the above example, there are N columns in this table, split into M row
+groups.  The file metadata contains the locations of all the column metadata
+start locations.  More details on what is contained in the metadata can be found
 in the thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
 
-Readers are expected to first read the file metadata to find all the column 
+Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
 
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
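
As a rough sketch of how a reader would locate the metadata written at the end of the file (assuming only what the layout above states: a 4-byte little-endian metadata length followed by the magic "PAR1"; no error handling):

```
import struct

def read_footer_length(path):
    with open(path, "rb") as f:
        f.seek(-8, 2)  # last 8 bytes: 4-byte metadata length + "PAR1"
        length_bytes, magic = f.read(4), f.read(4)
    assert magic == b"PAR1", "not a Parquet file"
    return struct.unpack("<I", length_bytes)[0]
```
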
@@ -146,36 +146,37 @@ documented in
 [logical-types]: LogicalTypes.md
 
 ## Nested Encoding
-To encode nested columns, Parquet uses the Dremel encoding with definition and 
-repetition levels.  Definition levels specify how many optional fields in the 
+To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels.  Definition levels specify how many optional fields in the
 path for the column are defined.  Repetition levels specify at what repeated field
 in the path has the value repeated.  The max definition and repetition levels can
 be computed from the schema (i.e. how much nesting there is).  This defines the
 maximum number of bits required to store the levels (levels are defined for all
-values in the column).  
+values in the column).
 
 Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED.
 
 ## Nulls
-Nullity is encoded in the definition levels (which is run-length encoded).  NULL values 
-are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs
+Nullity is encoded in the definition levels (which is run-length encoded).  NULL values
+are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs
 would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else.  
+nothing else.
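
For a flat (non-nested) optional column the definition levels reduce to a null indicator; a toy sketch:

```
def definition_levels(values):
    # Max definition level is 1 for a flat optional column:
    # 0 means NULL, 1 means the value is present.
    return [0 if v is None else 1 for v in values]

# A column of 1000 NULLs gives 1000 zeros, which the RLE/bit-packing
# hybrid can store as a single run of (0, repeated 1000 times).
```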
 
 ## Data Pages
 For data pages, the 3 pieces of information are encoded back to back, after the page
-header.  We have the 
- - repetition levels data, 
- - definition levels data,  
- - encoded values.
+header.
+In order we have:
+ 1. repetition levels data
+ 1. definition levels data
+ 1. encoded values
 
-The value of `uncompressed_page_size` specified in the header is for all 3 pieces combined.
+The value of `uncompressed_page_size` specified in the header is for all the 3 pieces combined.
 
-The data for the data page is always required.  The definition and repetition levels
+The encoded values for the data page is always required.  The definition and repetition levels
 are optional, based on the schema definition.  If the column is not nested (i.e.
 the path to the column has length 1), we do not encode the repetition levels (it would
 always have the value 1).  For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level). 
+skipped (if encoded, it will always have the value of the max definition level).
 
 For example, in the case where the column is non-nested and required, the data in the
 page is only the encoded values.
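
A minimal sketch of how the three pieces relate to `uncompressed_page_size` (illustrative only):

```
def data_page_body(rep_levels, def_levels, values):
    # The three byte sections are written back to back after the page
    # header; uncompressed_page_size covers their combined length.
    body = rep_levels + def_levels + values
    return body, len(body)  # (page body, uncompressed_page_size)
```
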
@@ -183,53 +184,53 @@ page is only the encoded values.
 The supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
 
 ## Column chunks
-Column chunks are composed of pages written back to back.  The pages share a common 
-header and readers can skip over pages they are not interested in.  The data for the 
-page follows the header and can be compressed and/or encoded.  The compression and 
+Column chunks are composed of pages written back to back.  The pages share a common
+header and readers can skip over pages they are not interested in.  The data for the
+page follows the header and can be compressed and/or encoded.  The compression and
 encoding is specified in the page metadata.
 
 ## Checksumming
-Data pages can be individually checksummed.  This allows disabling of checksums at the 
+Data pages can be individually checksummed.  This allows disabling of checksums at the
 HDFS file level, to better support single row lookups.
 
 ## Error recovery
-If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt, 
-that column chunk is lost (but column chunks for this column in other row groups are 
-okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If 
-the data within a page is corrupt, that page is lost.  The file will be more 
+If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If
+the data within a page is corrupt, that page is lost.  The file will be more
 resilient to corruption with smaller row groups.
 
-Potential extension: With smaller row groups, the biggest issue is placing the file 
-metadata at the end.  If an error happens while writing the file metadata, all the 
-data written will be unreadable.  This can be fixed by writing the file metadata 
-every Nth row group.  
-Each file metadata would be cumulative and include all the row groups written so 
-far.  Combining this with the strategy used for rc or avro files using sync markers, 
-a reader could recover partially written files.  
+Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end.  If an error happens while writing the file metadata, all the
+data written will be unreadable.  This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far.  Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.
 
 ## Separating metadata and column data.
 The format is explicitly designed to separate the metadata from the data.  This
 allows splitting columns into multiple files, as well as having a single metadata
-file reference multiple parquet files.  
+file reference multiple parquet files.
 
 ## Configurations
-- Row group size: Larger row groups allow for larger column chunks which makes it 
-possible to do larger sequential IO.  Larger groups also require more buffering in 
-the write path (or a two pass write).  We recommend large row groups (512MB - 1GB). 
-Since an entire row group might need to be read, we want it to completely fit on 
-one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An 
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block 
+- Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO.  Larger groups also require more buffering in
+the write path (or a two pass write).  We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
 per HDFS file.
-- Data page size: Data pages should be considered indivisible so smaller data pages 
-allow for more fine grained reading (e.g. single row lookup).  Larger page sizes 
-incur less space overhead (less page headers) and potentially less parsing overhead 
-(processing headers).  Note: for sequential scans, it is not expected to read a page 
+- Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup).  Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers).  Note: for sequential scans, it is not expected to read a page
 at a time; this is not the IO chunk.  We recommend 8KB for page sizes.
 
 ## Extensibility
 There are many places in the format for compatible extensions:
 - File Version: The file metadata contains a version.
-- Encodings: Encodings are specified by enum and more can be added in the future.  
+- Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
 ## Contributing
@@ -238,7 +239,7 @@ Changes to this core format definition are proposed and discussed in depth on th
 
 ## Code of Conduct
 
-We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): <https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md>.
+We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): <https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md>.
 
 ## License
 Copyright 2013 Twitter, Cloudera and other contributors.

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/src/main/thrift/parquet.thrift
----------------------------------------------------------------------
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 38cddc7..f955347 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -84,12 +84,12 @@ enum ConvertedType {
    * Stored as days since Unix epoch, encoded as the INT32 physical type.
    *
    */
-  DATE = 6; 
+  DATE = 6;
 
-  /** 
-   * A time 
+  /**
+   * A time
    *
-   * The total number of milliseconds since midnight.  The value is stored 
+   * The total number of milliseconds since midnight.  The value is stored
    * as an INT32 physical type.
    */
   TIME_MILLIS = 7;
@@ -104,11 +104,11 @@ enum ConvertedType {
 
   /**
    * A date/time combination
-   * 
+   *
    * Date and time recorded as milliseconds since the Unix epoch.  Recorded as
    * a physical type of INT64.
    */
-  TIMESTAMP_MILLIS = 9; 
+  TIMESTAMP_MILLIS = 9;
 
   /**
    * A date/time combination
@@ -119,11 +119,11 @@ enum ConvertedType {
   TIMESTAMP_MICROS = 10;
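
A small sketch of the DATE and TIMESTAMP_MILLIS representations described above (illustrative helpers, not part of the Thrift definition):

```
from datetime import date, datetime, timezone

def encode_date(d):
    # DATE: days since the Unix epoch, INT32 physical type.
    return (d - date(1970, 1, 1)).days

def encode_timestamp_millis(dt):
    # TIMESTAMP_MILLIS: milliseconds since the Unix epoch, INT64 physical
    # type; pass a timezone-aware datetime so .timestamp() is unambiguous.
    return int(dt.timestamp() * 1000)

# encode_date(date(1970, 1, 2)) == 1
# encode_timestamp_millis(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)) == 1000
```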
 
 
-  /** 
-   * An unsigned integer value.  
-   * 
-   * The number describes the maximum number of meainful data bits in 
-   * the stored value. 8, 16 and 32 bit values are stored using the 
+  /**
+   * An unsigned integer value.
+   *
+   * The number describes the maximum number of meainful data bits in
+   * the stored value. 8, 16 and 32 bit values are stored using the
    * INT32 physical type.  64 bit values are stored using the INT64
    * physical type.
    *
@@ -147,29 +147,29 @@ enum ConvertedType {
   INT_32 = 17;
   INT_64 = 18;
 
-  /** 
+  /**
    * An embedded JSON document
-   * 
+   *
    * A JSON document embedded within a single UTF8 column.
    */
   JSON = 19;
 
-  /** 
+  /**
    * An embedded BSON document
-   * 
-   * A BSON document embedded within a single BINARY column. 
+   *
+   * A BSON document embedded within a single BINARY column.
    */
   BSON = 20;
 
   /**
    * An interval of time
-   * 
+   *
    * This type annotates data stored as a FIXED_LEN_BYTE_ARRAY of length 12
    * This data is composed of three separate little endian unsigned
    * integers.  Each stores a component of a duration of time.  The first
    * integer identifies the number of months associated with the duration,
    * the second identifies the number of days associated with the duration
-   * and the third identifies the number of milliseconds associated with 
+   * and the third identifies the number of milliseconds associated with
    * the provided duration.  This duration of time is independent of any
    * particular timezone or date.
    */
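
A small sketch of the INTERVAL layout spelled out in the comment above:

```
import struct

def encode_interval(months, days, millis):
    # FIXED_LEN_BYTE_ARRAY of length 12: three little-endian unsigned
    # 32-bit integers - months, days, milliseconds.
    return struct.pack("<III", months, days, millis)

# encode_interval(1, 2, 3000) == bytes.fromhex("01000000" "02000000" "b80b0000")
```
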
@@ -419,7 +419,7 @@ enum Encoding {
    */
   PLAIN_DICTIONARY = 2;
 
-  /** Group packed run length encoding. Usable for definition/reptition levels
+  /** Group packed run length encoding. Usable for definition/repetition levels
    * encoding and Booleans (on one bit: 0 is false; 1 is true.)
    */
   RLE = 3;
@@ -508,7 +508,7 @@ struct DictionaryPageHeader {
 }
 
 /**
- * New page format alowing reading levels without decompressing the data
+ * New page format allowing reading levels without decompressing the data
  * Repetition and definition levels are uncompressed
  * The remaining section containing the data is compressed if is_compressed is true
  **/

