parquet-commits mailing list archives

From fo...@apache.org
Subject [parquet-format] branch master updated: PARQUET-1610: Minor grammatical fixes (#132)
Date Tue, 25 Jun 2019 20:20:08 GMT
This is an automated email from the ASF dual-hosted git repository.

fokko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 96a8f31  PARQUET-1610: Minor grammatical fixes (#132)
96a8f31 is described below

commit 96a8f3172a3b895408d2d1b939200dd02ab8300d
Author: Umayah Abdennabi <abden003@gmail.com>
AuthorDate: Tue Jun 25 13:20:03 2019 -0700

    PARQUET-1610: Minor grammatical fixes (#132)
---
 PageIndex.md | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/PageIndex.md b/PageIndex.md
index 7ac6e42..551ef0c 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -19,14 +19,14 @@
 
 # ColumnIndex Layout to Support Page Skipping
 
-This documents describes the format for column index pages in the Parquet
+This document describes the format for column index pages in the Parquet
 footer. These pages contain statistics for DataPages and can be used to skip
 pages when scanning data in ordered and unordered columns.
 
 ## Problem Statement
 In previous versions of the format, Statistics are stored for ColumnChunks in
 ColumnMetaData and for individual pages inside DataPageHeader structs. When
-reading pages, a reader had to process the page header in order to determine
+reading pages, a reader had to process the page header to determine
 whether the page could be skipped based on the statistics. This means the reader
 had to access all pages in a column, thus likely reading most of the column
 data from disk.
@@ -34,21 +34,21 @@ data from disk.
 ## Goals
 1. Make both range scans and point lookups I/O efficient by allowing direct
    access to pages based on their min and max values. In particular:
-2. A single-row lookup in a rowgroup based on the sort column of that rowgroup
-   will only read one data page per retrieved column.
-    * Range scans on the sort column will only need to read the exact data
+2. A single-row lookup in a row group based on the sort column of that row group
+  will only read one data page per the retrieved column.
+    * Range scans on the sort column will only need to read the exact data 
       pages that contain relevant data.
     * Make other selective scans I/O efficient: if we have a very selective
       predicate on a non-sorting column, for the other retrieved columns we
       should only need to access data pages that contain matching rows.
 3. No additional decoding effort for scans without selective predicates, e.g.,
-   full-row group scans. If a reader determines that it does not need to read
+   full-row group scans. If a reader determines that it does not need to read 
    the index data, it does not incur any overhead.
 4. Index pages for sorted columns use minimal storage by storing only the
    boundary elements between pages.
 
 ## Non-Goals
-* Support for the equivalent of secondary indices, ie, an index structure
+* Support for the equivalent of secondary indices, i.e., an index structure
   sorted on the key values over non-sorted data.
 
 
@@ -64,9 +64,9 @@ We add two new per-column structures to the row group metadata:
   skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
   together.
 
-The new index structures are stored separately from RowGroup, near the footer,
-so that a reader does not have to pay the I/O and deserialization cost for
-reading the them if it is not doing selective scans. The index structures'
+The new index structures are stored separately from RowGroup, near the footer.  
+This is done so that a reader does not have to pay the I/O and deserialization 
+cost for reading them if it is not doing selective scans. The index structures'
 location and length are stored in ColumnChunk.
 
  ![Page Index Layout](doc/images/PageIndexLayout.png)
@@ -92,10 +92,7 @@ a binary search in `min_values` and `max_values`. For unordered columns, a
 reader can find matching pages by sequentially reading `min_values` and
 `max_values`.
 
-For range scans this approach can be extended to return ranges of rows, page
+For range scans, this approach can be extended to return ranges of rows, page
 indices, and page offsets to scan in each column. The reader can then
 initialize a scanner for each column and fast forward them to the start row of
 the scan.
-
-
-
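
The last hunk's context describes page lookup through `min_values` and `max_values`: a binary search for ordered columns and a sequential read for unordered ones. A minimal sketch of that idea follows. It assumes ascending boundary order, plain comparable Python values instead of encoded binary ones, and no null pages. The function names `pages_ordered` and `pages_unordered` are hypothetical and do not come from any Parquet library.

```python
from bisect import bisect_left, bisect_right


def pages_ordered(min_values, max_values, lo, hi):
    """Page indices that may hold values in [lo, hi].

    Assumes both lists are sorted ascending (ordered column).
    """
    # First page whose max is >= lo; earlier pages end before the range.
    first = bisect_left(max_values, lo)
    # One past the last page whose min is <= hi; later pages start after it.
    end = bisect_right(min_values, hi)
    return list(range(first, end))


def pages_unordered(min_values, max_values, lo, hi):
    """Same result for an unordered column, using a sequential scan."""
    return [
        i
        for i, (mn, mx) in enumerate(zip(min_values, max_values))
        if mx >= lo and mn <= hi
    ]


# Hypothetical per-page statistics for four pages of a sorted column.
mins = [1, 10, 20, 30]
maxs = [9, 19, 29, 39]
range_scan = pages_ordered(mins, maxs, 15, 22)
point_lookup = pages_ordered(mins, maxs, 10, 10)
```

A point lookup is the special case `lo == hi`. Mapping the resulting page indices to row ranges and file offsets would use the OffsetIndex, which this sketch leaves out.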

