impala-reviews mailing list archives

From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: [DOCS] Add general info about Parquet min/max optimization
Date Mon, 05 Jun 2017 14:51:07 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: [DOCS] Add general info about Parquet min/max optimization
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7068/1/docs/topics/impala_parquet.xml
File docs/topics/impala_parquet.xml:

PS1, Line 363: data block
Not sure what "data block" means. "each row group and data page" would be more precise.

I feel like the current text may confuse readers about what is in Parquet files in general
versus how Impala writes out files versus what Impala actually makes use of on the read path
right now.

Currently both Impala and other tools write out stats at both the row group and data page
level. Data pages are the finer granularity; row groups are much coarser. I think the
salient fact there is that there are typically a small number of row groups per file
(1 for Impala).

Impala currently only uses the row group-level statistics to skip over large parts of the
file at a time, but we have plans to use the page-level statistics.
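
To make the row group-level skipping concrete, here is a minimal sketch of the general idea.
It is not Impala's code: the RowGroupStats class and the row_groups_to_scan function are
hypothetical names made up for illustration, and it assumes a single integer column with a
simple range predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group's min/max stats on one column."""
    index: int
    min_value: int
    max_value: int


def row_groups_to_scan(stats: List[RowGroupStats], lower: int, upper: int) -> List[int]:
    """Return indexes of row groups that may hold values in [lower, upper]."""
    keep = []
    for rg in stats:
        # Skip a row group only when its stats prove that no row can match.
        if rg.max_value < lower or rg.min_value > upper:
            continue
        keep.append(rg.index)
    return keep


# Illustrative data only: three row groups with non-overlapping value ranges.
stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
]
print(row_groups_to_scan(stats, 150, 180))  # prints [1]
```

The same overlap test would apply to page-level statistics; only the unit being skipped
changes from a row group to a data page.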


PS1, Line 366: whether the file
"parts of each file", because it could be a data page or row group.


-- 
To view, visit http://gerrit.cloudera.org:8080/7068
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5fd5f7b157024f6089af7feffcb538c160bb130d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes
