impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Attila Jeges (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3989: Display skew warning for poorly formatted Parquet files
Date Thu, 15 Dec 2016 13:51:28 GMT
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3989: Display skew warning for poorly formatted Parquet files
......................................................................


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS6, Line 318: Return 
> Returns true if 'row_group' overlaps with 'split_range'.
Done


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> Why is this an invalid case?
See Michael's comment below.


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end);
> Why is the case (split_start >= row_group_start && split_end <= row_group_e
If the row group spans multiple splits, then either
- L331 (split_start >= row_group_start && split_start < row_group_end) is TRUE

OR 

- L332 (split_end > row_group_start && split_end <= row_group_end) is TRUE,

so we return TRUE.


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> If I understand it correctly, this function's sole purpose is to return TRU
Correct.

I've introduced this function because the Parquet file's last split may contain only the footer
(or part of the footer) and no row groups at all. This can happen even if the Parquet file
is well-formatted, so we shouldn't set the 'misaligned_row_group_skipped' flag in L502.


PS6, Line 463: (row_group_idx_ == -1)
> nit: parenthesis isn't necessary.
Done


PS6, Line 479: skipped all the row groups
> Mind adding a minor remark that we won't be in this path if there is at lea
Done


PS6, Line 499:       if (CheckRowGroupOverlapsSplit(row_group, split_range)) {
             :         // If the row group overlaps the split but the mid-point does not fall
within the
             :         // split, we have a poorly formatted file.
             :         misaligned_row_group_skipped = true;
> misaligned_row_group_skipped |= CheckRowGroupOverlapSplit(row_group, split_
Done


PS6, Line 546:  
> extra whitespace
Done


http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS6, Line 446:  Number of scanners
> Is it really number of scanners ? This counter can be bumped multiple times
I could be mistaken but I believe it is the number of scanners.

The scan node creates a separate scanner object for each split and calls the scanner's ProcessSplit()
function to process it. ProcessSplit() calls NextRowGroup() function to iterate over the row
groups that belong to the split.

The counter is bumped only when NextRowGroup() is called the first time on a split but it
cannot find a row group to return (because all the row groups in the split were misaligned).
Since each split is scanned by a different scanner, the counter will be bumped only once at
most per scanner.


http://gerrit.cloudera.org:8080/#/c/5400/6/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS6, Line 315: Row group size doesn't match HDFS block "
             :    "size.
> This would be clearer if you said:
Done


PS6, Line 319: Parquet file '$0' is poorly formatted. Please change fs.s3a.block.size to match
the "
             :    "row group size of your file
> Here, it is a stretch to say that the Parquet file is poorly formatted. The
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5400
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message