impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Attila Jeges (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3989: Display skew warning for poorly formatted Parquet files
Date Thu, 15 Dec 2016 13:51:28 GMT
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3989: Display skew warning for poorly formatted Parquet files

Patch Set 6:

File be/src/exec/

PS6, Line 318: Return 
> Returns true if 'row_group' overlaps with 'split_range'.

PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> Why is this an invalid case?
See Michael's comment below.

PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end);
> Why is the case (split_start >= row_group_start && split_end <= row_group_e
If the row group spans multiple splits, then either
- L331 (split_start >= row_group_start && split_start < row_group_end) is TRUE


- L332 (split_end > row_group_start && split_end <= row_group_end) is TRUE,

so we return TRUE.

PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> If I understand it correctly, this function's sole purpose is to return TRU

I've introduced this function because the Parquet file's last split may contain only the footer
(or part of the footer) and no row groups at all. This can happen even if the Parquet file
is well-formatted, so we shouldn't set the 'misaligned_row_group_skipped' flag in L502.

PS6, Line 463: (row_group_idx_ == -1)
> nit: parenthesis isn't necessary.

PS6, Line 479: skipped all the row groups
> Mind adding a minor remark that we won't be in this path if there is at lea

PS6, Line 499:       if (CheckRowGroupOverlapsSplit(row_group, split_range)) {
             :         // If the row group overlaps the split but the mid-point does not fall
within the
             :         // split, we have a poorly formatted file.
             :         misaligned_row_group_skipped = true;
> misaligned_row_group_skipped |= CheckRowGroupOverlapSplit(row_group, split_

PS6, Line 546:  
> extra whitespace
File be/src/exec/hdfs-parquet-scanner.h:

PS6, Line 446:  Number of scanners
> Is it really number of scanners ? This counter can be bumped multiple times
I could be mistaken but I believe it is the number of scanners.

The scan node creates a separate scanner object for each split and calls the scanner's ProcessSplit()
function to process it. ProcessSplit() calls NextRowGroup() function to iterate over the row
groups that belong to the split.

The counter is bumped only when NextRowGroup() is called the first time on a split but it
cannot find a row group to return (because all the row groups in the split were misaligned).
Since each split is scanned by a different scanner, the counter will be bumped only once at
most per scanner.
File common/thrift/

PS6, Line 315: Row group size doesn't match HDFS block "
             :    "size.
> This would be clearer if you said:

PS6, Line 319: Parquet file '$0' is poorly formatted. Please change fs.s3a.block.size to match
the "
             :    "row group size of your file
> Here, it is a stretch to say that the Parquet file is poorly formatted. The

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <>
Gerrit-Reviewer: Attila Jeges <>
Gerrit-Reviewer: Michael Ho <>
Gerrit-Reviewer: Sailesh Mukil <>
Gerrit-Reviewer: Thomas Tauber-Marshall <>
Gerrit-HasComments: Yes

View raw message