impala-issues mailing list archives

From "Michael Ho (JIRA)" <>
Subject [jira] [Resolved] (IMPALA-5197) Parquet scan may incorrectly report "Corrupt Parquet file" in the logs
Date Tue, 09 May 2017 17:41:04 GMT


Michael Ho resolved IMPALA-5197.
       Resolution: Fixed
    Fix Version/s: Impala 2.9.0

IMPALA-5197: Erroneous corrupted Parquet file message
The Parquet file column reader may fail in the middle
of producing a scratch tuple batch for various reasons,
such as exceeding the memory limit or cancellation. In
that case, the scratch tuple batch may not have
materialized all the rows in a row group. We shouldn't
erroneously report that the file is corrupted in this
case, as the column reader didn't completely read the
entire row group.

A new test case is added to verify that we won't see this
error message. A new failpoint phase GETNEXT_SCANNER is
also added to differentiate it from the GETNEXT in the
scan node itself.

Change-Id: I9138039ec60fbe9deff250b8772036e40e42e1f6
Reviewed-by: Michael Ho <>
Tested-by: Impala Public Jenkins

> Parquet scan may incorrectly report "Corrupt Parquet file" in the logs
> ----------------------------------------------------------------------
>                 Key: IMPALA-5197
>                 URL:
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Michael Brown
>            Assignee: Michael Ho
>            Priority: Critical
>              Labels: stress
>             Fix For: Impala 2.9.0
> With IMPALA-5186, [~dhecht] noticed messages like:
> {noformat}
> I0407 12:57:05.306138 85140] Corrupt Parquet file 'hdfs://': column 'ps_partkey' had 1024 remaining values but expected 0
> {noformat}
> I spent a bit more time investigating this, and it seems possible but difficult to reproduce; it's non-deterministic from what I can tell.
> The stress test executes various {{COMPUTE STATS}} statements on the tables under test, with different {{MT_DOP}} settings, in conjunction with a memory limit that the stress test applies to each statement.
> Sometimes, it's possible to trigger these corrupt Parquet file warnings. When that happens, the {{COMPUTE STATS}} fails with "memory limit exceeded".
> For example, these queries reproduced the problem on the first try:
> {noformat}
> set mem_limit=1225m;
> set mt_dop=16;
> compute stats tpcds_300_decimal_parquet.store_sales;
> set mem_limit=527m;
> set mt_dop=4;
> compute stats tpcds_300_decimal_parquet.store_sales;
> {noformat}
> These memory limits are right on the edge of what the statement appears to need. Sometimes the statement would succeed completely; other times it would fail under the memory limit, but no corrupt-file messages were printed.

This message was sent by Atlassian JIRA
