impala-issues mailing list archives

From "Michael Ho (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-5197) Parquet scan may incorrectly report "Corrupt Parquet file" in the logs
Date Tue, 09 May 2017 17:41:04 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ho resolved IMPALA-5197.
--------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.9.0

IMPALA-5197: Erroneous corrupted Parquet file message
The Parquet file column reader may fail in the middle
of producing a scratch tuple batch for various reasons,
such as exceeding the memory limit or cancellation. In
that case, the scratch tuple batch may not have
materialized all the rows in a row group. We shouldn't
erroneously report that the file is corrupted, as the
column reader didn't read the entire row group.

A new test case is added to verify that we won't see this
error message. A new failpoint phase GETNEXT_SCANNER is
also added to differentiate it from the GETNEXT in the
scan node itself.
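The guard described above can be sketched as follows. This is a minimal illustration only; the function and parameter names are invented for this sketch and are not Impala's actual internal API:

```cpp
#include <cassert>

// Hypothetical sketch of the IMPALA-5197 fix: leftover values in a column
// chunk only indicate a corrupt file when the column reader actually
// consumed the entire row group. If the scan aborted early (e.g. memory
// limit exceeded or query cancellation), remaining values are expected
// and no corruption warning should be logged.
bool ShouldReportCorruption(long remaining_values, bool row_group_fully_read) {
  return remaining_values > 0 && row_group_fully_read;
}
```

Under this sketch, the log message quoted below ("had 1024 remaining values but expected 0") would be suppressed whenever the scan terminated before the row group was fully read.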

Change-Id: I9138039ec60fbe9deff250b8772036e40e42e1f6
Reviewed-on: http://gerrit.cloudera.org:8080/6787
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins

> Parquet scan may incorrectly report "Corrupt Parquet file" in the logs
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-5197
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5197
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Michael Brown
>            Assignee: Michael Ho
>            Priority: Critical
>              Labels: stress
>             Fix For: Impala 2.9.0
>
>
> With IMPALA-5186, [~dhecht] noticed messages like:
> {noformat}
> I0407 12:57:05.306138 85140 status.cc:114] Corrupt Parquet file 'hdfs://vc0332.halxg.cloudera.com:8020/user/hive/warehouse/tpch_100_parquet.db/partsupp/3444dbb2ccec395e-45da764500000007_1009013170_data.0.parq': column 'ps_partkey' had 1024 remaining values but expected 0
> {noformat}
> I spent a bit more time investigating this, and it seems possible but difficult to reproduce, though it's non-deterministic from what I can tell.
> The stress test executes various {{COMPUTE STATS}} statements on the tables under test, with different {{MT_DOP}} settings. This is also in conjunction with a memory limit which the stress test applies to each statement.
> Sometimes, it's possible to trigger these corrupt parquet file warnings. When that happens, the {{COMPUTE STATS}} fails with "memory limit exceeded".
> For example, these queries reproduced the problem on the first try:
> {noformat}
> set mem_limit=1225m;
> set mt_dop=16;
> compute stats tpcds_300_decimal_parquet.store_sales;
> set mem_limit=527m;
> set mt_dop=4;
> compute stats tpcds_300_decimal_parquet.store_sales;
> {noformat}
> These memory limits are right on the edge of what the statement apparently needs. Sometimes the statement would appear to succeed completely; other times it would fail under the memory limits, but no corrupt-file messages were printed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
