spark-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Corrupt parquet file
Date Tue, 13 Feb 2018 12:18:19 GMT


On 12 Feb 2018, at 20:21, Ryan Blue <rblue@netflix.com> wrote:

I wouldn't say we have a primary failure mode that we deal with. What we concluded was that
all the schemes we came up with to avoid corruption couldn't cover all cases. For example,
what about when memory holding a value is corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption and remove it from our clusters and
let Amazon know to remove the instance from the hardware pool. We also structure our ETL so
we have some time to reprocess.


I see.

I could remove memory/disk buffering of the blocks as a source of corruption, leaving only working-memory failures that somehow get past ECC, or bus errors of some form.
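The idea of taking buffering out of the corruption equation can be sketched as a checksum taken at hand-off time and re-verified before the block is persisted, so a bit-flip that happens while the block sits in a memory or disk buffer is caught rather than written out. A minimal illustration (the class and method names here are hypothetical, not the actual HADOOP-15224 patch):

```java
import java.util.zip.CRC32;

// Hypothetical sketch: guard a buffered block with a CRC computed when
// the block is handed to the buffer, and re-verify it before the bytes
// are written out. Corruption introduced while the block is buffered
// is then detected instead of silently persisted.
public class ChecksummedBlock {
    private final byte[] data;
    private final long expectedCrc;

    public ChecksummedBlock(byte[] data) {
        this.data = data;
        this.expectedCrc = crcOf(data);
    }

    private static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Returns true if the buffered bytes still match the checksum
    // taken at hand-off time.
    public boolean verify() {
        return crcOf(data) == expectedCrc;
    }
}
```

As Ryan notes above, this still can't catch a value corrupted in working memory before the checksum is computed; it only narrows the window to the buffer itself.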

Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add this to the to-do list for Hadoop >= 3.2.



