spark-dev mailing list archives

From Dong Jiang <dji...@dataxu.com>
Subject Re: Corrupt parquet file
Date Mon, 05 Feb 2018 18:38:40 GMT
Thanks for the response, Ryan.
We have transient EMR clusters, and we do rerun a cluster whenever it fails. In this
particular case, however, the cluster succeeded without reporting any errors. I was able to null
out the corrupted column and recover the remaining 133 columns. I do feel the issue occurs
more often than once or twice a year: this is the second occurrence I am aware of within
a month, and we certainly don't run data infrastructure as large as Netflix's.
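
For reference, the recovery was roughly the following (a minimal sketch; the column name is
taken from the error below, and the S3 paths are placeholders):

import org.apache.spark.sql.functions.lit

val df = spark.read.parquet("s3://bucket/table/")  // placeholder input path
val badCol = "incoming_aliases_array"              // the corrupt column from the error below
// Replacing the column with a typed null literal lets Spark prune the corrupt
// column from the Parquet scan, so its pages are never decoded.
val recovered = df.withColumn(badCol, lit(null).cast(df.schema(badCol).dataType))
recovered.write.parquet("s3://bucket/table_recovered/")  // placeholder output path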

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue <rblue@netflix.com>
Reply-To: "rblue@netflix.com" <rblue@netflix.com>
Date: Monday, February 5, 2018 at 1:34 PM
To: Dong Jiang <djiang@dataxu.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: Corrupt parquet file

We ensure the bad node is removed from our cluster and reprocess to replace the data. We only
see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive and still unreliable
if you don't trust the hardware.
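
(As a sketch of the cheapest form of this, with hypothetical paths: read the output back and
force every column to decode before declaring the write good. Note that a plain count() on the
DataFrame can be satisfied without decoding column values, so it would not catch this.)

// Read-back validation sketch; a corrupt page raises ParquetDecodingException
// here, before any downstream consumer sees the output.
val readBack = spark.read.parquet("s3://bucket/table/")  // hypothetical output path
readBack.rdd.count()  // materializing full Rows forces every column to decode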

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <djiang@dataxu.com> wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we have encountered this issue. We have a wide table, with 134 columns
in the file. The issue seems to impact only one column and is very hard to detect. It seems you
have encountered this issue before; what do you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue <rblue@netflix.com>
Reply-To: "rblue@netflix.com" <rblue@netflix.com>
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang <djiang@dataxu.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the write task ran.
Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, the node will
probably start failing hardware checks when the instance hardware is reused, so I wouldn't
worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to the point where
_SUCCESS is created. I wouldn't rely on _SUCCESS to indicate actual job success (tasks that run
afterward could still fail), and it carries no guarantee about the data that was written.
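
(For what it's worth, checking for the marker itself is trivial; a sketch with a placeholder
path, using the Hadoop FileSystem API:)

import org.apache.hadoop.fs.{FileSystem, Path}

val out = new Path("s3://bucket/table/")  // placeholder output path
val fs = FileSystem.get(out.toUri, spark.sparkContext.hadoopConfiguration)
// True only means the committer finished; it says nothing about the bytes written.
val committed = fs.exists(new Path(out, "_SUCCESS"))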

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <djiang@dataxu.com> wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the logs or EMR events suggests any issues
with the cluster or the nodes. We also see the _SUCCESS file on S3. If we see the _SUCCESS
file, does that suggest all the data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rblue@netflix.com>
Reply-To: "rblue@netflix.com" <rblue@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <djiang@dataxu.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad
node. You should try to find out where the file was written and remove that node as soon as
possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding
is very dense, and corruption in encoded values often just looks like different data. When you
see a decoding exception like this, we usually find that the compressed data was corrupted and
is no longer valid. You can look for the page of data based on the value counter, but that's
about it.
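
(If you do want to look, the footer metadata maps values to row groups and column chunks; a
sketch against the Parquet metadata API, using the file from your stack trace below:)

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val path = new Path("file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet")
val footer = ParquetFileReader.readFooter(new Configuration(), path)

var firstRow = 0L
for (block <- footer.getBlocks.asScala) {
  println(s"row group: rows $firstRow to ${firstRow + block.getRowCount - 1}")
  for (col <- block.getColumns.asScala) {
    // per-chunk value counts let you bracket which page holds value 40870
    println(s"  ${col.getPath}: ${col.getValueCount} values, ${col.getTotalSize} bytes")
  }
  firstRow += block.getRowCount
}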

Even if you could find a single record that was affected, that's not valuable because you
don't know whether there is other corruption that is undetectable. There's nothing to reliably
recover here. What we do in this case is find and remove the bad node, then reprocess data
so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <djiang@dataxu.com> wrote:
Hi,

We are running Spark 2.2.1, generating Parquet files with the following
pseudo code:
df.write.parquet(...)
We have recently noticed Parquet file corruption when reading the files back
in Spark or Presto, as in the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears that only one column in one of the rows in the file is corrupt; the
file has 111041 rows.
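
(As a hypothetical way to confirm the scope: selecting only the suspect column forces just its
pages to decode, while excluding it reads cleanly. A sketch:)

val df = spark.read.parquet("file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet")
// Decoding only this column throws the exception above; dropping it does not.
df.select("incoming_aliases_array").rdd.count()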

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong






--
Ryan Blue
Software Engineer
Netflix


