spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheng Lian (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
Date Wed, 07 Nov 2018 17:35:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678542#comment-16678542
] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-------------------------------------------------------------

Hey, [~andrioni], if you still have the original (potentially) corrupted Parquet files at
hand, could you please try reading them again with Spark 2.4 but with {{spark.sql.parquet.enableVectorizedReader}}
set to {{false}}? In this way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader.
If it works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is really a file
corruption issue. Unless somehow Spark 2.4 is reading more columns/row groups than Spark 2.2.1
for the same job, and those extra columns/row groups happened to contain some corrupted data,
which would also indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted Parquet files at
hand, could you please try reading them again with Spark 2.4 but with {{spark.sql.parquet.enableVectorizedReader}}
set to {{false}}? In this way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader.
If it works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is really a file
corruption issue. Unless somehow Spark 2.4 is reading more column(s)/row group(s) than Spark
2.2.1 for the same job, and those extra column(s)/row group(s) happened to contain some corrupted
data, which would also indicate an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-25966
>                 URL: https://issues.apache.org/jira/browse/SPARK-25966
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on top of
a Mesos cluster. Both input and output Parquet files are on S3.
>            Reporter: Alessandro Andrioni
>            Priority: Major
>
> I was persistently getting the following exception while trying to run one Spark job
we have using Spark 2.4.0. It went away after I regenerated from scratch all the input Parquet
files (generated by another Spark job also using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> 	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> 	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> 	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> 	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> 	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> 	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> 	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> 	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
> 	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
> 	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
> 	(...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 312
in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 in stage 682.0 (TID 235229,
10.130.29.78, executor 77): java.io.EOFException: Reached the end of stream with 996 bytes
left to read
> 	at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
> 	at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
> 	at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> 	at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
> 	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
> 	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage109.scan_nextBatch_0$(Unknown
Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage109.processNext(Unknown
Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> 	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:121)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> This job used to work fine with Spark 2.2.1, and succeeded once we regenerated the inputs.
This is also one of three jobs that had this issue out of the 6000+ we tested.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message