spark-issues mailing list archives

From "Peter Halliday (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8413) DirectParquetOutputCommitter doesn't clean up the file on task failure
Date Tue, 01 Mar 2016 16:34:18 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173985#comment-15173985 ]

Peter Halliday commented on SPARK-8413:
---------------------------------------

I'm not sure why we shouldn't use an abortTask function that removes the file if it exists,
so that the write can be retried.
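
The suggestion above can be sketched as follows. This is a hypothetical illustration, not the actual DirectParquetOutputCommitter code: `AbortTaskSketch` and `abortTask` are made-up names, and `java.nio.file` stands in for the Hadoop `FileSystem` API the real committer would use. The idea is simply that aborting a failed task deletes the half-written part file, so a retry does not hit "File already exists".

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch: on task failure, delete the partially written
// part file so the second attempt can recreate it from scratch.
object AbortTaskSketch {
  // Returns true if a file was present and removed, false if nothing existed.
  def abortTask(partFilePath: String): Boolean =
    Files.deleteIfExists(Paths.get(partFilePath))
}

object AbortTaskDemo {
  def main(args: Array[String]): Unit = {
    // Simulate a task that died mid-write, leaving a half-written part file.
    val tmp = Files.createTempFile("part-00056", ".parquet")
    Files.write(tmp, "half-written".getBytes)

    val removed = AbortTaskSketch.abortTask(tmp.toString)
    println(s"removed=$removed stillExists=${Files.exists(tmp)}")
  }
}
```

With cleanup like this in place, the retry's `FileSystem.create` call would find no leftover file and could proceed instead of throwing `IOException: File already exists`.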

> DirectParquetOutputCommitter doesn't clean up the file on task failure
> ----------------------------------------------------------------------
>
>                 Key: SPARK-8413
>                 URL: https://issues.apache.org/jira/browse/SPARK-8413
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Mingyu Kim
>            Priority: Critical
>
> Here are the steps that lead to the failure.
> 1. Write a DataFrame using DirectParquetOutputCommitter
> 2. 1st attempt fails during the writes. e.g. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L355
> 3. There is no clean-up logic on task failure, so the parquet part file written by the failed task is left half-written.
> 4. 2nd attempt fails with the following exception because the target file already exists.
> {noformat}
> 2015-06-15T15:37:32.703 WARN [task-result-getter-2] org.apache.spark.scheduler.TaskSetManager - Lost task 56.1 in stage 7.0 (TID 73125, <REDACTED>): java.io.IOException: File already exists:s3://<REDACTED>
>         at <REDACTED>
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731)
> 	at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
> 	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
> 	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:350)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

