spark-issues mailing list archives

From "Simeon Simeonov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-9345) Failure to cleanup on exceptions causes persistent I/O problems later on
Date Sat, 25 Jul 2015 16:24:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simeon Simeonov updated SPARK-9345:
-----------------------------------
    Description: 
When using spark-shell in local mode, I've observed the following behavior on a number of nodes:

# Some operation generates an exception related to Spark SQL processing via {{HiveContext}}.
# From that point on, nothing can be written to Hive with {{saveAsTable}} (a sketch of such a session follows this list).
# Another identically-configured version of Spark on the same machine may not exhibit the problem initially but, after enough exceptions, it also starts exhibiting it.
# A fresh, identically-configured installation of the same version on the same machine exhibits the problem as well.
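
For concreteness, the kind of session referred to above is roughly the following. This is an illustrative sketch, not taken from the gist linked below: the table name {{test}} is chosen to match the path in the error, the data is made up, and it is written against the 1.4 API, assuming {{sqlContext}} is the {{HiveContext}} that spark-shell provides in a Hive-enabled build.

{code}
// spark-shell, local mode; sqlContext is a HiveContext in a Hive-enabled build
import sqlContext.implicits._

// Made-up data; only the saveAsTable call matters for this report.
val df = sc.parallelize(1 to 100).map(i => (i, s"row_$i")).toDF("id", "value")

// On a healthy installation this succeeds. After earlier Spark SQL exceptions
// in the same installation, the same call fails with the IOException below.
df.write.format("parquet").saveAsTable("test")
{code}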

The error is always related to an inability to create a temporary folder on HDFS:

{code}
15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_000001_0
(exists=false, cwd=file:/home/ubuntu)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
	at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
	at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
	at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
	at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
	at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
	at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
	at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
        ...
{code}
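
In case it helps with triage: the failing path in the trace carries a {{file:}} scheme (and cwd is {{file:/home/ubuntu}}), so one quick check from the same shell is to see which filesystem the warehouse path actually resolves to. A sketch, assuming the default {{sc}} in spark-shell and the warehouse path taken from the trace above:

{code}
import org.apache.hadoop.fs.Path

// Path copied from the stack trace; adjust if the table location differs.
val warehousePath = new Path("/user/hive/warehouse/test")
val fs = warehousePath.getFileSystem(sc.hadoopConfiguration)

println(fs.getUri)                                   // hdfs://... vs file:///
println(sc.hadoopConfiguration.get("fs.defaultFS"))  // fs.default.name on older Hadoop
println(fs.getWorkingDirectory)                      // compare with cwd in the error
{code}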

The behavior does not seem related to HDFS itself, as it persists even after the HDFS volume is reformatted.


The behavior is difficult to reproduce reliably, but it is consistently observable after enough Spark SQL experimentation (dozens of exceptions arising from Spark SQL processing). The likelihood of it happening goes up substantially if a Spark SQL operation runs out of memory, which suggests that the problem is related to cleanup after failures.

This gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) shows how, on the same machine, identically-configured 1.3.1 and 1.4.1 installations sharing the same HDFS filesystem and Hive metastore behave differently: 1.3.1 can write to Hive while 1.4.1 cannot. The behavior started on 1.4.1 after an out-of-memory exception during a large job.
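
For orientation when comparing the two versions (not taken from the gist): the write call looks slightly different on each, because the {{DataFrameWriter}} API only exists as of 1.4, but both writes should end up in the same Hive warehouse location.

{code}
// Spark 1.3.1: DataFrame.saveAsTable (deprecated in 1.4)
df.saveAsTable("test")

// Spark 1.4.1: the DataFrameWriter API
df.write.format("parquet").saveAsTable("test")
{code}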


> Failure to cleanup on exceptions causes persistent I/O problems later on
> ------------------------------------------------------------------------
>
>                 Key: SPARK-9345
>                 URL: https://issues.apache.org/jira/browse/SPARK-9345
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.4.0, 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov
>



