spark-issues mailing list archives

From "Ariya Mizutani (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-22714) Spark API not responding when fatal exception occurs in event loop
Date Wed, 06 Dec 2017 12:15:00 GMT
Ariya Mizutani created SPARK-22714:
--------------------------------------

             Summary: Spark API not responding when fatal exception occurs in event loop
                 Key: SPARK-22714
                 URL: https://issues.apache.org/jira/browse/SPARK-22714
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Ariya Mizutani
            Priority: Critical


To reproduce, make Spark throw an OutOfMemoryError in the event loop:

{code:scala}
scala> spark.sparkContext.getConf.get("spark.driver.memory")
res0: String = 1g
scala> val a = new Array[Int](4 * 1000 * 1000)
scala> val ds = spark.createDataset(a)
scala> ds.rdd.zipWithIndex
[Stage 0:>                                                          (0 + 0) / 3]Exception
in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space
[Stage 0:>                                                          (0 + 0) / 3]
// Spark is not responding
{code}

While it is not responding, Spark is waiting on a Promise that is never completed. The Promise depends on work done in the event loop thread, but that thread is dead once a fatal exception has been thrown (a minimal illustration follows the stack trace below).
{noformat}
"main" #1 prio=5 os_prio=31 tid=0x00007ffc9300b000 nid=0x1703 waiting on condition [0x0000700000216000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007ad978eb8> (a scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
        at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
        at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
{noformat}
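
For illustration only, here is a minimal standalone sketch of that pattern (not Spark code; the thread and promise below merely stand in for the event loop thread and the Promise that DAGScheduler.runJob waits on). The waiter parks forever because the only thread that could complete the Promise has already died:

{code:scala}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

object PromiseHangDemo {
  def main(args: Array[String]): Unit = {
    // Stands in for the Promise that DAGScheduler.runJob waits on.
    val done = Promise[Unit]()

    // Stands in for the event loop thread: it dies from a fatal error
    // before it can ever complete the promise.
    val eventThread = new Thread(new Runnable {
      override def run(): Unit = {
        throw new OutOfMemoryError("simulated fatal error")
        // done.success(()) is never reached
      }
    })
    eventThread.start()

    // Like the "main" thread in the stack trace above: an unbounded wait
    // on a Promise that nothing will ever complete, so this parks forever.
    Await.ready(done.future, Duration.Inf)
  }
}
{code}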

I don't know how to fix this properly, but it seems we need to add fatal-error handling to EventLoop.run()
in core/EventLoop.scala; a rough sketch of that idea follows.
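
As a rough sketch of that idea, and only a sketch (the onReceive, onError, eventQueue, stopped, name, and logError members follow what EventLoop.scala declares, but this is not a tested patch): catch Throwable around the loop and forward fatal errors to onError, so the scheduler gets a chance to fail pending jobs instead of leaving callers parked on a never-completed Promise.

{code:scala}
import scala.util.control.NonFatal

// Sketch only, inside the event thread of org.apache.spark.util.EventLoop.
override def run(): Unit = {
  try {
    while (!stopped.get) {
      val event = eventQueue.take()
      try {
        onReceive(event)
      } catch {
        case NonFatal(e) =>
          try onError(e) catch {
            case NonFatal(e2) => logError(s"Unexpected error in $name", e2)
          }
      }
    }
  } catch {
    case _: InterruptedException => // normal stop() path, exit quietly
    case t: Throwable =>
      // Proposed addition: fatal errors (e.g. OutOfMemoryError) currently
      // kill this thread silently; forwarding them to onError would let the
      // DAGScheduler fail pending jobs so waiters are not parked forever.
      try onError(t) finally throw t
  }
}
{code}

Whether onError itself can run safely after an OutOfMemoryError is an open question; stopping the SparkContext or completing the pending waiters with the error may be more robust.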



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
