spark-issues mailing list archives

From "Ariya Mizutani (JIRA)" <>
Subject [jira] [Created] (SPARK-22714) Spark API Not responding when Fatal exception occurred in event loop
Date Wed, 06 Dec 2017 12:15:00 GMT
Ariya Mizutani created SPARK-22714:

             Summary: Spark API Not responding when Fatal exception occurred in event loop
                 Key: SPARK-22714
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Ariya Mizutani
            Priority: Critical

To reproduce, make Spark throw an OutOfMemoryError in the event loop:

scala> spark.sparkContext.getConf.get("spark.driver.memory")
res0: String = 1g
scala> val a = new Array[Int](4 * 1000 * 1000)
scala> val ds = spark.createDataset(a)
scala> ds.rdd.zipWithIndex
[Stage 0:>                                                          (0 + 0) / 3]Exception
in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space
[Stage 0:>                                                          (0 + 0) / 3]
// Spark is not responding

While unresponsive, Spark is blocked waiting on a Promise that is never completed.
The Promise depends on work done by the event-loop thread, but that thread dies when a fatal
exception is thrown, so the Promise can never be completed.
"main" #1 prio=5 os_prio=31 tid=0x00007ffc9300b000 nid=0x1703 waiting on condition [0x0000700000216000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007ad978eb8> (a scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
        at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
        at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)

I don't know how to fix it properly, but it seems we need to add fatal-error handling
in core/EventLoop.scala.
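To illustrate the failure mode, here is a minimal, self-contained sketch (not Spark's actual code; ToyEventLoop and all names below are hypothetical). A single event-loop thread drains a queue and completes a Promise per message. If only NonFatal exceptions are caught, a fatal error such as OutOfMemoryError kills the thread and leaves the pending Promise forever incomplete, which is what the caller in DAGScheduler.runJob is stuck waiting on above. Catching Throwable and failing the Promise before re-throwing would at least unblock the caller:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Simplified stand-in for an event loop: one thread, one queue,
// one Promise completed per message.
class ToyEventLoop(onReceive: String => Int) {
  private val queue = new LinkedBlockingQueue[(String, Promise[Int])]()

  private val thread = new Thread("toy-event-loop") {
    override def run(): Unit = {
      while (true) {
        val (msg, promise) = queue.take()
        try promise.success(onReceive(msg))
        catch {
          case NonFatal(e) => promise.failure(e)
          // Without this case, a fatal error (e.g. OutOfMemoryError)
          // kills the thread and leaves `promise` incomplete forever,
          // hanging anyone blocked in Await on its future.
          case fatal: Throwable =>
            promise.failure(fatal)
            throw fatal
        }
      }
    }
  }
  thread.setDaemon(true)
  thread.start()

  def post(msg: String): Promise[Int] = {
    val p = Promise[Int]()
    queue.put((msg, p))
    p
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val loop = new ToyEventLoop(msg =>
      if (msg == "boom") throw new OutOfMemoryError("simulated") else msg.length)

    // Normal message completes the Promise.
    println(Await.result(loop.post("hello").future, 1.second))

    // Fatal error still kills the loop thread, but the caller now
    // sees a failed future instead of blocking forever.
    val failed = loop.post("boom").future
    Await.ready(failed, 1.second)
    println(failed.value)
  }
}
```

This only unblocks the waiting caller; whether Spark should additionally shut down or restart the event loop after a fatal error is a separate design question.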

This message was sent by Atlassian JIRA
