spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21444) Fetch failure due to node reboot causes job failure
Date Mon, 17 Jul 2017 23:45:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Josh Rosen updated SPARK-21444:
-------------------------------
    Affects Version/s:     (was: 2.0.2)
                       2.3.0

> Fetch failure due to node reboot causes job failure
> ---------------------------------------------------
>
>                 Key: SPARK-21444
>                 URL: https://issues.apache.org/jira/browse/SPARK-21444
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Sital Kedia
>
> We started seeing this issue after merging the PR - https://github.com/apache/spark/pull/17955.

> This PR introduced a change to keep the map-output statuses only in the map output tracker
and whenever there is a fetch failure, the scheduler tries to invalidate the map-output statuses
by talking all the the block manager slave end point. However, in case the fetch failure is
because node reboot, the block manager slave end point would not be reachable and so the driver
fails the job as a result. See exception below -
> {code}
> 2017-07-15 05:20:50,255 WARN  storage.BlockManagerMaster (Logging.scala:logWarning(87))
- Failed to remove broadcast 10 with removeFromMaster = true - Connection reset by peer
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-07-15 05:20:50,275 ERROR scheduler.DAGSchedulerEventProcessLoop (Logging.scala:logError(91))
- DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> org.apache.spark.SparkException: Exception thrown in awaitResult
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>         at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143)
>         at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:271)
>         at org.apache.spark.broadcast.TorrentBroadcast.doDestroy(TorrentBroadcast.scala:165)
>         at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:111)
>         at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:98)
>         at org.apache.spark.ShuffleStatus.invalidateSerializedMapOutputStatusCache(MapOutputTracker.scala:197)
>         at org.apache.spark.ShuffleStatus.removeMapOutput(MapOutputTracker.scala:105)
>         at org.apache.spark.MapOutputTrackerMaster.unregisterMapOutput(MapOutputTracker.scala:420)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1324)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1715)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1673)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1662)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message