spark-issues mailing list archives

From "Saisai Shao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16146) Spark application failed by Yarn preempting
Date Thu, 23 Jun 2016 07:10:16 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345942#comment-15345942 ]

Saisai Shao commented on SPARK-16146:
-------------------------------------

If it is due to preemption, the AM log will show the details of the preempted executors. Also,
preempted executors are not counted as failed executors, so the application should be fine.
Besides, from my understanding the exception you hit is expected when preemption occurs (some
RPC messages cannot be sent because the executor has already been preempted).
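
As an illustrative sketch only, not the actual Spark source: the behaviour described above amounts
to an exit-status check along the following lines, assuming the YARN client API is on the classpath
(countsAsFailure is a hypothetical helper name):

    import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

    // Sketch: decide whether a completed container should count toward the
    // failed-executor total; a PREEMPTED exit status is not treated as a failure.
    def countsAsFailure(status: ContainerStatus): Boolean =
      status.getExitStatus match {
        case ContainerExitStatus.SUCCESS   => false // clean exit
        case ContainerExitStatus.PREEMPTED => false // reclaimed by the scheduler, not a failure
        case _                             => true  // genuine executor failure
      }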

So from my understanding of the code, preemption should not lead to application failure. Would
you please provide some more information? It is hard to find the problem from this limited
exception stack trace alone.

Also, to my knowledge, YARN preemption should not happen frequently; it only kicks in when
another queue does not have enough resources for its applications to run. In your case preemption
happens quite frequently. Does that mean your YARN capacity scheduler is not well configured?

> Spark application failed by Yarn preempting
> -------------------------------------------
>
>                 Key: SPARK-16146
>                 URL: https://issues.apache.org/jira/browse/SPARK-16146
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: Amazon EC2, CentOS 6.6,
> Spark-1.6.1-bin-hadoop-2.6 (binary from the Spark official site), Hadoop 2.7.2, preemption
> and dynamic allocation enabled.
>            Reporter: Cong Feng
>
> Hi,
> We are setting up our Spark cluster on Amazon EC2. We are using Spark YARN client mode
> with Spark-1.6.1-bin-hadoop-2.6 (the binary from the Spark official site) and Hadoop 2.7.2. We
> also enable preemption, dynamic allocation, and spark.shuffle.service.enabled (see the
> configuration sketch below).
> During our tests we found that our Spark application frequently gets killed when preemption
> happens. Mostly it seems the driver is trying to send an RPC to an executor that has already
> been preempted; there are also some "connection reset by peer" exceptions that likewise cause
> the job to fail. Below are the typical exceptions we found:
> 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
> java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857:
java.nio.channels.ClosedChannelException
>         at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
>         at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
>         at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>         at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
>         at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
>         at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
>         at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
>         at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
>         at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
>         at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
> And 
> 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
> 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding
> when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.
> It happens with both the capacity scheduler and the fair scheduler. The weird thing is that
> when we rolled back to Spark 1.4.1, this issue magically disappeared and preemption worked
> smoothly.
> But we still want to deploy Spark 1.6.1. Is this a bug, or something we can fix? Any ideas
> would be very helpful to us.
> Thanks
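
For reference, a minimal sketch of the Spark-side configuration the reporter describes (dynamic
allocation together with the external shuffle service); the application name and executor bounds
are assumed values, not taken from the report:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the reported setup: dynamic allocation on YARN is used together
    // with the external shuffle service so shuffle data survives executor removal.
    val conf = new SparkConf()
      .setAppName("preemption-test")                     // assumed name
      .set("spark.shuffle.service.enabled", "true")      // as stated in the report
      .set("spark.dynamicAllocation.enabled", "true")    // as stated in the report
      .set("spark.dynamicAllocation.minExecutors", "1")  // assumed bound
      .set("spark.dynamicAllocation.maxExecutors", "50") // assumed bound
    val sc = new SparkContext(conf)

YARN-side preemption itself is enabled in the ResourceManager's scheduler configuration rather
than through SparkConf.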



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

