spark-issues mailing list archives

From "Stavros Kontopoulos (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-24641) Spark-Mesos integration doesn't respect request to abort itself
Date Sun, 24 Jun 2018 21:44:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521690#comment-16521690
] 

Stavros Kontopoulos edited comment on SPARK-24641 at 6/24/18 9:43 PM:
----------------------------------------------------------------------

[~igor.berman] Obviously the connection refusal is not handled properly; I think it should be.
It means that the service is not available at all, so you can't even do any RPC communication.
In that case you can do different things, as I mentioned earlier. For example, when the block manager
tries to connect, it retries a number of times before failing. AFAIK the onFailure logic captures
different types of errors.
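
To make that retry idea concrete, here is a minimal sketch (not Spark's actual code; the helper name
and the retry parameters are made up for illustration) of attempting a refused connection a few times
before giving up, similar in spirit to what the block manager does:

{code:scala}
import scala.util.{Failure, Success, Try}

// Minimal retry helper: attempt `op` up to `attempts` times, sleeping
// `backoffMs` between failures, and rethrow the last error if all attempts fail.
def withRetries[T](attempts: Int, backoffMs: Long)(op: => T): T = {
  Try(op) match {
    case Success(result) => result
    case Failure(_) if attempts > 1 =>
      // e.g. a Connection refused while Marathon is still provisioning
      // the external shuffle service on that node
      Thread.sleep(backoffMs)
      withRetries(attempts - 1, backoffMs)(op)
    case Failure(e) => throw e
  }
}

// Hypothetical usage around the call that fails in the stack trace below
// (parameter names are illustrative):
// withRetries(attempts = 5, backoffMs = 2000) {
//   shuffleClient.registerDriverWithShuffleService(host, port, timeoutMs, intervalMs)
// }
{code}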

For the scenario you describe, how about starting the shuffle service before the Mesos agent process
comes up? Would that help? Maybe [~susanxhuynh] has an alternative here.



> Spark-Mesos integration doesn't respect request to abort itself
> ---------------------------------------------------------------
>
>                 Key: SPARK-24641
>                 URL: https://issues.apache.org/jira/browse/SPARK-24641
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Shuffle
>    Affects Versions: 2.2.0
>            Reporter: Igor Berman
>            Priority: Major
>
> Hi,
> Lately we came across the following corner scenario:
> We are using dynamic allocation with an external shuffle service that is managed by Marathon.
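> (For reference, a minimal sketch of the kind of driver-side configuration assumed by such a setup; the values here are illustrative, not taken from the issue:)
> {code:scala}
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")  // scale executors up and down with load
  .set("spark.shuffle.service.enabled", "true")    // executors serve shuffle data via the external service
  .set("spark.shuffle.service.port", "7337")       // same port the driver fails to reach in the log below
{code}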
>  
> Due to some network/operational issue, the external shuffle service on one of the machines (Mesos slaves)
is not available for a few seconds (e.g. Marathon hasn't yet provisioned the external shuffle
service on that particular node, but the framework itself has already accepted an offer on this node and tries
to start up an executor).
>  
> This makes the framework (Spark driver) fail, and I see an error in the driver's stderr (it seems
the Mesos agent asks the driver to abort itself); however, the SparkContext continues to run (seemingly
in a kind of zombie mode, since it can't release resources to the cluster and can't get additional
offers, because the framework is aborted from Mesos' perspective).
>  
> The framework moves to the "inactive" state in the Mesos UI.
> [~skonto] [~susanxhuynh] any input on this problem? Have you come across such behavior?
> I'm ready to work on a patch, but currently I don't understand where to start; it seems
the driver is too fragile in this sense, and something in the Mesos-Spark integration is missing.
>  
>  
> {code:java}
> I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
{code}
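> (One hypothetical direction for a patch, sketched below; this is not existing Spark code, and the class and method names are made up for illustration. When libmesos reports the fatal error / abort shown above, the Mesos scheduler callback could propagate it to the SparkContext so the driver fails fast instead of lingering in the zombie state:)
> {code:scala}
import org.apache.spark.SparkContext

// Sketch only: in practice this would be wired into the scheduler backend's
// Mesos error callback.
class AbortAwareHandler(sc: SparkContext) {
  // Called when Mesos asks the framework to abort
  // (the "Asked to abort the driver" line in the log above).
  def onSchedulerError(message: String): Unit = {
    System.err.println(s"Mesos scheduler error, stopping SparkContext: $message")
    // Stop the context so resources are released and the application exits,
    // rather than staying registered as an aborted framework that can never
    // receive new offers.
    sc.stop()
  }
}
{code}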
>  



