spark-issues mailing list archives

From "Lianhui Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed
Date Wed, 04 Feb 2015 02:30:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509 ]

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:29 AM:
-------------------------------------------------------------

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes this BlockManager, but the executor hosting this BlockManager does not time out, because its Akka heartbeat is normal. Since the BlockManager lives inside the executor, if the BlockManager is removed, the executor it runs in should be removed too.
This matters especially when dynamicAllocation is enabled: allocationManager listens for onBlockManagerRemoved and removes this executor from its bookkeeping, but in CoarseGrainedSchedulerBackend the executor is actually still in executorDataMap.
[~andrewor14] When BlockManagerMasterActor removes a BlockManager because its heartbeat timed out, we need to check whether the executor of this BlockManager has already been removed. If it has not, we should remove this executor first. What do you think of solving the problem this way?
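A minimal sketch of the check proposed above, assuming a toy model of the driver's bookkeeping. `DriverState`, `remove_executor`, and `expire_dead_block_managers` are hypothetical names, not Spark's classes or methods; `executors` only stands in for CoarseGrainedSchedulerBackend's executorDataMap, and `block_managers` for BlockManagerMasterActor's registry of last heartbeat times.

```python
from dataclasses import dataclass, field


@dataclass
class DriverState:
    """Toy model of the driver's bookkeeping (hypothetical; not Spark's API).

    executors      stands in for CoarseGrainedSchedulerBackend.executorDataMap.
    block_managers stands in for BlockManagerMasterActor's registry and maps
                   executor id -> time of last BlockManager heartbeat (ms).
    """
    timeout_ms: int = 120_000
    executors: set = field(default_factory=set)
    block_managers: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def remove_executor(self, executor_id: str, reason: str) -> None:
        """Drop the executor and its BlockManager together."""
        self.executors.discard(executor_id)
        self.events.append(f"removed executor {executor_id}: {reason}")
        if self.block_managers.pop(executor_id, None) is not None:
            self.events.append(f"removed BlockManager {executor_id}")

    def expire_dead_block_managers(self, now_ms: int) -> None:
        """Expire BlockManagers whose last heartbeat is older than timeout_ms."""
        for executor_id, last_seen in list(self.block_managers.items()):
            elapsed = now_ms - last_seen
            if elapsed <= self.timeout_ms:
                continue
            reason = f"no recent heart beats: {elapsed}ms exceeds {self.timeout_ms}ms"
            if executor_id in self.executors:
                # Proposed check: the executor is still registered, so remove it
                # first instead of leaving it behind in the scheduler's map.
                self.remove_executor(executor_id, reason)
            else:
                # Executor already gone: only the BlockManager entry remains.
                del self.block_managers[executor_id]
                self.events.append(f"removed BlockManager {executor_id}")


# Replay the situation from the log: last heartbeat 147198 ms before the check.
state = DriverState(executors={"1"}, block_managers={"1": 0})
state.expire_dead_block_managers(now_ms=147_198)
print(state.events)
```

In Spark 1.2 the two maps live in separate components (BlockManagerMasterActor and CoarseGrainedSchedulerBackend), so a real fix would need one of them to notify the other rather than share state directly as this toy model does.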



was (Author: lianhuiwang):
The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes this BlockManager, but the executor hosting this BlockManager does not time out, because its Akka heartbeat is normal.
When dynamicAllocation is enabled, allocationManager listens for onBlockManagerRemoved and removes this executor from its bookkeeping, but in CoarseGrainedSchedulerBackend the executor is actually still in executorDataMap. At this point the state is inconsistent.
[~andrewor14] When BlockManagerMasterActor removes a BlockManager because its heartbeat timed out, we need to check whether the executor of this BlockManager has already been removed. If it has not, we should remove this executor first. What do you think of solving the problem this way?


> Executor is still hold while BlockManager has been removed
> ----------------------------------------------------------
>
>                 Key: SPARK-5529
>                 URL: https://issues.apache.org/jira/browse/SPARK-5529
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>
> When I run a Spark job, one executor gets stuck. After 120s its BlockManager is removed by the driver, but the executor itself is not removed by the driver until about half an hour later. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms
> ....
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
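The delay described in the report can be checked directly against the log: the WARN line at 14:58:43 shows the BlockManager being expired, while executor 1 is only lost at 15:26:55. The snippet below (Python is used only for illustration; the timestamp format string is an assumption based on how the log lines look) computes that gap and how far past the 120 s timeout the heartbeat was.

```python
from datetime import datetime

# Both timestamps are copied from the log lines quoted above (yy/MM/dd HH:mm:ss).
FMT = "%y/%m/%d %H:%M:%S"
block_manager_removed = datetime.strptime("15/02/02 14:58:43", FMT)
executor_lost = datetime.strptime("15/02/02 15:26:55", FMT)

# How long executor 1 stayed registered after its BlockManager was dropped.
gap = executor_lost - block_manager_removed
print(gap)  # 0:28:12

# The heartbeat check that fired at 14:58:43 (values from the WARN line).
elapsed_ms, timeout_ms = 147_198, 120_000
print(elapsed_ms > timeout_ms, elapsed_ms - timeout_ms)  # True 27198
```

So executor 1 remained registered for 28 minutes and 12 seconds after its BlockManager had already been removed.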



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


