spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
Date Tue, 28 Apr 2015 16:02:08 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517280#comment-14517280 ]

Sean Owen commented on SPARK-5529:
----------------------------------

[~arov] CDH always picks up the latest upstream minor release in its own minor releases, and
back-ports maintenance-release fixes into its maintenance releases. CDH is on about the same
3-4 month cycle as Spark, so that is about as fast as one could expect; CDH 5.4 already
corresponds to Spark 1.3.x. This change isn't even in a Spark release yet, so yes, you probably
want it back-ported to 1.3. That has to happen before it can end up in CDH, though.

> BlockManager heartbeat expiration does not kill executor
> --------------------------------------------------------
>
>                 Key: SPARK-5529
>                 URL: https://issues.apache.org/jira/browse/SPARK-5529
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 1.2.0
>            Reporter: Hong Shen
>            Assignee: Hong Shen
>             Fix For: 1.4.0
>
>         Attachments: SPARK-5529.patch
>
>
> When I run a Spark job, one executor hangs. After 120s, its BlockManager is removed by the
> driver, but it takes about another half an hour before the executor itself is removed by the
> driver. Here is the log:
> {code}
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms
> ....
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
> {code}
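
The delay described in the report can be checked directly from the two timestamps in the log excerpt. The following is a minimal Python sketch, not Spark code: the helper names `seconds_between` and `heartbeat_expired` are hypothetical, the timestamp layout is assumed to be `yy/MM/dd HH:mm:ss`, and the strict "greater than" comparison is an assumption based on the word "exceeds" in the warning.

```python
from datetime import datetime

# Assumed layout of the timestamps in the quoted log lines (yy/MM/dd HH:mm:ss).
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from one log timestamp to another."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


def heartbeat_expired(elapsed_ms: int, timeout_ms: int) -> bool:
    """Hypothetical check: True when the silence is longer than the timeout."""
    return elapsed_ms > timeout_ms


# Values copied from the log excerpt in the issue description.
block_manager_removed = "15/02/02 14:58:43"
executor_lost = "15/02/02 15:26:55"

gap = seconds_between(block_manager_removed, executor_lost)
print(f"BlockManager removed -> executor lost: {gap:.0f} s ({gap / 60:.1f} min)")
print("heartbeat expired:", heartbeat_expired(147198, 120000))
```

With the values from the log, the gap between the BlockManager removal and the executor loss comes to a little over 28 minutes, which matches the "half an hour" in the description.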



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
