spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-17449) Relation between heartbeatInterval and network timeout
Date Wed, 14 Sep 2016 08:04:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned SPARK-17449:
---------------------------------

    Assignee: Sean Owen

> Relation between heartbeatInterval and network timeout
> ------------------------------------------------------
>
>                 Key: SPARK-17449
>                 URL: https://issues.apache.org/jira/browse/SPARK-17449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Yang Liang
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.1.0
>
>
> $ spark-shell --master yarn --conf spark.executor.heartbeatInterval=20s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 168136 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode16: Executor heartbeat timed out after 168136 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 11949 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode31: Executor heartbeat timed out after 11949 ms
> spark-shell --master yarn --conf spark.executor.heartbeatInterval=200s --conf spark.network.timeout=10s --num-executors 1
> WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 39299 ms exceeds timeout 10000 ms
> ERROR YarnScheduler: Lost executor 1 on datanode19: Executor heartbeat timed out after 39299 ms
> Source Code:
> spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
> /**
>  * A heartbeat from executors to the driver. This is a shared message used by several internal
>  * components to convey liveness or execution information for in-progress tasks. It will also
>  * expire the hosts that have not heartbeated for more than spark.network.timeout.
>  */
> private val executorTimeoutMs =
>     sc.conf.getTimeAsSeconds("spark.network.timeout",s"${slaveTimeoutMs}ms") * 1000
> The relation between spark.network.timeout and spark.executor.heartbeatInterval should at
> least be mentioned in the documentation. Otherwise the errors above are confusing. Could
> the settings also be checked when they are read?
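
The check suggested above could look roughly like the following. This is only a minimal sketch, not Spark's implementation: the object name HeartbeatConfigCheck and the helpers parseMs and check are hypothetical, and the parser is assumed to handle only the "s" and "ms" suffixes seen in this report. It flags a heartbeat interval that is not shorter than the network timeout, as in the second and third invocations. It would not flag the first invocation, where 20s is already well below the 120000 ms timeout.

```scala
object HeartbeatConfigCheck {
  // Hypothetical helper: parse a duration such as "200s" or "10000ms" into milliseconds.
  // Only the "ms" and "s" suffixes are supported in this sketch.
  def parseMs(value: String): Long = {
    val v = value.trim.toLowerCase
    if (v.endsWith("ms")) v.dropRight(2).toLong
    else if (v.endsWith("s")) v.dropRight(1).toLong * 1000L
    else throw new IllegalArgumentException(s"Unsupported duration: $value")
  }

  // Hypothetical check: returns Some(warning) when the heartbeat interval is
  // greater than or equal to the network timeout, None otherwise.
  def check(heartbeatInterval: String, networkTimeout: String): Option[String] = {
    val heartbeatMs = parseMs(heartbeatInterval)
    val timeoutMs = parseMs(networkTimeout)
    if (heartbeatMs >= timeoutMs)
      Some(s"spark.executor.heartbeatInterval ($heartbeatMs ms) should be less than " +
        s"spark.network.timeout ($timeoutMs ms)")
    else None
  }

  def main(args: Array[String]): Unit = {
    // Settings from the second and third invocations in the report.
    println(check("200s", "10s"))
    // Settings from the first invocation (network timeout of 120000 ms as logged).
    println(check("20s", "120s"))
  }
}
```

Returning an Option rather than throwing leaves it to the caller to decide whether a bad combination should be a warning or a hard failure.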



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

