spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
Date Tue, 14 Mar 2017 08:13:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-19941:
------------------------------
    Affects Version/s:     (was: 2.2.0)
                       2.1.0
           Issue Type: Improvement  (was: Bug)

> Spark should not schedule tasks on executors on decommissioning YARN nodes
> --------------------------------------------------------------------------
>
>                 Key: SPARK-19941
>                 URL: https://issues.apache.org/jira/browse/SPARK-19941
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, YARN
>    Affects Versions: 2.1.0
>         Environment: Hadoop 2.8.0-rc1
>            Reporter: Karthik Palaniappan
>
> Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in YARN: https://issues.apache.org/jira/browse/YARN-914
> Essentially you can mark nodes to be decommissioned and let them (a) finish work in progress and (b) finish serving shuffle data, but no new work will be scheduled on the node.
> Spark should respect NMs that are marked for decommissioning, and similarly decommission executors on those nodes by not scheduling any more tasks on them.
> It looks like in the future YARN may inform the app master when containers will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I don't think Spark should schedule based on a timeout. We should gracefully decommission the executor as fast as possible (which is the spirit of YARN-914). The app master can query the RM for NM statuses (if it doesn't already have them) and stop scheduling on executors on NMs that are decommissioning.
> Stretch feature: the timeout may be useful in determining whether running further tasks on the executor is even helpful. Spark may be able to tell that shuffle data will not be consumed by the time the node is decommissioned, so it is not worth computing, and the executor can be killed immediately.
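
A rough sketch of the proposal above, for illustration only: the AM already
receives node state changes through AMRMClientAsync.CallbackHandler#onNodesUpdated,
and Hadoop 2.8 adds a distinct DECOMMISSIONING value to NodeState, so the driver
could track draining hosts and consult that set before launching tasks. The names
DecommissionTracker and canScheduleOn below are hypothetical, not existing Spark
internals.

    import java.util.{List => JList}
    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.api.records.{NodeReport, NodeState}

    object DecommissionTracker {
      // Hosts whose NodeManagers are draining. Updated from the single
      // AMRMClientAsync callback thread and read by the scheduler, so a
      // volatile immutable set is enough for a sketch.
      @volatile private var decommissioningHosts = Set.empty[String]

      // Wire this up from AMRMClientAsync.CallbackHandler#onNodesUpdated,
      // which the RM invokes whenever node states change.
      def onNodesUpdated(updatedNodes: JList[NodeReport]): Unit = {
        updatedNodes.asScala.foreach { report =>
          val host = report.getNodeId.getHost
          report.getNodeState match {
            case NodeState.DECOMMISSIONING => decommissioningHosts += host
            case NodeState.RUNNING         => decommissioningHosts -= host // recommissioned
            case _                         => // DECOMMISSIONED/LOST already surface as executor loss
          }
        }
      }

      // The task scheduler would consult this before offering an executor a task.
      def canScheduleOn(executorHost: String): Boolean =
        !decommissioningHosts.contains(executorHost)
    }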
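
For the stretch feature, Hadoop 2.8 does not expose a per-node decommission
deadline to the AM (that is the YARN-3784 direction), so the following is purely
hypothetical: given such a deadline, the driver could skip tasks whose results
could not be produced and consumed before the node goes away.

    // Hypothetical: deadlineMillis would come from YARN-3784-style
    // information that Hadoop 2.8 does not provide.
    case class DrainingNode(host: String, deadlineMillis: Long)

    def worthScheduling(estimatedTaskMillis: Long,
                        drain: Option[DrainingNode],
                        now: Long = System.currentTimeMillis()): Boolean =
      drain match {
        case None => true // node is not draining; schedule normally
        case Some(info) =>
          // Run the task only if it can plausibly finish before the
          // decommission deadline; otherwise kill the executor immediately.
          now + estimatedTaskMillis < info.deadlineMillis
      }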



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

