hadoop-yarn-issues mailing list archives

From Íñigo Goiri (JIRA) <j...@apache.org>
Subject [jira] [Commented] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.
Date Thu, 28 Feb 2019 21:18:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780929#comment-16780929
] 

Íñigo Goiri commented on YARN-999:
----------------------------------

For completeness, I did a full test end to end on Azure.
I created a cluster with 2 RMs in HA (using the capacity scheduler) and 1 NM with 6GB of memory
running the current trunk (which includes the REST API from YARN-996).
Then I did the following:
# Started a long TeraGen job with 1 AM (2GB) and 2 Mappers (1GB each).
# Used the REST API to decrease the memory to 3GB. This triggered a reduction in resources
and showed -1GB of available memory. It never killed any container.
# Replaced the RM jars with the code in [^YARN-999.007.patch] and made it active.
# Decreased the resources again to 3GB with a timeout of 10 seconds (changing the resources
through the admin interface is not HA). This again showed -1GB of available memory, but this
time it sent a preemption message to the AM (not killing, just notifying that one of the mappers
should be preempted).
# After 10 seconds, the RM killed the container.
# Decreased the resources to 2GB with a timeout of 0 seconds. This killed the remaining mapper
immediately.
# Increased the resources again to 4GB. The AM started the 2 mappers again.
# Decreased the resources to 2GB again with a timeout of 1 minute. This triggered the notification
to the AM for preemption.
# Increased the resources to 4GB again. This aborted the overcommit protocol and the job
ran for 5 more minutes with no issues.
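The resource updates in the steps above can be sketched roughly as follows. This is a minimal illustration, not the actual client used in the test: the RM address, the node id, the endpoint path (`/ws/v1/cluster/nodes/{nodeid}/resource`), and the JSON field names (`resource`, `vCores`, `overCommitTimeout`) are assumptions based on the REST API added in YARN-996; check the RM webapp documentation before relying on them.

```python
import json

# Placeholder RM address; in the test above this was the active RM of the
# 2-RM HA pair. The endpoint path and field names below are assumptions
# based on the YARN-996 REST API, not verified against a live cluster.
RM_URL = "http://resourcemanager:8088"

def build_resource_update(node_id, memory_mb, vcores, overcommit_timeout):
    """Return (url, json_body) for a node resource update request.

    overcommit_timeout is in seconds; after it expires the RM kills the
    containers that still exceed the new capacity. A timeout of 0 kills
    them immediately, matching step 6 above.
    """
    url = "{}/ws/v1/cluster/nodes/{}/resource".format(RM_URL, node_id)
    body = json.dumps({
        "resource": {"memory": memory_mb, "vCores": vcores},
        "overCommitTimeout": overcommit_timeout,
    })
    return url, body

# Step 4 above: shrink the 6GB NM to 3GB with a 10 second overcommit
# timeout ("nm-host:8041" is a hypothetical node id).
url, body = build_resource_update("nm-host:8041", 3072, 2, 10)
print(url)
print(body)
```

The returned URL and body would then be sent as a POST (e.g. with curl) to the active RM; since the admin interface is not HA, the REST path is the one that works across failover in the test above.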

All these tests are also covered by the unit test for both the Capacity Scheduler and the
Fair Scheduler.

> In case of long running tasks, reduce node resource should balloon out resource quickly
by calling preemption API and suspending running task. 
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-999
>                 URL: https://issues.apache.org/jira/browse/YARN-999
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: graceful, nodemanager, scheduler
>            Reporter: Junping Du
>            Assignee: Íñigo Goiri
>            Priority: Major
>         Attachments: YARN-291.000.patch, YARN-999.001.patch, YARN-999.002.patch, YARN-999.003.patch,
YARN-999.004.patch, YARN-999.005.patch, YARN-999.006.patch, YARN-999.007.patch
>
>
> In the current design and implementation, when we decrease a node's resources to less than
the resource consumption of its currently running tasks, those tasks keep running until they
finish. No new tasks get assigned to this node (because AvailableResource < 0) until some tasks
finish and AvailableResource > 0 again. This is good for most cases, but for long running tasks
it can be too slow for the resource setting to actually take effect, so preemption could be
used here.
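The arithmetic behind the "-1GB available" readings in the test above is simple to illustrate. This is a toy sketch of the described behavior, not YARN scheduler code; the numbers are taken from the cluster setup above (6GB NM, 2GB AM, two 1GB mappers).

```python
# After the admin reduced the node from 6144MB to 3072MB while the
# AM (2GB) and two mappers (1GB each) were still running:
node_capacity_mb = 3072
used_mb = 2048 + 1024 + 1024

# Available resource goes negative; without preemption the scheduler
# simply stops placing new containers on the node until tasks finish.
available_mb = node_capacity_mb - used_mb
print(available_mb)  # -1024, i.e. the "-1GB available" seen in the test

can_schedule_new_container = available_mb > 0
print(can_schedule_new_container)  # False
```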



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

