hadoop-yarn-issues mailing list archives

From "Yufei Gu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
Date Tue, 20 Jun 2017 23:26:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056662#comment-16056662
] 

Yufei Gu edited comment on YARN-6710 at 6/20/17 11:25 PM:
----------------------------------------------------------

Thanks [~daemon] for the detailed information.

Basically you are saying that the latency in handling APP_ATTEMPT_REMOVED causes two issues: 1)
the amResourceUsage issue, which has been fixed in YARN-3415, and 2) the RM shouldn't assign any
container to the application if its appAttempt has finished but still has resource requests. Issue
2 seems a legitimate issue. To me, it is more a design issue in the AM (MapReduce, Spark) than
an RM issue. I am not sure how the scheduler checks the status of the application attempt in
that situation. If the scheduler already knows the app attempt has finished, it shouldn't assign
any resources to it at all. Better to check whether that part is already there before we move on.
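
To make the point about issue 2 concrete, here is a minimal sketch of the kind of guard meant, written as a toy model rather than FairScheduler code. The `Attempt` class, its `stopped` flag, and `tryAssign` are hypothetical names invented for illustration; whether an equivalent check already exists in the scheduler is exactly the open question above.

```java
// Toy model only, NOT the Hadoop source. All names here are hypothetical.
class StoppedAttemptGuardSketch {

    static class Attempt {
        boolean stopped = false;  // assumed to be set once the attempt has finished
        int pendingRequests;      // outstanding resource requests
        int assignedContainers = 0;

        Attempt(int pendingRequests) {
            this.pendingRequests = pendingRequests;
        }
    }

    // Returns true if a container was assigned to the attempt.
    static boolean tryAssign(Attempt attempt) {
        // The guard: a finished attempt gets nothing, even if requests remain.
        if (attempt.stopped) {
            return false;
        }
        if (attempt.pendingRequests == 0) {
            return false;
        }
        attempt.pendingRequests--;
        attempt.assignedContainers++;
        return true;
    }

    public static void main(String[] args) {
        Attempt attempt = new Attempt(2);
        System.out.println("running attempt assigned: " + tryAssign(attempt));
        attempt.stopped = true;  // attempt finishes with one request left over
        System.out.println("stopped attempt assigned: " + tryAssign(attempt));
    }
}
```

With such a guard in place, leftover requests of a finished attempt would be ignored while the removal event is still waiting to be handled.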



was (Author: yufeigu):
Thanks [~daemon] for the detailed information.

Basically you are saying the latency of handling APP_ATTEMPT_REMOVED cause some issues: 1)
the amResourceUsage issue which has been fixed in YARN-3415, 2) RM shouldn't assign any container
to the application if its appAttempt has finished and there are still resource requests. Issue
2 seems a legitimate issue. For me, it is more a design issue in AM(Mapreduce, Spark) instead
of RM than an RM issue. I am not sure how the scheduler check the status of application attempt
for that situation. If the scheduler already know app attempt has finished, it shouldn't assign
any resources to it at all. Better to check if that part is already here before we move on.


> There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler
not assign container to the queue
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6710
>                 URL: https://issues.apache.org/jira/browse/YARN-6710
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: daemon
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png,
screenshot-5.png
>
>
> There are over three thousand nodes in my Hadoop production cluster, and we use the fair
scheduler as our scheduler.
> Although the resource manager has plenty of free resources, there are 46 applications
pending.
> Those applications still could not run after several hours, and in the end I had to stop them.
> I reproduced the situation in my test environment and found a bug in FSLeafQueue.
> In an extreme scenario, FSLeafQueue#amResourceUsage becomes greater than its real value.
> When the fair scheduler tries to assign a container to an application attempt, it performs the
following check:
> !screenshot-2.png!
> !screenshot-3.png!
> Because the value of FSLeafQueue#amResourceUsage is invalid, it is greater than its
real value.
> So when the value of amResourceUsage is greater than the value of Resources.multiply(getFairShare(),
maxAMShare),
> the FSLeafQueue#canRunAppAM function returns false, which prevents the fair scheduler
from assigning a container
> to the FSAppAttempt.
> In this scenario, all the application attempts stay pending and never get any resources.
> I found the reason why so many applications in my leaf queue are pending. I describe
it as follows:
> When the fair scheduler first assigns a container to the application attempt, it does something
like the following:
> !screenshot-4.png!
> When the fair scheduler removes the application attempt from the leaf queue, it does something
like the following:
> !screenshot-5.png!
> But when the application attempt unregisters itself and all the containers in SchedulerApplicationAttempt#liveContainers
> have completed, an APP_ATTEMPT_REMOVED event is sent to the fair scheduler, but
it is handled asynchronously.
> Before the application attempt is removed from the FSLeafQueue, there may still be pending requests
in the FSAppAttempt.
> The fair scheduler will assign a container to the FSAppAttempt, and the size of
liveContainers will then equal
> 1.
> So the FSLeafQueue adds the container's resource to FSLeafQueue#amResourceUsage,
which makes
> the value of amResourceUsage greater than its real value.
> In the end, the value of FSLeafQueue#amResourceUsage is pretty large although there is
no application
> in the queue.
> When a new application arrives and the value of FSLeafQueue#amResourceUsage is greater than
the value
> of Resources.multiply(getFairShare(), maxAMShare), the scheduler never assigns a
container to
> the queue.
> All of the applications in the queue stay pending forever.
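
The accounting drift described in the report can be illustrated with a simplified model. This is a sketch under stated assumptions, not the code from the missing screenshots: resources are reduced to a single number (memory in MB), `LeafQueue`, `Attempt`, and `simulate` are hypothetical stand-ins, the check is assumed to compare usage plus the requested AM resource against fairShare * maxAMShare, and removal is assumed to subtract the attempt's AM resource once.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only, NOT the Hadoop source. Names and numbers are illustrative.
class AmUsageDriftSketch {

    static class LeafQueue {
        long amResourceUsage = 0;  // stand-in for FSLeafQueue#amResourceUsage
        final long fairShare;      // stand-in for getFairShare()
        final double maxAMShare;

        LeafQueue(long fairShare, double maxAMShare) {
            this.fairShare = fairShare;
            this.maxAMShare = maxAMShare;
        }

        // Assumed form of the check: a new AM may start only if it still fits.
        boolean canRunAppAM(long amResource) {
            long maxAMResource = (long) (fairShare * maxAMShare);
            return amResourceUsage + amResource <= maxAMResource;
        }
    }

    static class Attempt {
        final LeafQueue queue;
        final long amResource;
        final List<Long> liveContainers = new ArrayList<>();
        boolean amRunning = false;

        Attempt(LeafQueue queue, long amResource) {
            this.queue = queue;
            this.amResource = amResource;
        }

        void allocate(long containerResource) {
            liveContainers.add(containerResource);
            // As in the report: a sole live container is treated as the AM.
            if (liveContainers.size() == 1) {
                queue.amResourceUsage += containerResource;
                amRunning = true;
            }
        }

        void completeAllContainers() {
            liveContainers.clear();
        }

        // Assumption: removal subtracts the AM resource exactly once.
        void removeFromQueue() {
            if (amRunning) {
                queue.amResourceUsage -= amResource;
            }
        }
    }

    // Runs several attempts; lateAllocation models a container assigned after
    // the attempt finished but before APP_ATTEMPT_REMOVED was handled.
    static LeafQueue simulate(int attempts, boolean lateAllocation) {
        LeafQueue queue = new LeafQueue(10240, 0.5);
        for (int i = 0; i < attempts; i++) {
            Attempt attempt = new Attempt(queue, 1024);
            attempt.allocate(1024);           // AM container, counted
            attempt.allocate(2048);           // task container, not counted
            attempt.completeAllContainers();  // attempt unregisters
            if (lateAllocation) {
                attempt.allocate(2048);       // counted again as if it were an AM
            }
            attempt.removeFromQueue();
        }
        return queue;
    }

    public static void main(String[] args) {
        LeafQueue clean = simulate(3, false);
        LeafQueue drifted = simulate(3, true);
        System.out.println("clean usage=" + clean.amResourceUsage
                + " canRun=" + clean.canRunAppAM(1024));
        System.out.println("drifted usage=" + drifted.amResourceUsage
                + " canRun=" + drifted.canRunAppAM(1024));
    }
}
```

In this model each late allocation leaves 2048 MB behind in the counter, so after three attempts the empty queue reports 6144 MB of AM usage against a 5120 MB limit and refuses every new AM, while the run without late allocations returns to 0.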



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

