hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5773) RM recovery too slow due to LeafQueue#activateApplication()
Date Tue, 25 Oct 2016 04:28:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604144#comment-15604144 ]

Sunil G commented on YARN-5773:
-------------------------------

*Issues in recovery of apps:*
1. activateApplications works under a write lock.
2. If one application is found to exceed the AM resource limit, we do not break out of the
loop; we continue and go through all the apps in pendingOrderingPolicy. We may need to iterate
over all apps because apps can belong to different partitions and pendingOrderingPolicy does
not order apps by partition.
3. As mentioned by [~bibinchundatt], each time an app fails to get activated because it hits
the upper resource limit, one INFO log is emitted. During recovery, this is costly.
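
To make points 2 and 3 concrete, here is a minimal, self-contained sketch of the loop shape
described above. It is not the actual LeafQueue code: the class ActivationSketch, the PendingApp
type, the per-partition limit map, and the integer resource units are all hypothetical
simplifications, and the write lock from point 1 is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model (not the real LeafQueue) of a pass that cannot break early. */
class ActivationSketch {

    /** Hypothetical pending app: a partition label and an AM demand in abstract units. */
    static final class PendingApp {
        final String partition;
        final int amResource;

        PendingApp(String partition, int amResource) {
            this.partition = partition;
            this.amResource = amResource;
        }
    }

    int iterations = 0; // apps examined in the last pass
    int skipped = 0;    // apps over the limit; stands in for one INFO log line each

    /**
     * Walks every pending app. An app over its partition's AM limit is skipped with
     * "continue" rather than "break", because a later app may belong to a different
     * partition that still has room.
     */
    List<PendingApp> activate(List<PendingApp> pending, Map<String, Integer> amLimitByPartition) {
        Map<String, Integer> used = new HashMap<>();
        List<PendingApp> activated = new ArrayList<>();
        iterations = 0;
        skipped = 0;
        for (PendingApp app : pending) {
            iterations++;
            int limit = amLimitByPartition.getOrDefault(app.partition, 0);
            int inUse = used.getOrDefault(app.partition, 0);
            if (inUse + app.amResource > limit) {
                skipped++;  // the real code would emit an INFO log here
                continue;   // pending order is not grouped by partition
            }
            used.put(app.partition, inUse + app.amResource);
            activated.add(app);
        }
        return activated;
    }

    public static void main(String[] args) {
        List<PendingApp> pending = new ArrayList<>();
        pending.add(new PendingApp("x", 5));       // over the limit of partition "x"
        pending.add(new PendingApp("default", 1)); // fits in "default"
        pending.add(new PendingApp("x", 1));       // fits in "x"

        Map<String, Integer> limits = new HashMap<>();
        limits.put("x", 2);
        limits.put("default", 4);

        ActivationSketch sketch = new ActivationSketch();
        int activated = sketch.activate(pending, limits).size();
        // prints: activated=2 examined=3 skipped=1
        System.out.println("activated=" + activated
            + " examined=" + sketch.iterations + " skipped=" + sketch.skipped);
    }
}
```

With an empty limit map (a stand-in for "no node has registered yet"), the same pass examines
every app, activates none, and counts one would-be log line per app.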

[~leftnoteasy] and [~rohithsharma]
bq.If a given app's AM resource amount > AM headroom, should we skip that AM and activate the
following apps whose AM resource amount <= AM headroom?
bq.But one point to be considered is that head room changes with each node registration. So
user head room changes as new nodes register. This needs to be taken care of.
Currently activateApplications is invoked whenever there is a change in cluster resource. So any
change in cluster resource will ensure a call to activateApplications, and we can recalculate
this headroom there. I am not very sure about the suggested map. Will this check come before
the existing AM resource percentage check for queue/partition (not user based)? Or
are we replacing these checks?

> RM recovery too slow due to LeafQueue#activateApplication()
> -----------------------------------------------------------
>
>                 Key: YARN-5773
>                 URL: https://issues.apache.org/jira/browse/YARN-5773
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-5773.0001.patch, YARN-5773.0002.patch
>
>
> # Submit 10K applications to the default queue.
> # All applications are in the accepted state.
> # Now restart the ResourceManager.
> For each application recovered, {{LeafQueue#activateApplications()}} is invoked, so the
> AM limit check is done even before the NodeManagers have registered.
> The total number of iterations for N applications is about {{N(N+1)/2}}; for {{10K}} applications
> that is about {{50000000}} iterations, causing the RM to take more than 10 min to become active.
> Since NM resources are not yet added during recovery, we should skip {{activateApplications()}}.
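
The iteration count quoted in the description can be checked with a short sketch. It assumes,
as the description states, that every recovered application triggers one pass over all
applications pending so far and that none of them is activated because no node resources
exist yet. The class and method names are hypothetical.

```java
/** Hypothetical helper that checks the N(N+1)/2 figure from the description. */
class RecoveryIterationCount {

    /** Closed form of 1 + 2 + ... + n; long avoids int overflow for large n. */
    static long closedForm(long n) {
        return n * (n + 1) / 2;
    }

    /** Simulates recovery: after the k-th app is recovered, one pass walks k pending apps. */
    static long simulated(int n) {
        long total = 0;
        for (int recovered = 1; recovered <= n; recovered++) {
            total += recovered;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(closedForm(10_000)); // prints 50005000
        System.out.println(simulated(10_000));  // prints 50005000
    }
}
```

For 10K applications the exact sum is 50005000, which the description rounds to about 50000000.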




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

