hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-270) RM scheduler event handler thread gets behind
Date Mon, 17 Dec 2012 20:26:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534264#comment-13534264 ]

Vinod Kumar Vavilapalli commented on YARN-270:
----------------------------------------------

Thanks for filing this, Thomas. IIRC, the event handler's upper limit is about 0.6 million.
Somehow we only focused on the number of nodes and never thought about the scaling issue with a
large number of applications. There are multiple solutions for this, in order of importance:
 - Make NodeManagers *NOT* blindly heartbeat irrespective of whether the previous heartbeat
has been processed.
 - Figure out any obvious bottlenecks in the scheduling code.
 - When all else fails, try to parallelize the scheduler dispatcher.
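
To illustrate the first option in the list above, here is a toy simulation, not Hadoop code. It assumes a single scheduler thread that drains a fixed number of events per heartbeat interval, and compares nodes that heartbeat blindly with nodes that skip a heartbeat while their previous one is still queued. The function name `simulate` and all of its parameters are made up for this sketch.

```python
from collections import deque


def simulate(ticks, nodes, events_per_tick, blind):
    """Return the peak scheduler queue depth seen during the simulation.

    ticks:           number of heartbeat intervals to simulate
    nodes:           number of (hypothetical) NodeManagers
    events_per_tick: events the single scheduler thread drains per interval
    blind:           True  -> every node heartbeats every interval
                     False -> a node skips its heartbeat while its previous
                              heartbeat is still waiting in the queue
    """
    queue = deque()
    in_flight = [False] * nodes  # True while a node's heartbeat is queued
    peak = 0

    for _ in range(ticks):
        # Each node decides whether to send a heartbeat this interval.
        for node in range(nodes):
            if blind or not in_flight[node]:
                queue.append(node)
                in_flight[node] = True

        peak = max(peak, len(queue))

        # The scheduler thread processes as many events as it can.
        for _ in range(min(events_per_tick, len(queue))):
            done = queue.popleft()
            in_flight[done] = False

    return peak


if __name__ == "__main__":
    print("blind peak:    ", simulate(10, 100, 50, blind=True))
    print("throttled peak:", simulate(10, 100, 50, blind=False))
```

In this model the blind backlog grows every interval in which the nodes produce more events than the scheduler can drain, while the throttled backlog can never exceed the number of nodes, since each node has at most one heartbeat outstanding.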
                
> RM scheduler event handler thread gets behind
> ---------------------------------------------
>
>                 Key: YARN-270
>                 URL: https://issues.apache.org/jira/browse/YARN-270
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.5
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800-node cluster where the RM scheduler event handler
> thread got behind processing events and basically became unusable.  It was still processing
> apps, but taking a long time (1 hr 45 minutes) to accept new apps.  This actually happened
> twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500 applications
> running.  There were another 250 apps that were in the SUBMITTED state in the RM, but the
> scheduler hadn't yet processed those to put them in the pending state.  We had about 15 queues,
> none of them hierarchical.  We also had plenty of space left on the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
