hadoop-yarn-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-270) RM scheduler event handler thread gets behind
Date Tue, 18 Dec 2012 15:12:15 GMT

    [ https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534953#comment-13534953 ]

Robert Joseph Evans commented on YARN-270:
------------------------------------------

It cannot exert back pressure currently, but I don't see any reason it could not be
added in the future. Something as simple as setting a high-water mark on the number
of pending events and throttling events from incoming connections until the congestion
subsides would do it.
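
To make that concrete, here is a minimal sketch of such a throttle in Java. This is an
illustration only, not YARN's actual AsyncDispatcher API; the class name, the water marks,
and the thresholds are all made up:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: block producers once the pending-event queue
    // crosses a high-water mark, instead of buffering without bound.
    public class BackPressureDispatcher<E> {
      private static final int HIGH_WATER_MARK = 100_000; // assumed threshold
      private static final int LOW_WATER_MARK  = 50_000;  // resume threshold

      private final BlockingQueue<E> pending = new LinkedBlockingQueue<>();
      private final Object throttle = new Object();

      // Called from incoming connections; blocks while congested.
      public void dispatch(E event) throws InterruptedException {
        synchronized (throttle) {
          while (pending.size() >= HIGH_WATER_MARK) {
            throttle.wait(); // hold the caller until the queue drains
          }
        }
        pending.put(event); // slight overshoot is possible; fine for a sketch
      }

      // Called by the single event handler thread.
      public E take() throws InterruptedException {
        E event = pending.take();
        if (pending.size() <= LOW_WATER_MARK) {
          synchronized (throttle) {
            throttle.notifyAll(); // congestion subsided; release producers
          }
        }
        return event;
      }
    }

Blocking the caller in dispatch() pushes the congestion back onto the incoming connections
rather than letting the pending queue grow without bound.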

We have seen a similar issue in the IPC layer on the AM when too many reducers were trying
to download the mapper locations. Granted, this is not the same code, but it was caused by
handling events asynchronously and buffering up the data, so when we got behind we eventually
got OOMs. I think we will continue to see more issues as we scale up until we solve this
generally; otherwise every single client API call will eventually have to be updated to avoid
overloading the system.
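
For illustration only (this is not the actual IPC code), the failure mode is roughly the
following toy program: a fast producer feeding an unbounded queue that a slower consumer
drains, so heap use climbs until an OutOfMemoryError:

    import java.util.concurrent.LinkedBlockingQueue;

    // Toy demonstration of the failure mode described above: an unbounded
    // queue between a fast producer and a slow consumer grows until the
    // heap is exhausted. Run with a small -Xmx to see the OOM quickly.
    public class UnboundedBufferDemo {
      public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<byte[]> events = new LinkedBlockingQueue<>();

        Thread consumer = new Thread(() -> {
          try {
            while (true) {
              events.take();
              Thread.sleep(10); // the handler falls behind the arrival rate
            }
          } catch (InterruptedException ignored) { }
        });
        consumer.setDaemon(true);
        consumer.start();

        while (true) {
          // With no back pressure, buffered payloads accumulate and memory
          // use climbs until an OutOfMemoryError is thrown.
          events.put(new byte[64 * 1024]);
        }
      }
    }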
                
> RM scheduler event handler thread gets behind
> ---------------------------------------------
>
>                 Key: YARN-270
>                 URL: https://issues.apache.org/jira/browse/YARN-270
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.5
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800-node cluster where the RM scheduler event handler
> thread got behind processing events and basically became unusable. It was still processing
> apps, but taking a long time (1 hr 45 minutes) to accept new apps. This actually happened
> twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500 applications
> running. There were another 250 apps in the SUBMITTED state in the RM that the scheduler
> hadn't yet processed to put into the pending state. We had about 15 queues, none of them
> hierarchical. We also had plenty of space left on the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
