hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
Date Fri, 06 Mar 2015 15:43:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350487#comment-14350487 ]

Jason Lowe commented on YARN-3136:

bq. createReleaseCache is only called in serviceInit, so I think it should be fine.

But createReleaseCache schedules a timer task that, sometime much later, tries to walk the
applications map without a lock.  It may set up the timer during serviceInit, but is it guaranteed
that there's no contention when this timer task finally runs?  Maybe I'm missing something.
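
To illustrate the concern, here is a minimal sketch (not the actual scheduler code) of a timer task that walks a map with no lock held, long after it was scheduled.  The class ReleaseCacheSketch, its methods, and the String key/value types are hypothetical stand-ins.  The walk is only safe here because the map is a ConcurrentHashMap, whose iterators are weakly consistent; with a plain HashMap the same walk could throw ConcurrentModificationException or see inconsistent state while other threads add or remove applications.

```java
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReleaseCacheSketch {
    // Hypothetical stand-in for the scheduler's applications map.
    private final ConcurrentMap<String, String> applications =
        new ConcurrentHashMap<>();

    public void addApplication(String appId, String user) {
        applications.put(appId, user);
    }

    public void removeApplication(String appId) {
        applications.remove(appId);
    }

    // Schedules a one-shot task that walks the map without taking any lock.
    // The task runs on the timer thread, possibly while other threads are
    // still mutating the map.
    public CompletableFuture<Integer> scheduleWalk(Timer timer, long delayMs) {
        CompletableFuture<Integer> visited = new CompletableFuture<>();
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    int count = 0;
                    // Weakly consistent iteration: never throws
                    // ConcurrentModificationException on a ConcurrentHashMap.
                    for (Map.Entry<String, String> entry : applications.entrySet()) {
                        count++;
                    }
                    visited.complete(count);
                } catch (RuntimeException e) {
                    visited.completeExceptionally(e);
                }
            }
        }, delayMs);
        return visited;
    }
}
```

Entries added or removed while the walk is in progress may or may not be observed, so a task like this has to tolerate a slightly stale view of the map.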

bq. I have a general question: is AbstractYarnScheduler supposed to be public for external
use?

I wondered the same.  By far the simplest thing to do here is to just document (or require,
by changing the type from Map to ConcurrentMap as I originally suggested) that the underlying
map must support concurrent access.  If we only expect AbstractYarnScheduler to be used by
the Fifo, Fair, and Capacity schedulers then we don't need to bother with the overhead of
an accessor method that can be overridden, etc.  Technically AbstractYarnScheduler was not
marked Public, so we should be able to update it without worrying about third-party use. 
Agree that we should mark it Private/Unstable going forward regardless of how we eventually
fix this.
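
As a rough sketch of the "change the type" option, assuming simplified stand-in classes rather than the real AbstractYarnScheduler: declaring the field as ConcurrentMap makes the compiler reject a subclass that tries to assign a plain HashMap, and lets read paths skip the scheduler lock.  BaseScheduler, FifoLikeScheduler, lookupUser and the String types below are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SchedulerTypeSketch {
    // Hypothetical, simplified stand-in for the abstract scheduler base class.
    public abstract static class BaseScheduler {
        // Declared as ConcurrentMap instead of Map, so every subclass must
        // supply a map that supports concurrent access.
        protected ConcurrentMap<String, String> applications;

        // Read path that does not take the scheduler lock; it relies on the
        // map's own thread safety.
        public String lookupUser(String appId) {
            return applications.get(appId);
        }
    }

    // Hypothetical concrete scheduler.
    public static class FifoLikeScheduler extends BaseScheduler {
        public FifoLikeScheduler() {
            this.applications = new ConcurrentHashMap<>();
            // this.applications = new java.util.HashMap<>();  // would not compile
        }

        // Writers may still hold the scheduler lock for their own invariants.
        public synchronized void addApplication(String appId, String user) {
            applications.put(appId, user);
        }
    }
}
```

The alternative of only documenting the requirement keeps the Map type but leaves enforcement to code review.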

> getTransferredContainers can be a bottleneck during AM registration
> -------------------------------------------------------------------
>                 Key: YARN-3136
>                 URL: https://issues.apache.org/jira/browse/YARN-3136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: scheduler
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Sunil G
>         Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch,
>                      0004-YARN-3136.patch, 0005-YARN-3136.patch
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting
> for the scheduler lock trying to call getTransferredContainers.  The scheduler lock is highly
> contended, especially on a large cluster with many nodes heartbeating, and it would be nice
> if we could find a way to eliminate the need to grab this lock during this call.  We've already
> done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler
> lock, and it would be good to do so here as well, if possible.
