hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7086) Release all containers aynchronously
Date Wed, 23 Aug 2017 19:18:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138915#comment-16138915 ]

Jason Lowe commented on YARN-7086:
----------------------------------

We've noticed container release is particularly painful as well, although we haven't seen
it deadlock.

Whether or not we do this asynchronously, one issue is that releasing a bunch of containers
requires grabbing a highly contended lock for every container released.  Do this in a loop
and it takes a long time, since acquiring the lock is not cheap.  Async scheduling helps
because we can wait in some other thread rather than in the AM handler threads or the scheduler
dispatcher thread, but looping through all those events will still take a long time.  I
think it would be a lot better if there were a bulk-release interface so we could grab the
critical lock once.  We can limit how many containers we release per batch if we're worried
about holding that lock for too long, but I don't think it's so much the actual work per
container as it is the time spent waiting for the lock that makes this so painful.
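
To make the bulk-release idea concrete, here is a minimal sketch contrasting the two approaches.
It is not YARN code: BulkReleaseSketch, releaseOneByOne, releaseInBatches, and maxBatch are
hypothetical names, and a plain ReentrantLock stands in for the contended scheduler lock.  The
only point is that batching takes the lock once per batch instead of once per container.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a scheduler; these are NOT the real YARN classes. */
class BulkReleaseSketch {
    // Stands in for the highly contended scheduler lock (assumption).
    private final ReentrantLock schedulerLock = new ReentrantLock();
    // Both fields below are only modified while holding schedulerLock.
    private final List<String> released = new ArrayList<>();
    private int lockAcquisitions = 0;

    /** Per-container release: one lock acquisition for every container. */
    void releaseOneByOne(List<String> containerIds) {
        for (String id : containerIds) {
            schedulerLock.lock();
            try {
                lockAcquisitions++;
                released.add(id); // placeholder for the real per-container work
            } finally {
                schedulerLock.unlock();
            }
        }
    }

    /** Bulk release: one lock acquisition per batch of at most maxBatch containers. */
    void releaseInBatches(List<String> containerIds, int maxBatch) {
        if (maxBatch <= 0) {
            throw new IllegalArgumentException("maxBatch must be positive");
        }
        for (int start = 0; start < containerIds.size(); start += maxBatch) {
            int end = Math.min(start + maxBatch, containerIds.size());
            schedulerLock.lock();
            try {
                lockAcquisitions++;
                for (String id : containerIds.subList(start, end)) {
                    released.add(id); // same placeholder work, lock already held
                }
            } finally {
                // Dropping the lock between batches bounds how long it is held.
                schedulerLock.unlock();
            }
        }
    }

    // Intended to be read after the release calls have returned.
    int lockAcquisitions() { return lockAcquisitions; }
    int releasedCount() { return released.size(); }
}
```

In this sketch, releasing 250 containers with maxBatch = 100 takes the lock 3 times rather
than 250.  The batch limit is the knob for trading lock hold time against the number of
acquisitions.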


> Release all containers aynchronously
> ------------------------------------
>
>                 Key: YARN-7086
>                 URL: https://issues.apache.org/jira/browse/YARN-7086
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>
> We have noticed in production two situations that can cause deadlocks and bring scheduling
> of new containers to a halt, especially for applications that have a lot
> of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, and the scheduler releases
> all their live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make sure ALL
> container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc [~leftnoteasy],
> [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the AbstractYarnScheduler
> and a corresponding scheduler event, which is currently used specifically for the container-update
> code paths (where the scheduler releases temp containers which it creates for the update).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


