hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-2594) ResourceManger sometimes become un-responsive
Date Thu, 25 Sep 2014 02:47:34 GMT

     [ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-2594:
-----------------------------
    Attachment: YARN-2594.patch

Attached a fix. It doesn't include a test because I found it would be hard to add one.
This is probably the simplest fix: just remove the read lock when getting currentAppAttempt in
RMApp.

The deadlock is:
1) SchedulerApplicationAttempt (lock *SchedulerApplicationAttempt*) containerComplete ->
RMContainerImpl -> updateAttemptMetrics (lock *RMApp*)

2) RPC:getApplicationReport -> RMAppImpl(lock *RMApp*) ->  ... -> SchedulerApplicationAttempt.getResourceUsageReport
(lock *SchedulerApplicationAttempt*)

Removing any of the four lock acquisitions would resolve the problem; the fix removes the one in
"updateAttemptMetrics (lock *RMApp*)".
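
To illustrate the lock ordering described above, here is a minimal sketch of the fixed arrangement. It is not the code from YARN-2594.patch: the class names (LockOrderSketch, App, Attempt), the use of a plain ReentrantLock as a stand-in for the RMApp lock, and the volatile currentAttempt field are all assumptions made for illustration. Path 1 holds only the attempt's monitor and reads the current attempt without taking the app lock, so it can never wait on path 2, which takes the app lock first and the attempt monitor second.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    /** Hypothetical stand-in for RMApp. */
    static class App {
        private final ReentrantLock appLock = new ReentrantLock();
        // volatile, so it can be read safely without holding appLock
        private volatile Attempt currentAttempt;

        void setCurrentAttempt(Attempt attempt) {
            appLock.lock();
            try {
                currentAttempt = attempt;
            } finally {
                appLock.unlock();
            }
        }

        // Called on path 1 while the attempt's monitor is held.
        // The sketch of the fix: no appLock is taken here.
        void updateAttemptMetrics() {
            Attempt attempt = currentAttempt;
            if (attempt != null) {
                attempt.metricsUpdates.incrementAndGet();
            }
        }

        // Path 2: app lock first, then the attempt's monitor.
        long getReport() {
            appLock.lock();
            try {
                Attempt attempt = currentAttempt;
                return attempt == null ? 0 : attempt.getResourceUsage();
            } finally {
                appLock.unlock();
            }
        }
    }

    /** Hypothetical stand-in for SchedulerApplicationAttempt. */
    static class Attempt {
        final AtomicLong metricsUpdates = new AtomicLong();
        private final App app;
        private long completedContainers; // guarded by this object's monitor

        Attempt(App app) {
            this.app = app;
        }

        // Path 1: attempt monitor first, then a call into the app.
        synchronized void containerCompleted() {
            completedContainers++;
            app.updateAttemptMetrics();
        }

        synchronized long getResourceUsage() {
            return completedContainers;
        }
    }

    /** Runs both paths concurrently; returns true if both finish in time. */
    static boolean runBothPaths(int iterations) {
        App app = new App();
        Attempt attempt = new Attempt(app);
        app.setCurrentAttempt(attempt);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> {
            for (int i = 0; i < iterations; i++) attempt.containerCompleted();
        });
        pool.submit(() -> {
            for (int i = 0; i < iterations; i++) app.getReport();
        });
        pool.shutdown();
        try {
            boolean finished = pool.awaitTermination(10, TimeUnit.SECONDS);
            if (!finished) pool.shutdownNow();
            return finished
                && attempt.getResourceUsage() == iterations
                && attempt.metricsUpdates.get() == iterations;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("both paths finished: " + runBothPaths(10_000));
    }
}
```

If updateAttemptMetrics in this sketch were changed to acquire appLock, the two paths would take the same two locks in opposite orders, which is the inversion listed in 1) and 2) above.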

Please kindly review.


> ResourceManger sometimes become un-responsive
> ---------------------------------------------
>
>                 Key: YARN-2594
>                 URL: https://issues.apache.org/jira/browse/YARN-2594
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karam Singh
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
