hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
Date Thu, 25 Sep 2014 18:00:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148049#comment-14148049 ]

Karthik Kambatla commented on YARN-2594:
----------------------------------------

Thanks for working on this, Wangda. 

As I see it, we could adopt the approach in the current patch. If we do, we should avoid taking
the readLock in the other get methods that access {{RMAppImpl#currentAttempt}}; {{RMAppAttemptImpl}}
should handle the thread-safety of its own fields.
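
To illustrate the point above, here is a minimal sketch, not the actual patch. The class names AppSketch and AttemptSketch are hypothetical stand-ins for {{RMAppImpl}} and {{RMAppAttemptImpl}}, and making the attempt reference volatile is an assumption chosen for illustration. The getter on the app reads the reference without the app's readLock and leaves field-level locking to the attempt.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for RMAppAttemptImpl: it guards its own fields.
class AttemptSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String state = "NEW";

    String getState() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    void setState(String newState) {
        lock.writeLock().lock();
        try {
            state = newState;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

// Hypothetical stand-in for RMAppImpl.
class AppSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Assumption: volatile, so getters can read the reference without the app lock.
    private volatile AttemptSketch currentAttempt;

    void startNewAttempt() {
        lock.writeLock().lock();
        try {
            currentAttempt = new AttemptSketch();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // No app-level readLock here: one volatile read, then the attempt's own locking.
    String getCurrentAttemptState() {
        AttemptSketch attempt = currentAttempt;
        return attempt == null ? null : attempt.getState();
    }
}
```

With this shape, a thread calling the getter never holds the app lock while waiting on the attempt lock, so it cannot take part in a lock-order cycle between the two objects.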

Either in addition to or instead of the current approach, we really need to clean up {{SchedulerApplicationAttempt}}.
Most of the methods there are synchronized, and many of them just call synchronized methods
in {{AppSchedulingInfo}}. Needless to say, this is more involved and we need to be very careful.
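
The pattern described above can be sketched as follows. This is a simplified illustration under assumptions, not code from the Hadoop tree: InfoSketch and SchedulerAttemptSketch are hypothetical stand-ins for {{AppSchedulingInfo}} and {{SchedulerApplicationAttempt}}, and the pending counter is invented for the example.

```java
// Hypothetical stand-in for AppSchedulingInfo: already synchronizes its own state.
class InfoSketch {
    private int pending;

    synchronized void addPending(int n) {
        pending += n;
    }

    synchronized int getPending() {
        return pending;
    }
}

// Hypothetical stand-in for SchedulerApplicationAttempt.
class SchedulerAttemptSketch {
    private final InfoSketch info = new InfoSketch();

    void addPending(int n) {
        info.addPending(n);
    }

    // Nested style: takes this object's monitor first, then InfoSketch's monitor.
    // A thread blocked inside the inner call keeps holding the outer monitor.
    synchronized int getPendingNested() {
        return info.getPending();
    }

    // Delegating style: touches no state of this object, so only the
    // InfoSketch monitor is taken and no second lock enters the ordering.
    int getPendingDelegated() {
        return info.getPending();
    }
}
```

Both getters return the same value. The difference is that the delegating form acquires one monitor instead of two, which removes one edge from the lock-ordering graph. Dropping synchronized is only safe for methods that do nothing but delegate, which is why the cleanup has to be done method by method.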


I am open to adopting the first approach in this JIRA and filing follow-up JIRAs to address
the second approach.

PS: We really need to set up JCarder or a similar tool to identify most of these deadlock paths.
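
As one example of "a similar tool", the JDK's own ThreadMXBean can report deadlocks. This is not JCarder and it works differently: it only reports threads that are already deadlocked at the moment of the call, whereas a lock-order analysis looks for cycles that could deadlock. The class name DeadlockProbe is hypothetical, and the sketch is not wired into the ResourceManager.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; could be called periodically from a monitoring thread.
class DeadlockProbe {
    /** Returns one description per deadlocked thread, or an empty array if none. */
    static String[] describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Covers both monitors and java.util.concurrent locks; null means no deadlock.
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            return new String[0];
        }
        // Ask for locked monitors and synchronizers so the report shows who holds what.
        ThreadInfo[] infos = bean.getThreadInfo(ids, true, true);
        String[] out = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            // An entry is null if the thread exited between the two calls.
            out[i] = infos[i] == null ? "<thread exited>" : infos[i].toString();
        }
        return out;
    }
}
```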


> ResourceManger sometimes become un-responsive
> ---------------------------------------------
>
>                 Key: YARN-2594
>                 URL: https://issues.apache.org/jira/browse/YARN-2594
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karam Singh
>            Assignee: Wangda Tan
>            Priority: Blocker
>         Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
