hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
Date Mon, 14 Apr 2014 18:43:15 GMT

    [ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968653#comment-13968653 ]

Steve Loughran commented on YARN-1929:

I'm +1 on the change to a composite service, as well as on making the serviceXYZ operations
desynchronized (the state entry point in the public method is synchronized to prevent re-entrancy).

I'll leave it to others to look at the remaining code and comment.

Now, there is one little quirk introduced by desynchronizing the serviceStart() and serviceStop()
methods. Although it is still impossible for more than one thread to successfully enter either
method, the following sequence is possible:

Thread 1: service.start()
Thread 1: service.serviceStart() begins

Thread 2: service.stop()
Thread 2: service.serviceStop() begins
Thread 2: service.serviceStop() completes

Thread 1: service start completes

That's because we're not making any attempt to include transitional states; they generally make
things too complex, and that includes handling the problem of "what is the policy if I try
to call stop midway through starting?"
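
The interleaving above can be reproduced with a small standalone sketch. This is not the Hadoop
service code: LifecycleSketch, SketchService, the three-value State enum and the latches are all
hypothetical names made up for illustration, and the sketch assumes the public start()/stop()
methods record the new state under a lock and then run the serviceStart()/serviceStop() hooks
outside it. The latches only exist to force the ordering deterministically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LifecycleSketch {
    // Hypothetical state model: only final states, no "starting"/"stopping".
    enum State { INITED, STARTED, STOPPED }

    static class SketchService {
        private State state = State.INITED;
        final List<String> events = Collections.synchronizedList(new ArrayList<>());
        final CountDownLatch startBegun = new CountDownLatch(1);
        final CountDownLatch stopDone = new CountDownLatch(1);

        public void start() {
            synchronized (this) {             // synchronized state entry point
                if (state != State.INITED) return;
                state = State.STARTED;
            }
            serviceStart();                   // hook runs without the lock (assumption)
        }

        public void stop() {
            synchronized (this) {             // synchronized state entry point
                if (state == State.STOPPED) return;
                state = State.STOPPED;
            }
            serviceStop();                    // hook runs without the lock (assumption)
        }

        void serviceStart() {
            events.add("serviceStart begins");
            startBegun.countDown();
            try {
                stopDone.await();             // simulate a slow start that outlives stop()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            events.add("serviceStart completes");
        }

        void serviceStop() {
            events.add("serviceStop begins");
            events.add("serviceStop completes");
            stopDone.countDown();
        }

        synchronized State getState() { return state; }
    }

    /** Runs start() on one thread and stop() on another; returns the observed events. */
    public static List<String> runSequence() {
        SketchService service = new SketchService();
        Thread t1 = new Thread(service::start, "Thread 1");
        Thread t2 = new Thread(service::stop, "Thread 2");
        try {
            t1.start();
            service.startBegun.await();       // stop() is only called once start is under way
            t2.start();
            t2.join();
            t1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> result = new ArrayList<>(service.events);
        result.add("final state: " + service.getState());
        return result;
    }

    public static void main(String[] args) {
        runSequence().forEach(System.out::println);
    }
}
```

In this sketch the service ends in STOPPED even though serviceStart() finishes last, which is
the kind of outcome a "starting" state plus a policy for stop-during-start would have to address.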

> DeadLock in RM when automatic failover is enabled.
> --------------------------------------------------
>                 Key: YARN-1929
>                 URL: https://issues.apache.org/jira/browse/YARN-1929
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>         Environment: Yarn HA cluster
>            Reporter: Rohith
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-1929-1.patch
> Deadlock detected in RM when automatic failover is enabled.
> {noformat}
> Found one Java-level deadlock:
> =============================
> "Thread-2":
>   waiting to lock monitor 0x00007fb514303cf0 (object 0x00000000ef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector),
>   which is held by "main-EventThread"
> "main-EventThread":
>   waiting to lock monitor 0x00007fb514750a48 (object 0x00000000ef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
>   which is held by "Thread-2"
> {noformat}
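
The dump above shows two threads each holding one monitor and waiting for the other. A minimal
sketch of that lock-order inversion follows, using plain Objects as hypothetical stand-ins for
the ActiveStandbyElector and EmbeddedElectorService monitors; LockOrderSketch and its methods
are invented names, and none of this is the ResourceManager code. It uses the JDK's
ThreadMXBean.findMonitorDeadlockedThreads() to confirm the cycle, and daemon threads so the
JVM can still exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {

    /** Returns true once the JVM reports the two worker threads as deadlocked. */
    public static boolean reproduceDeadlock() {
        // Hypothetical stand-ins for the two monitors named in the dump.
        final Object elector = new Object();
        final Object electorService = new Object();
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // "Thread-2" holds the service monitor and wants the elector monitor.
        Thread a = new Thread(
            () -> lockInOrder(electorService, elector, bothHoldFirst), "Thread-2");
        // "main-EventThread" holds the elector monitor and wants the service monitor.
        Thread b = new Thread(
            () -> lockInOrder(elector, electorService, bothHoldFirst), "main-EventThread");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 200; i++) {       // poll for up to roughly five seconds
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null && ids.length >= 2) {
                return true;
            }
            try {
                Thread.sleep(25);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await();                // wait until both threads hold their first lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (second) {
                // never reached: the other thread holds 'second' and waits for 'first'
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceDeadlock());
    }
}
```

Acquiring the two monitors in the same order on both threads, or not holding one while calling
into code that takes the other, removes the cycle in this sketch.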
