hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3274) Race condition in MR App Master Preemtion can cause a dead lock
Date Thu, 27 Oct 2011 14:38:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137182#comment-13137182 ]

Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

Yes, the JVM thing was a red herring.

The issue is that on the AM the events were processed in the following order:
CONTAINER_REMOTE_LAUNCH
TA_KILL

But on the NM they were processed in the reverse order:
Stop Container Request (Error)
Start Container Request (Success)

The Stop Request was processed 4 ms before the Start Request was.
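
To make the ordering concrete, here is a minimal sketch of the sequence as seen on the NM.
ToyNodeManager and ReorderDemo are made-up names for illustration only, not the real YARN
NodeManager API, and the sketch assumes a stop for an unknown container simply fails.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for a node manager (hypothetical, not the real YARN API). */
class ToyNodeManager {
    private final Set<String> running = new HashSet<>();

    /** Returns false (an error) when the container is not running here. */
    boolean stopContainer(String containerId) {
        return running.remove(containerId);
    }

    /** Returns true when the container was newly started. */
    boolean startContainer(String containerId) {
        return running.add(containerId);
    }

    boolean isRunning(String containerId) {
        return running.contains(containerId);
    }
}

class ReorderDemo {
    /** Replays the order seen on the NM: the stop arrives before the start. */
    static boolean containerLeaked() {
        ToyNodeManager nm = new ToyNodeManager();
        String id = "container_1"; // hypothetical container id

        boolean stopOk = nm.stopContainer(id);   // processed first: error, nothing to stop
        boolean startOk = nm.startContainer(id); // processed second: success

        // The kill was lost, so the container keeps running.
        return !stopOk && startOk && nm.isRunning(id);
    }
}
```

In this toy model ReorderDemo.containerLeaked() returns true: the failed stop is dropped and
nothing ever stops the container afterwards.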


I need to read through the code some more to understand how to handle this.  My gut feeling
is that we need a way to handle an error in a Stop Container Request.  We may need an event
sent back indicating that the TA_KILL failed.  Alternatively, instead of sending the event
back, we could retry the stop a few times before giving up.
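
The retry idea could look roughly like the sketch below. StopRetry, stopWithRetry, and the
attempt count and backoff parameters are all hypothetical; the real stop call is represented
by a BooleanSupplier that returns true on success.

```java
import java.util.function.BooleanSupplier;

class StopRetry {
    /**
     * Calls stopAttempt up to maxAttempts times, sleeping backoffMillis
     * between tries. Returns true as soon as one attempt succeeds and
     * false if every attempt fails or the thread is interrupted.
     */
    static boolean stopWithRetry(BooleanSupplier stopAttempt,
                                 int maxAttempts,
                                 long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (stopAttempt.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt for the caller and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // caller still needs to report that the kill failed
    }
}
```

With a 4 ms gap between the two requests, even a short backoff would likely have let the
second stop attempt land after the start. A false return is the point where an event
reporting the failed TA_KILL would still be needed.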

Also, the container launched and started asking the App Master for something to do.  The
App Master always responded that it had nothing for the container to do.  It might be good
if the code that tells the container what to do knew about killed attempts, so that a
killed attempt that asks for work for some reason is told to die.  This seems like a good
way to prevent this type of error in the future.
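
As a rough sketch of that second idea, assuming the AM keeps a set of killed attempt ids:
ToyTaskListener, Reply, and the method names below are invented for illustration and do not
mirror the actual App Master code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the AM side that answers containers asking for work. */
class ToyTaskListener {
    enum Reply { RUN_TASK, NOTHING_TO_DO, DIE }

    private final Set<String> killedAttempts = ConcurrentHashMap.newKeySet();
    private final Set<String> pendingAttempts = ConcurrentHashMap.newKeySet();

    /** Records that an attempt has been handed to a container launch. */
    void markPending(String attemptId) {
        pendingAttempts.add(attemptId);
    }

    /** Records a kill, even if the container has not reported in yet. */
    void markKilled(String attemptId) {
        pendingAttempts.remove(attemptId);
        killedAttempts.add(attemptId);
    }

    /** Called when a container asks for something to do. */
    Reply getTask(String attemptId) {
        if (killedAttempts.contains(attemptId)) {
            return Reply.DIE; // killed attempt showed up anyway: tell it to exit
        }
        if (pendingAttempts.remove(attemptId)) {
            return Reply.RUN_TASK;
        }
        return Reply.NOTHING_TO_DO;
    }
}
```

The killed set is checked first so that a late container for a killed attempt gets an
explicit answer instead of being told forever that there is nothing to do.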
                
> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to preempting reducers
> to let a mapper run.  In the particular case that I have been debugging a reducer was selected
> for preemption that did not have a container assigned to it yet. When the container became
> available that reduce started running and the previous TA_KILL event appears to have been
> ignored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
