hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3274) Race condition in MR App Master Preemtion can cause a dead lock
Date Thu, 27 Oct 2011 15:46:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137214#comment-13137214 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3274:
----------------------------------------------------

That is one monster of a race!

I think the problem is this: today, ContainerLauncherImpl treats the REMOTE_LAUNCH and
REMOTE_CLEANUP events for the same container as distinct, unrelated events. We need to handle
them as related and act according to whether the container has been launched or not. The
NodeManager fixes in MAPREDUCE-3240 come to mind immediately: they are similar, but deal with
cleaning up container processes on the NM.
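
To make the idea of "related" events concrete, here is a minimal sketch of per-container state tracking. It is not the ContainerLauncherImpl code: the class ContainerLaunchTracker, its State enum, and its method names are hypothetical, and the sketch only assumes that launch and cleanup requests for one container can arrive in either order.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper for illustration; not part of Hadoop.
class ContainerLaunchTracker {
    enum State { LAUNCHED, KILLED_BEFORE_LAUNCH, DONE }

    // One entry per container id; absence means no event has been seen yet.
    private final ConcurrentMap<String, State> states = new ConcurrentHashMap<>();

    /** Handles a launch request. Returns true if the launch should go ahead. */
    boolean onLaunch(String containerId) {
        // Atomic: succeeds only if no cleanup (or earlier launch) was recorded.
        return states.putIfAbsent(containerId, State.LAUNCHED) == null;
    }

    /** Handles a cleanup request. Returns true if a remote stop must be sent. */
    boolean onCleanup(String containerId) {
        boolean[] needsRemoteStop = {false};
        states.compute(containerId, (id, current) -> {
            if (current == null) {
                // Cleanup arrived first: remember it so a late launch is refused.
                return State.KILLED_BEFORE_LAUNCH;
            }
            if (current == State.LAUNCHED) {
                // The container is running, so it has to be stopped remotely.
                needsRemoteStop[0] = true;
                return State.DONE;
            }
            return current; // already killed or done: nothing more to do
        });
        return needsRemoteStop[0];
    }

    State stateOf(String containerId) {
        return states.get(containerId);
    }
}
```

With this shape, a cleanup that overtakes its launch leaves a KILLED_BEFORE_LAUNCH marker, so the later launch request is dropped instead of starting a container that nobody will clean up.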

bq. It might be good if the code that informs the Container what to do could know about killed
attempts and if for some reason they ask for something to do they are told to die.

The infrastructure for doing this is already there. It is supposed to work, if not for bugs
:) See TaskAttemptListenerImpl (+411), which dishes out tasks; it is supposed to ask attempts
to die if it doesn't know them. Three things we can do for this:
 - TaskAttempt should register with TaskAttemptListener even *before* the container is launched.
Today the registration happens only after the container launches.
 - It should register with TaskAttemptListener.taskHeartBeatHandler *after* the container
is launched, so that heartBeatHandler doesn't start counting down before the container
is launched.
 - And of course, fix the obvious bug: send a DIE to the task if it is not registered
with TaskAttemptListener.
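
The three points above can be sketched as follows. This is only an illustration under stated assumptions: AttemptRegistry, Directive, and all method names are hypothetical and are not the TaskAttemptListener API; attempt ids are plain strings and time is passed in by the caller.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry for illustration; not the real TaskAttemptListener.
class AttemptRegistry {
    enum Directive { RUN, DIE }

    // Attempts the application master currently knows about.
    private final Set<String> registered = ConcurrentHashMap.newKeySet();
    // Attempts under heartbeat monitoring, with the time of the last ping.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Point 1: register the attempt before its container is launched. */
    void registerBeforeLaunch(String attemptId) {
        registered.add(attemptId);
    }

    /** Point 2: start the heartbeat countdown only once the container is up. */
    void containerLaunched(String attemptId, long nowMillis) {
        if (registered.contains(attemptId)) {
            lastHeartbeat.put(attemptId, nowMillis);
        }
    }

    /** Called when an attempt is killed, for example by preemption. */
    void unregister(String attemptId) {
        registered.remove(attemptId);
        lastHeartbeat.remove(attemptId);
    }

    /** Point 3: an attempt that is not registered is told to die. */
    Directive onTaskRequest(String attemptId) {
        return registered.contains(attemptId) ? Directive.RUN : Directive.DIE;
    }

    boolean isHeartbeatMonitored(String attemptId) {
        return lastHeartbeat.containsKey(attemptId);
    }
}
```

In this sketch, an attempt killed before its container starts is removed from the registry, so when the late container asks for work it receives DIE rather than a task.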
                
> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to preempting reducers
> to let a mapper run.  In the particular case that I have been debugging a reducer was selected
> for preemption that did not have a container assigned to it yet. When the container became
> available that reduce started running and the previous TA_KILL event appears to have been
> ignored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
