hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
Date Thu, 03 Jan 2013 22:06:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543334#comment-13543334 ]

Siddharth Seth commented on MAPREDUCE-4832:
-------------------------------------------

I was talking to Hitesh offline about this patch. Is it needed at the moment? It seems
possible to avoid multiple AMs by tuning AM_LIVENESS_INTERVAL (10 minutes by default)
and MR_AM_TO_RM_WAIT_INTERVAL_MS (6 minutes by default). A new AM should only be started after
the existing AM is done.
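
As a rough illustration of the timing argument above (not code from the patch), the relationship between the two intervals can be written as a simple check. The class and method names are hypothetical, the values are the defaults quoted in this comment rather than values read from a Hadoop configuration, and the sketch assumes that the old AM gives up once it has failed to reach the RM for the wait interval while the RM expires the AM only after the liveness interval. It ignores clock skew and any delay in when each timer starts.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: not part of Hadoop or of the attached patch.
public class AmTimeoutCheck {
    // Defaults quoted in the comment, hard-coded here for illustration only.
    public static final long AM_LIVENESS_INTERVAL_MS = TimeUnit.MINUTES.toMillis(10);
    public static final long MR_AM_TO_RM_WAIT_INTERVAL_MS = TimeUnit.MINUTES.toMillis(6);

    // True when the old AM should stop waiting for the RM (and exit)
    // strictly before the RM declares it dead and launches a replacement.
    public static boolean oldAmExitsBeforeReplacement(long amToRmWaitMs, long rmLivenessMs) {
        return amToRmWaitMs < rmLivenessMs;
    }

    public static void main(String[] args) {
        System.out.println(oldAmExitsBeforeReplacement(
                MR_AM_TO_RM_WAIT_INTERVAL_MS, AM_LIVENESS_INTERVAL_MS));
    }
}
```

With the quoted defaults the check holds with a four-minute margin; raising the wait interval to or above the liveness interval would remove that margin.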
 
That said, this is definitely an interesting approach to fix the problem.
- Could add a check to ensure the window interval is greater than the AM-RM heartbeat.
- Does getClock() need to be part of the RMHeartbeatHandler? It looks like the AppContext can
provide this. I think a couple of places use the AppContext, while others use the RMHeartbeatHandler.
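
The two suggestions above can be sketched together. This is a minimal sketch under stated assumptions, not the patch's code: `CommitWindow`, `heartbeatSucceeded`, and `mayCommit` are hypothetical names, and a `LongSupplier` stands in for whichever single clock source (for example the one the AppContext provides) the AM ends up using. The constructor rejects a window that is not greater than the AM-RM heartbeat interval, and every time reading goes through the one injected clock.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: names and structure are illustrative, not taken from the patch.
public class CommitWindow {
    private final LongSupplier clock;        // single injected time source, in milliseconds
    private final long windowMs;             // how recent the last RM heartbeat must be
    private volatile long lastHeartbeatMs = -1L; // -1 means no successful heartbeat yet

    public CommitWindow(LongSupplier clock, long windowMs, long heartbeatIntervalMs) {
        // A window no longer than one heartbeat interval could never be
        // reliably satisfied, so reject it up front.
        if (windowMs <= heartbeatIntervalMs) {
            throw new IllegalArgumentException(
                "window (" + windowMs + " ms) must be greater than the heartbeat interval ("
                + heartbeatIntervalMs + " ms)");
        }
        this.clock = clock;
        this.windowMs = windowMs;
    }

    // Record a successful heartbeat to the RM.
    public void heartbeatSucceeded() {
        lastHeartbeatMs = clock.getAsLong();
    }

    // Allow a commit only if the RM was reached within the window.
    public boolean mayCommit() {
        long last = lastHeartbeatMs;
        return last >= 0 && clock.getAsLong() - last <= windowMs;
    }
}
```

Passing the clock in as a parameter means the handler and the rest of the AM read time from the same source, and it makes the window logic testable with a fake clock.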

Recovery and restart are still WIP. I believe MR_AM_TO_RM_WAIT_INTERVAL_MS will need
to be looked at again in the context of recovery. Would this patch, or a sync via HDFS, be
more useful at that point?
                
> MR AM can get in a split brain situation
> ----------------------------------------
>
>                 Key: MAPREDUCE-4832
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has gone down
> and launches a replacement, but the previous AM is still up and running. If the previous
> AM does not need any more resources from the RM it could try to commit either tasks or jobs.
> This could cause lots of problems where the second AM finishes and tries to commit too.
> This could result in data corruption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
