hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5718) MR AM should tolerate RM restart/failover during commit
Date Wed, 15 Jan 2014 15:37:21 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872189#comment-13872189 ]

Jason Lowe commented on MAPREDUCE-5718:

This is closely related to MAPREDUCE-5485.  The problem here is that the output committer
is user-pluggable code, and we can't assume what it does or if it can be safely restarted
after crashing mid-way through the commit.  This is one of the reasons job commits are not
retried by the AM, and by extension we can't assume it's safe to retry in another AM attempt.
That's why the AM goes out of its way to indicate via a file that it's starting to do the
job commit and avoids repeating it on an AM restart if that file is still present.  Whether
the retry is because the AM crashed or because the AM was restarted due to an RM restart,
the end effect is the same -- it's not safe to retry a job commit in the general case.
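
To make the marker-file behavior described above concrete, here is a minimal sketch. It is not the MR AM's actual code: the class name, the `Outcome` enum, the marker file names, and the use of a local directory in place of the job staging directory are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; names are illustrative, not Hadoop's.
class CommitMarkerSketch {
    enum Outcome { COMMITTED, ALREADY_COMMITTED, UNSAFE_TO_RETRY }

    // Runs the user-supplied commit at most once per staging directory.
    static Outcome runJobCommit(Path stagingDir, Runnable commit) {
        Path started = stagingDir.resolve("COMMIT_STARTED");     // assumed marker name
        Path succeeded = stagingDir.resolve("COMMIT_SUCCESS");   // assumed marker name
        try {
            if (Files.exists(succeeded)) {
                return Outcome.ALREADY_COMMITTED;   // an earlier attempt finished the commit
            }
            if (Files.exists(started)) {
                // An earlier attempt began the commit and never recorded success.
                // The committer is opaque user code, so do not run it again.
                return Outcome.UNSAFE_TO_RETRY;
            }
            Files.createFile(started);   // record intent before touching the output
            commit.run();                // if this throws, the start marker stays behind
            Files.createFile(succeeded);
            return Outcome.COMMITTED;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path clean = Files.createTempDirectory("staging");
        System.out.println(runJobCommit(clean, () -> {}));   // first attempt commits
        System.out.println(runJobCommit(clean, () -> {}));   // later attempt sees success marker

        Path crashed = Files.createTempDirectory("staging");
        try {
            runJobCommit(crashed, () -> { throw new IllegalStateException("simulated crash"); });
        } catch (IllegalStateException expected) {
            // the simulated first attempt died mid-commit
        }
        System.out.println(runJobCommit(crashed, () -> {})); // next attempt refuses to retry
    }
}
```

The sketch does not distinguish why the previous attempt went away. An AM crash and an AM restart triggered by RM failover both leave only the start marker behind, which is why the new attempt gives up in both cases.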

If we had an API by which the output committer could tell the AM whether it's safe to retry
a job commit, that would help.
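
As a rough illustration of what such an API might look like, here is a hypothetical sketch. The `Committer` interface, the `isJobCommitRetrySafe` method, and the `shouldRunCommit` helper are invented for this example and are not part of any Hadoop release discussed in this thread. The default answer is "not safe", which keeps today's behavior for committers that don't opt in.

```java
// Hypothetical sketch; none of these names come from Hadoop.
class RetrySafeCommitSketch {
    interface Committer {
        void commitJob();

        // Assumed opt-in hook: committers whose job commit can be repeated
        // after a partial run override this to return true.
        default boolean isJobCommitRetrySafe() {
            return false;
        }
    }

    // Decides whether a new AM attempt may run the job commit.
    static boolean shouldRunCommit(boolean previousAttemptStartedCommit, Committer committer) {
        if (!previousAttemptStartedCommit) {
            return true;   // nothing was started, so this is not a retry
        }
        return committer.isJobCommitRetrySafe();
    }

    public static void main(String[] args) {
        Committer opaque = () -> {};   // keeps the conservative default
        Committer repeatable = new Committer() {
            @Override public void commitJob() {}
            @Override public boolean isJobCommitRetrySafe() { return true; }
        };
        System.out.println(shouldRunCommit(true, opaque));      // false
        System.out.println(shouldRunCommit(true, repeatable));  // true
        System.out.println(shouldRunCommit(false, opaque));     // true
    }
}
```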

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
> While testing RM HA, we ran into this issue where if the RM fails over while an MR AM
> is in the middle of a commit, the subsequent AM gets spawned but dies with a diagnostic message
> - "We crashed durring a commit".

This message was sent by Atlassian JIRA
