hadoop-mapreduce-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5718) MR AM should tolerate RM restart/failover during commit
Date Tue, 14 Jan 2014 17:44:53 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated MAPREDUCE-5718:

    Attachment: mr-5718-0.patch

First-cut patch that deletes the startCommitFile if the commit is interrupted. 

However, if two AMs are running at the same time during a network partition, one AM can end
up deleting the startCommitFile created by the other. To avoid that race, we might have to
complicate this a little more.

How about adding a .host.pid suffix to the name of the commit file? Each AM would write its
own file. When a subsequent AM comes up and verifies the state of the commit from previous AMs,
it would look for any file with such a suffix. [~vinodkv], [~revans2] - thoughts?
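
To make the proposal concrete, here is a rough sketch of the per-AM marker idea. It is not the attached patch and not the actual MR AM code. It runs against the local filesystem through java.nio rather than the Hadoop FileSystem API, and the class name, the method names and the COMMIT_STARTED prefix are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-AM commit markers; not the MR AM implementation. */
final class CommitMarkerSketch {
    // Assumed base name for the marker file; the real name may differ.
    static final String PREFIX = "COMMIT_STARTED";

    /** Builds a marker name of the form PREFIX.host.pid. */
    static String markerName(String host, long pid) {
        return PREFIX + "." + host + "." + pid;
    }

    /** Each AM writes its own marker; throws if that exact file already exists. */
    static Path startCommit(Path stagingDir, String host, long pid) throws IOException {
        return Files.createFile(stagingDir.resolve(markerName(host, pid)));
    }

    /** On an interrupted commit, an AM deletes only the marker it wrote itself. */
    static boolean abandonCommit(Path stagingDir, String host, long pid) throws IOException {
        return Files.deleteIfExists(stagingDir.resolve(markerName(host, pid)));
    }

    /** A subsequent AM looks for a marker left behind by any previous AM. */
    static List<Path> findMarkers(Path stagingDir) throws IOException {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(stagingDir, PREFIX + ".*")) {
            for (Path p : stream) {
                found.add(p);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        startCommit(staging, "hostA", 101L);   // first AM starts its commit
        startCommit(staging, "hostB", 202L);   // second AM, other side of the partition
        abandonCommit(staging, "hostA", 101L); // first AM is interrupted and cleans up
        System.out.println("Markers still present: " + findMarkers(staging));
    }
}
```

Since the host and pid are part of the file name, an interrupted AM can only remove its own marker, so the delete no longer races with a second AM. What a new AM should do when it finds a marker from a previous AM is left open here.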

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
> While testing RM HA, we ran into this issue where if the RM fails over while an MR AM
> is in the middle of a commit, the subsequent AM gets spawned but dies with a diagnostic message
> - "We crashed durring a commit".

This message was sent by Atlassian JIRA
