hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit
Date Thu, 29 Nov 2012 08:52:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506328#comment-13506328 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:

Some comments on the patch:
 - Similar to JobCommitFailedEvent, add an event class for JOB_COMMIT_COMPLETED.
 - JobImpl.checkJobCompleteSuccess() and the corresponding return variables should be renamed
to mean checkIfJobReadyForCommit(). Similarly for checkJobForCompletion(job).
 - For now, we may just be addressing MAPREDUCE-4815, but the same argument of the committer
being arbitrary user code is valid for other calls like abortJob and setupJob too. We will need
states capturing those calls, and we will need to run them on separate threads so that the
dispatcher isn't blocked. We can do that later, but to be future-proof, let's move the
committer thread to a top-level service à la TaskCleaner. We may even repurpose TaskCleanerImpl
for this. Scope the effort and split it as you see fit.
 - Interrupting and joining the commit thread is only meaningful in the case of kill-during-commit,
so let's move that code there. Also, earlier, we never supported kill-during-commit, but now
we do, and the patch is putting a 60-second upper bound on commitJob() before abortJob(). Comparing
this with 1.*, we do allow kill-during-commit there, as commit happens in a separate JVM. So interrupt
and join seems fine; let's just put in a config so that we can tweak it if there is ever a need.
 - The test looks good. Can you extend it to include kill-during-commit too? That will also
validate that the dispatcher isn't blocked anymore because of a long commit.
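
To make the interrupt-and-join suggestion concrete, here is a minimal sketch of what kill-during-commit could look like with the commit on its own thread. It is not the patch's code: the class CommitThreadSketch, the nested Committer interface, and the killWaitMillis setting are all hypothetical stand-ins, and the real change would go through JobImpl states and events rather than a bare Thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CommitThreadSketch {

    /** Hypothetical stand-in for the user-supplied output committer. */
    public interface Committer {
        void commitJob() throws InterruptedException;
        void abortJob();
    }

    private final Committer committer;
    // Hypothetical configurable wait (must be > 0; join(0) would wait forever).
    private final long killWaitMillis;
    private final AtomicBoolean committed = new AtomicBoolean(false);
    private final Thread commitThread;

    public CommitThreadSketch(Committer committer, long killWaitMillis) {
        this.committer = committer;
        this.killWaitMillis = killWaitMillis;
        // The commit runs off the dispatcher thread, so the dispatcher stays free.
        this.commitThread = new Thread(() -> {
            try {
                committer.commitJob();
                committed.set(true); // a real AM would raise a commit-completed event here
            } catch (InterruptedException e) {
                // Interrupted by a kill; the killing thread decides what happens next.
            }
        }, "commit-thread");
    }

    public void startCommit() {
        commitThread.start();
    }

    /**
     * Kill-during-commit: interrupt the commit thread, wait up to
     * killWaitMillis for it to exit, then abort unless the commit finished.
     * Returns true if abortJob() was called.
     */
    public boolean killDuringCommit() throws InterruptedException {
        commitThread.interrupt();
        commitThread.join(killWaitMillis);
        if (committed.get()) {
            return false;
        }
        committer.abortJob();
        return true;
    }

    /** Demo helper: starts a slow commit, kills it, and reports whether abort ran. */
    public static boolean runKillDuringCommit(long commitMillis, long killWaitMillis)
            throws InterruptedException {
        AtomicBoolean aborted = new AtomicBoolean(false);
        Committer slow = new Committer() {
            @Override public void commitJob() throws InterruptedException {
                Thread.sleep(commitMillis); // simulates a long-running commit
            }
            @Override public void abortJob() {
                aborted.set(true);
            }
        };
        CommitThreadSketch job = new CommitThreadSketch(slow, killWaitMillis);
        job.startCommit();
        job.killDuringCommit();
        return aborted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("abort called: " + runKillDuringCommit(10_000, 1_000));
    }
}
```

Note that a committer which ignores interrupts would still be running when the wait expires and abortJob() is called, which is why the wait is worth making configurable.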
> AM timing out during job commit
> -------------------------------
>                 Key: MAPREDUCE-4813
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
> The AM calls the output committer's {{commitJob}} method synchronously during JobImpl
> state transitions, which means the JobImpl write lock is held the entire time the job is being
> committed.  Holding the write lock prevents the RM allocator thread from heartbeating to the
> RM.  Therefore, if committing the job takes too long (e.g. the job has tons of files to commit
> and/or the namenode is bogged down), then the AM appears to be unresponsive to the RM and the
> RM kills the AM attempt.
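
The blocking described in the issue can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: it assumes the heartbeat path needs the job's read lock, and the class and method names (WriteLockHeartbeatSketch, heartbeatSucceedsWhileCommitting) are hypothetical and not taken from the AM code. One thread simulates a commit, optionally while holding the write lock, and the calling thread plays the heartbeat by trying to take the read lock with a short timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockHeartbeatSketch {

    /**
     * Returns true if a simulated heartbeat can take the read lock while a
     * simulated commit is in progress on another thread.
     */
    public static boolean heartbeatSucceedsWhileCommitting(boolean commitHoldsWriteLock)
            throws InterruptedException {
        ReentrantReadWriteLock jobLock = new ReentrantReadWriteLock();
        CountDownLatch commitStarted = new CountDownLatch(1);
        CountDownLatch commitMayFinish = new CountDownLatch(1);

        Thread commitThread = new Thread(() -> {
            if (commitHoldsWriteLock) {
                jobLock.writeLock().lock(); // synchronous commit inside a state transition
            }
            try {
                commitStarted.countDown();
                commitMayFinish.await();    // stands in for a slow commitJob()
            } catch (InterruptedException ignored) {
                // nothing to clean up in this sketch
            } finally {
                if (commitHoldsWriteLock) {
                    jobLock.writeLock().unlock();
                }
            }
        }, "commit");
        commitThread.start();
        commitStarted.await();

        // Assumption: the heartbeat needs the read lock to inspect job state.
        boolean gotLock = jobLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (gotLock) {
            jobLock.readLock().unlock();
        }

        commitMayFinish.countDown();
        commitThread.join();
        return gotLock;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("commit under write lock: heartbeat ok = "
                + heartbeatSucceedsWhileCommitting(true));
        System.out.println("commit off the lock:     heartbeat ok = "
                + heartbeatSucceedsWhileCommitting(false));
    }
}
```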

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
