hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
Date Thu, 03 Jan 2013 06:50:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542748#comment-13542748

Bikas Saha commented on MAPREDUCE-4819:

Looks like after the recent changes in JobImpl and the current alternative approach my original
fix for not rerunning the job does not really apply. I think you would want to take the changes
in my patch that adds the jobid to the history staging dir. Since the staging dir is not deleted
during job history flushing, I had observed that if I made my AM crash (by putting an exit(1)
in shutdownJob() then the history files would get orphaned and not cleaned up. Or something
like that. And to fix that I had to add the jobid to the path.
Snippet from my patch.
+++ hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/jobhistory/JobHistoryUtils.java
@@ -186,10 +186,11 @@ public static PathFilter getHistoryFileFilter() {
    * @return A string representation of the prefix.
   public static String
-      getConfiguredHistoryStagingDirPrefix(Configuration conf)
+      getConfiguredHistoryStagingDirPrefix(Configuration conf, String jobId)
           throws IOException {
     String user = UserGroupInformation.getCurrentUser().getShortUserName();
-    Path path = MRApps.getStagingAreaDir(conf, user);
+    Path stagingPath = MRApps.getStagingAreaDir(conf, user);
+    Path path = new Path(stagingPath, jobId);
     String logDir = path.toString();
     return logDir;

For the patch itself I have a few comments

Why not end in success if the staging dir was cleaned up by the last attempt? I am guessing
that this code wont be necessary after we move the unregister to RM before the staging dir
cleanup in MAPREDUCE-4841, right?
+      if(!stagingExists) {
+        copyHistory = false;
+        isLastAMRetry = true;
+        justShutDown = true;
+        shouldNotify = false;
+        forcedState = JobStateInternal.ERROR;
+        shutDownMessage = "Staging dir does not exist " + stagingDir;
+        LOG.fatal(shutDownMessage);

Why are we only eating/ignoring the JobEvents in the dispatcher? So that the JobImpl state
machine is not triggered?

This might be a question of personal preference. I think an explicit transition to from the
INIT to final state is cleaner than overriding the state in the getter.
   public JobStateInternal getInternalState() {
     try {
+      if(forcedState != null) {
+        return forcedState;
+      }

Didnt quite get this in HistoryFileManager.java. Looks like it related to a recent change
in that code.
+      } else if (old != null && !old.isMovePending()) {
+        //This is a duplicate so just delete it
+        fileInfo.delete();

+        throw new Exception("No handler for regitered for " + type);
+      }

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch,
MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
> If the AM reports final job status to the client but then crashes before unregistering
with the RM then the RM can run another AM attempt.  Currently AM re-attempts assume that
the previous attempts did not reach a final job state, and that causes the job to rerun (from
scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is
bad for a number of reasons.  If the job failed, it's confusing at best since the client was
already told the job failed but the subsequent attempt could succeed.  If the job succeeded
there could be data loss, as a subsequent job launched by the client tries to consume the
job's output as input just as the re-attempt starts removing output files in preparation for
the output commit.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message