hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1121) Recovering running/scheduled jobs after JobTracker failure
Date Mon, 14 May 2007 03:10:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495450 ]

Alejandro Abdelnur commented on HADOOP-1121:

* On Owen's #1 comment, startup flag

Yes, recoverability is a job property set at scheduling time, but its default is FALSE.

There is another JT property to disable recovering all jobs at startup. Its default is currently
FALSE; it should be TRUE. It should be there just for admin purposes, for when the admin
wants to do a fresh start regardless (as in databases, forcing a rollback or commit
of what is in the logs).
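
As a rough sketch of how the two switches combine (the property names below are made up for illustration, they are not the ones in the patch): a job is recovered only if it was submitted as recoverable and the JT-level switch has not disabled recovery.

```java
import java.util.Properties;

/** Illustrative only; these property names are hypothetical, not from patch1121.txt. */
final class RecoveryFlags {
    static final String JOB_RECOVERABLE = "example.job.recoverable";
    static final String JT_RECOVERY_DISABLED = "example.jobtracker.recovery.disabled";

    /** True only when the job opted in and the JobTracker has not disabled recovery. */
    static boolean shouldRecover(Properties jobConf, Properties trackerConf) {
        // Job-level flag: defaults to false, as described above.
        boolean jobRecoverable =
            Boolean.parseBoolean(jobConf.getProperty(JOB_RECOVERABLE, "false"));
        // JT-level flag: "false" mirrors the current default mentioned above.
        boolean recoveryDisabled =
            Boolean.parseBoolean(trackerConf.getProperty(JT_RECOVERY_DISABLED, "false"));
        return jobRecoverable && !recoveryDisabled;
    }
}
```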

* On Owen's #2 comment, output directory deletion

I may be missing something, but on task (a single map or reduce) failure, when the JT restarts
the failed task, it cleans up whatever output the failed task produced. Right?

* On Doug's and Owen's #3 comment, InputFormat and OutputFormat

I'll have to look into this, not sure what you mean.

* On Owen's #4 comment, on original job id

Jobs scheduled with autorecovery on have a job ID of the form 'job_TIMESTAMP_####', for example
'job_20070511075754_0002', where TIMESTAMP is the time, to the second, at which the JT was started.

This uniqueness serves two purposes:

1. There are no job ID collisions.
2. Systems tracking jobs can find the status of a job ID recovered after a failure.
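
To illustrate the ID format, here is a minimal sketch of a generator. The class name, the explicit time zone parameter, and the counter starting at 0001 are assumptions made for the example; the patch may do this differently.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only; a hypothetical class, not the one in patch1121.txt. */
final class RecoverableJobIds {
    private final String startStamp;  // JT start time, down to the second
    private final AtomicInteger counter = new AtomicInteger(0);

    RecoverableJobIds(Date jobTrackerStart, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        fmt.setTimeZone(zone);
        this.startStamp = fmt.format(jobTrackerStart);
    }

    /** Returns the next ID in the form job_TIMESTAMP_####. */
    String next() {
        // Assumption: the per-JT counter starts at 1 and is zero-padded to 4 digits.
        return String.format(Locale.ROOT, "job_%s_%04d",
                             startStamp, counter.incrementAndGet());
    }
}
```

Because the timestamp is fixed at JT startup, IDs handed out after a restart cannot collide with IDs from the previous run unless both runs started within the same second.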

> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>                 Key: HADOOP-1121
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>         Attachments: patch1121.txt
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If the JobTracker
goes down, all the running/scheduled jobs have to be resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration (job.xml) in a
jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job configuration
from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will have all running/scheduled jobs
at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job config files.
If there are none, no failure happened on the last stop and there is nothing to be done. If
there are job config files in the jobs DFS directory, continue with the following recovery steps:
> (A) rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) for each $JOB_CONFIG_FILE.recover: delete the output directory if it exists, schedule
the job using the original job ID, and delete the $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE
will be there after scheduling, per step 1).
> (C) when B is completed start accepting new job submissions.
> Other details:
> A configuration flag would enable/disable the above behavior. If switched off (the default
behavior), none of the above happens.
> A startup flag could switch off job recovery for systems with recovery set to ON.
> Changes to the job ID generation should be put in place to avoid job ID collisions with
job IDs from previous failed runs; for example, appending a JT startup timestamp to the job
IDs would do.
> Further improvements on top of this one:
> This mechanism would allow having a JobTracker node in standby, to be started in case
of main JobTracker failure.
Making things a little more comprehensive, the backup JobTrackers could be running in warm
mode, and heartbeats and ping calls among them would activate a warm standby JobTracker as
the new main JobTracker. Together with an enhancement in the JobClient (keeping a list of backup
JobTracker URLs), this would enable client fallback to backup JobTrackers.
> State about partially run jobs could be kept: tasks completed/in-progress/pending. This
would make it possible to recover jobs halfway instead of restarting them.
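
Steps (4)(A) to (4)(C) of the quoted proposal can be sketched roughly as follows. This is only an illustration: it uses java.nio.file on a local directory as a stand-in for the jobs DFS directory, the Hooks interface is a hypothetical placeholder for whatever the JobTracker uses to find a job's output directory and to schedule it, and it does not handle a crash in the middle of recovery.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only; a local-filesystem stand-in for the jobs DFS directory. */
final class JobRecoverySketch {
    static final String SUFFIX = ".recover";

    /** Hypothetical hooks into the JobTracker; not part of any Hadoop API. */
    interface Hooks {
        Path outputDirOf(Path jobConfig) throws IOException;
        void schedule(String jobId, Path jobConfig) throws IOException;
    }

    /** Runs steps (A) and (B); the caller accepts new submissions once this returns (C). */
    static List<String> recover(Path jobsDir, Hooks hooks) throws IOException {
        // Step (4): list the job config files. The file name is the job ID (step 1).
        List<Path> configs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(jobsDir)) {
            for (Path p : stream) {
                configs.add(p);
            }
        }
        Collections.sort(configs);  // deterministic order; an empty list means nothing to do

        // (A) rename every job config file to <jobId>.recover
        List<Path> pending = new ArrayList<>();
        for (Path config : configs) {
            Path renamed = config.resolveSibling(config.getFileName() + SUFFIX);
            pending.add(Files.move(config, renamed));
        }

        // (B) per file: delete old output, reschedule under the original ID, drop the file
        List<String> recovered = new ArrayList<>();
        for (Path recoverFile : pending) {
            String name = recoverFile.getFileName().toString();
            String jobId = name.substring(0, name.length() - SUFFIX.length());
            deleteRecursively(hooks.outputDirOf(recoverFile));
            hooks.schedule(jobId, recoverFile);
            Files.delete(recoverFile);
            recovered.add(jobId);
        }
        return recovered;
    }

    /** Deletes dir and everything under it; does nothing if dir does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Renaming everything first, before scheduling anything, keeps the files that await recovery distinguishable from the fresh config files that scheduling writes back under the same job IDs.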

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
