Message-ID: <10872890.1178908996003.JavaMail.jira@brutus>
Date: Fri, 11 May 2007 11:43:16 -0700 (PDT)
From: "Owen O'Malley (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1121) Recovering running/scheduled jobs after JobTracker failure
In-Reply-To: <13233076.1173942849274.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495128 ]

Owen O'Malley commented on HADOOP-1121:
---------------------------------------

I'm not happy with the design of the patch.

1. Recoverability should be a property of the job, period. The config variable to disable it at startup in the JobTracker would be very error-prone.

2. No output directories should be deleted by the framework. Doing otherwise is inconsistent with the rest of the framework.

3. As Doug pointed out, the framework should use the InputFormat and OutputFormat to check the preconditions of the job. This implies that they will load the user's job.jar, and so should probably be done in a separate process.

4. Using the original job id is _not_ a good idea unless you block collisions in the job names. It would help a lot if we had objects for ids instead of strings. As a short-term solution, I'd suggest that you set the next job id to one higher than the last recovered job's id.

> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
>                 Key: HADOOP-1121
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: patch1121.txt
>
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If the JobTracker goes down, all running/scheduled jobs have to be resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration (job.xml) in a jobs DFS directory, using the jobID as the name.
> (2) On job completion (success, failure, kill) it would delete the job configuration from the jobs DFS directory.
> (3) On JobTracker failure, the jobs DFS directory will contain all jobs running/scheduled at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job config files. If there are none, no failure happened on the last stop and there is nothing to be done.
> If there are job config files in the jobs DFS directory, continue with the following recovery steps:
> (A) Rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) For each $JOB_CONFIG_FILE.recover: delete the output directory if it exists, schedule the job using the original job ID, then delete the $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will be created by the scheduling, per step #1).
> (C) When (B) is completed, start accepting new job submissions.
> Other details:
> A configuration flag would enable/disable the above behavior; if switched off (the default), none of the above happens.
> A startup flag could switch off job recovery for systems with recovery set to ON.
> Changes to job ID generation should be put in place to avoid collisions with job IDs from previous failed runs; for example, appending a JobTracker startup timestamp to the job IDs would do.
> Further improvements on top of this one:
> This mechanism would allow a standby JobTracker node to be started in case of main JobTracker failure. Going a step further, the backup JobTrackers could run in warm mode, with heartbeats and ping calls among them activating a warm standby as the new main JobTracker. Together with an enhancement in the JobClient (keeping a list of backup JobTracker URLs), this would enable client fallback to backup JobTrackers.
> State about partially run jobs (tasks completed/in-progress/pending) could be kept. This would make it possible to recover jobs halfway through instead of restarting them.

-- 
This message is automatically generated by JIRA.
- 
You can reply to this email to add a comment to the issue online.
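The recovery scan in steps (A)-(C) of the quoted proposal could be sketched roughly as below. This is a minimal, hypothetical illustration only: it uses the local filesystem in place of DFS, the class and method names (JobRecoveryScan, markForRecovery, recover) are made up for this sketch and are not real Hadoop APIs, and resubmission is stubbed out as a comment.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of the proposed recovery scan (steps A-C),
// standing in for a JobTracker restart. Not actual Hadoop code.
public class JobRecoveryScan {

    // Step (A): rename every saved job config file to $JOB_CONFIG_FILE.recover.
    static List<Path> markForRecovery(Path jobsDir) throws IOException {
        List<Path> marked = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(jobsDir, "*.xml")) {
            for (Path conf : ds) {
                Path recover = conf.resolveSibling(conf.getFileName() + ".recover");
                Files.move(conf, recover);
                marked.add(recover);
            }
        }
        return marked;
    }

    // Steps (B)-(C): resubmit each marked job, then delete the marker.
    // Resubmission is stubbed out; a real JobTracker would reschedule the job,
    // which re-creates a fresh $JOB_CONFIG_FILE per step (1) of the proposal.
    static int recover(Path jobsDir) throws IOException {
        int resubmitted = 0;
        for (Path recover : markForRecovery(jobsDir)) {
            // resubmit(recover);  // stub: schedule the job from its saved job.xml
            Files.delete(recover);
            resubmitted++;
        }
        return resubmitted;  // step (C): caller may now accept new submissions
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jobs");
        Files.createFile(dir.resolve("job_0001.xml"));
        Files.createFile(dir.resolve("job_0002.xml"));
        System.out.println("recovered " + recover(dir) + " jobs");  // prints "recovered 2 jobs"
    }
}
```

Note that the rename-to-.recover marker is what makes the scan restartable: if the JobTracker crashes again mid-recovery, any remaining .recover files identify jobs not yet rescheduled, while freshly re-created config files are picked up by the normal step (1) path.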