Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <7937461.1157134523552.JavaMail.jira@brutus>
Date: Fri, 1 Sep 2006 11:15:23 -0700 (PDT)
From: "Bryan Pendleton (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-223) Should be able to re-run jobs,
 collecting only missing output
In-Reply-To: <30650123.1147803065900.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ http://issues.apache.org/jira/browse/HADOOP-223?page=comments#action_12432196 ] 
            
Bryan Pendleton commented on HADOOP-223:
----------------------------------------

This issue still plagues me. I have a large enough dataset, that processing through it and storing temporary outputs often causes many failures. I still have my local fix for re-running (very cumbersome). I'd love to see a general solution supported.

What I currently do is to inject a special combiner, that only produces output for partitions that didn't complete in a previous run. Thus, if 20% of my job finishes (completes writing reduce output to DFS), I need 20% less working space to run the subsequent job attempt. Usually, this is enough progress.

However, that suffers from a lot of limits:
1) There's no auto-magic storage of what partitions were generated.
2) This loses the ability to have a Combiner doing something else
3) It also requires a new output directory, and hand-reassembly of the completed tasks.

It seems like it should be solvable more generally by:
1) Storing the partitions that are generated when a job is initially run.
2) Keeping track of what partitions have completed.
3) Allowing a job to be re-submitted, using the previous partitions (and not failing on the output existance check)

Anyone else run into these issues? I'd be happy to write up how I actually work around the problem now, but it seems like a more general solution would be helpful to a wider audience.

> Should be able to re-run jobs, collecting only missing output
> -------------------------------------------------------------
>
>                 Key: HADOOP-223
>                 URL: http://issues.apache.org/jira/browse/HADOOP-223
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Bryan Pendleton
>            Priority: Minor
>
> For jobs with no side effects (roughly == jobs with speculative execution enabled), if partial output has been generated, it should be possible to re-run the job, and fill in the missing pieces. I have now run the same job twice, once finishing 42 of 44 reduce tasks, another time finishing only 17. Each time, many nodes have failed, causing many many tasks to fail ( in one case, 5k failures from 15k map tasks, 23 failures from 44 reduces), but some valid output was generated. Since the output is only dependent on the input, and both jobs used the same input, I will now be able to combine these two failed task outputs to get a completed job's output. This should be something that can be more automatic.
> In particular, it should be possible to resubmit a job, with a list of partitions that should be ignored. A special Combiner, or pre-Combiner, would throw out any map output for partitions that have already been successfully completed, thus reducing the amount of data that needs to be reduced to complete the job. It would, of course, be nice to support "filling in" existing outputs, rather than having to do a move operation on completed outputs.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira