hadoop-mapreduce-user mailing list archives

From Samir Eljazovic <samir.eljazo...@gmail.com>
Subject Re: Is there a way to re-use the output of job which was killed?
Date Thu, 01 Dec 2011 00:08:34 GMT
Hi Harsh,
Thanks for the answer. This is the same approach I was thinking of. Let me
give you some more details about the problem and my proposed solution.

The problem I'm trying to solve can be defined as "restarting a job that
was killed without re-processing data". As far as I know, this is not
supported off the shelf by Hadoop, so it requires additional coding and
configuration changes to make it work.

As you mentioned earlier, the first step would be to keep the first job's
output in the output directory by overriding the OutputFormat's
checkOutputSpecs method.

The next step is to keep track of the input splits that were successfully
processed by the first job. The idea I have is to create an empty file in
the task's working directory, named "InputFileName+Offset+Length" (the
result of FileSplit.toString()), in the mapper's setup() method. When the
task completes successfully, the output files and this "meta" file would be
copied to the output directory. This way we would know which input splits
were successfully completed by the first job.
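
To make the naming concrete, here is a minimal sketch of the "meta" file
step. It uses plain java.nio as a stand-in for the Hadoop FileSystem API,
and the names SplitMarkers, markerName, publish and the "_done_" prefix are
hypothetical. FileSplit.toString() yields "path:start+length", and the path
part contains '/' and ':', so the sketch encodes the string before using it
as a file name. The leading underscore is there because FileInputFormat
treats names starting with "_" or "." as hidden, which should keep the
markers from being read as data if the output directory later becomes the
input of another job.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: java.nio stands in for the Hadoop FileSystem API.
final class SplitMarkers {
    // Assumed prefix; a leading underscore marks the file as hidden to FileInputFormat.
    static final String PREFIX = "_done_";

    // Builds a file-name-safe marker from the same fields FileSplit.toString() prints.
    static String markerName(String file, long start, long length) {
        String splitId = file + ":" + start + "+" + length;
        return PREFIX + URLEncoder.encode(splitId, StandardCharsets.UTF_8);
    }

    // Creates the empty marker in the output directory; safe to call twice.
    static Path publish(Path outputDir, String file, long start, long length)
            throws IOException {
        Files.createDirectories(outputDir);
        Path marker = outputDir.resolve(markerName(file, start, length));
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
        return marker;
    }
}
```

In a real job the marker would have to be published together with the
task's output, so that a marker never exists without its data.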

Then we have to change FileInputFormat (or RecordReader?) to check whether
the "meta" file exists in the output directory, so it knows that a
particular input split was already processed and can be skipped. This way
we can re-use the output of all successful tasks from the first job.
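
A rough sketch of that check, again with java.nio standing in for the
Hadoop FileSystem API and with hypothetical names (PendingSplits, pending,
the "_done_" prefix and the URL-encoded marker name are all assumptions).
Splits are represented by their FileSplit.toString() form. In Hadoop the
same filtering could presumably be done in a FileInputFormat subclass by
overriding getSplits(JobContext) and dropping finished splits from the list
returned by super.getSplits(), which would avoid launching map tasks for
them at all.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper: keeps only the splits that have no marker file yet.
final class PendingSplits {
    static final String PREFIX = "_done_"; // assumed marker prefix

    static List<String> pending(List<String> splitIds, Path outputDir)
            throws IOException {
        // List the output directory once instead of probing it per split.
        Set<String> markers = new HashSet<>();
        if (Files.isDirectory(outputDir)) {
            try (Stream<Path> files = Files.list(outputDir)) {
                files.map(p -> p.getFileName().toString())
                     .filter(n -> n.startsWith(PREFIX))
                     .forEach(markers::add);
            }
        }
        List<String> todo = new ArrayList<>();
        for (String id : splitIds) {
            String marker = PREFIX + URLEncoder.encode(id, StandardCharsets.UTF_8);
            if (!markers.contains(marker)) {
                todo.add(id); // no marker: this split still has to run
            }
        }
        return todo;
    }
}
```

This only works if the second run computes exactly the same splits as the
first one (same input files and split size settings); otherwise the marker
names will not match.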

Taking this idea further, using the output of tasks that were still running
when the job was killed (tasks with incomplete results) would let us fully
reuse the work from the killed job.

Again, there are some pre-conditions we have to meet, such as not deleting
the temporary directory of any failed task.
Using the data from the task's temporary directory, we can find the last
key that was successfully processed before the task was killed. Once we
know the key, the record reader can skip all KV pairs until it finds the
last key from the killed task.
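
The skipping logic could look roughly like the sketch below. It is plain
Java over an iterator of keys, not a real RecordReader, and the names
ResumeAfterKey and remaining are hypothetical. It assumes keys are unique
within a split and arrive in the same order on every run, and that the last
key found in the partial output was written completely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: models "skip until the last processed key is seen".
final class ResumeAfterKey {
    // Returns the keys that still need processing after lastKey.
    static <K> List<K> remaining(Iterator<K> keys, K lastKey) {
        List<K> rest = new ArrayList<>();
        // No last key means the killed task wrote nothing: process everything.
        boolean resumed = (lastKey == null);
        while (keys.hasNext()) {
            K key = keys.next();
            if (resumed) {
                rest.add(key);
            } else if (key.equals(lastKey)) {
                resumed = true; // lastKey itself is already in the old output
            }
        }
        if (!resumed) {
            // Fail loudly rather than silently skipping the whole split.
            throw new IllegalStateException("last key not found: " + lastKey);
        }
        return rest;
    }
}
```

Throwing when the key is never found is a design choice of this sketch: a
reader that quietly skipped the entire split would lose data without any
visible error.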

It is important to mention that this approach could work only for MR jobs
without reducers, since only then is the map output written directly to
HDFS instead of to the local file system.

I haven't tried this approach, so I cannot guarantee it will work. Please
let me know your ideas and thoughts about all this.


On 25 November 2011 02:58, Harsh J <harsh@cloudera.com> wrote:

> Samir,
> This should be possible. One way is:
> Your custom RecordReader initializations would need to check if a file
> exists before it tries to create one, and upon existence it needs to simply
> pass through with 0 records to map(…) -- thereby satisfying what you want
> to do.
> You may also want to remove the output directory existence checks from
> your subclassed FileOutputFormat (Override #checkOutputSpecs).
> On 25-Nov-2011, at 5:24 AM, Samir Eljazovic wrote:
> > Hi all,
> > I was wondering if there is an off-the-shelf solution to re-use the
> > output of a job which was killed when re-running the job?
> >
> > Here's my use-case: A job (with map phase only) is running and has 60% of
> > its work completed before it gets killed. Output files from successfully
> > completed tasks will be created in the specified output directory. The
> > next time I re-run this job using the same input data, I would like to
> > re-use those files to skip processing data which was already processed.
> >
> > Do you know if something similar exists and what would be the right way
> > to do it?
> >
> > Thanks,
> > Samir
