hadoop-mapreduce-user mailing list archives

From Devaraj k <devara...@huawei.com>
Subject RE: Incrementally adding to existing output directory
Date Wed, 17 Jul 2013 01:42:07 GMT
Hi Max,

  It can be done by customizing the output format class for your job.
You could refer to the OutputFormat.checkOutputSpecs(JobContext context) method, which checks
the output specification. You can override this in your custom OutputFormat. You can also see
the MultipleOutputs class for implementation details on how it could be done.
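
A minimal sketch of such an override, assuming the job writes text output through the new (org.apache.hadoop.mapreduce) API. The class name AppendableTextOutputFormat is hypothetical and not part of Hadoop. The stock FileOutputFormat.checkOutputSpecs throws FileAlreadyExistsException when the output directory exists; this version keeps the other checks and only drops that one.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.security.TokenCache;

/**
 * Hypothetical output format that accepts an output directory that already exists.
 * Sketch only: it does not protect existing files from being overwritten.
 */
public class AppendableTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        // Still require that an output directory has been configured.
        Path outDir = getOutputPath(job);
        if (outDir == null) {
            throw new InvalidJobConfException("Output directory not set.");
        }

        // Obtain delegation tokens for the output file system, as the
        // stock implementation does (needed on secure clusters).
        TokenCache.obtainTokensForNamenodes(
                job.getCredentials(), new Path[] {outDir}, job.getConfiguration());

        // Deliberately no "directory already exists" check here, so a later
        // run can add new token=.../dt=... subdirectories under outDir.
    }
}
```

The job would then select it with job.setOutputFormatClass(AppendableTextOutputFormat.class). Skipping the check removes a safety net: a run that writes the same file names into the same subdirectory can collide with earlier output, so this is probably only reasonable when each run writes to distinct subdirectories, such as a new dt= value.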

Thanks
Devaraj k

From: Max Lebedev [mailto:max.l@actionx.com]
Sent: 16 July 2013 23:33
To: user@hadoop.apache.org
Subject: Incrementally adding to existing output directory

Hi
I'm trying to figure out how to incrementally add to an existing output directory using MapReduce.
I cannot specify the exact output path, as data in the input is sorted into categories and
then written to different directories based on the contents (in the examples below, token=AAAA
or token=BBBB).
As an example:
When using MultipleOutputs, and provided that outDir does not exist yet, the following will
work:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
The result will be:
outDir/token=AAAA/dt=2013-05-03/
outDir/token=BBBB/dt=2013-05-03/
However, the following will fail because outDir already exists, even though I am adding new
inputs:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
It will throw FileAlreadyExistsException.
What I would expect is that it adds
outDir/token=AAAA/dt=2013-05-04/
outDir/token=BBBB/dt=2013-05-04/
Another possibility would be the following hack, but it does not seem very elegant:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir
then copy from tempOutDir to outDir
Is there a better way to address incrementally adding to an existing hadoop output directory?
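
The merge step of the temporary-directory workaround above can be sketched as follows. This is an illustration only: it uses java.nio.file on a local file system as a stand-in for HDFS, and the class and method names (PartitionMerger, mergePartitions) are hypothetical. On HDFS the same logic would presumably use FileSystem.listStatus, FileSystem.mkdirs and FileSystem.rename instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartitionMerger {

    /**
     * Moves every tempOutDir/<token>/<dt> directory to outDir/<token>/<dt>.
     * Throws FileAlreadyExistsException instead of overwriting when a
     * destination partition is already present.
     */
    public static void mergePartitions(Path tempOutDir, Path outDir) throws IOException {
        for (Path tokenDir : subdirectories(tempOutDir)) {
            // Create outDir/token=XXXX if this is the first run for that token.
            Path targetTokenDir = outDir.resolve(tokenDir.getFileName());
            Files.createDirectories(targetTokenDir);

            for (Path dateDir : subdirectories(tokenDir)) {
                // No REPLACE_EXISTING option: an existing partition is an error.
                Files.move(dateDir, targetTokenDir.resolve(dateDir.getFileName()));
            }
        }
    }

    /** Lists the immediate subdirectories of dir, collected before any move happens. */
    private static List<Path> subdirectories(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, Files::isDirectory)) {
            for (Path entry : stream) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

Moving whole dt= directories rather than individual files keeps each partition either fully present or absent in outDir, provided the move is a rename within one file system.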
