hadoop-user mailing list archives

From Devaraj k <devara...@huawei.com>
Subject RE: Incrementally adding to existing output directory
Date Thu, 18 Jul 2013 02:51:32 GMT
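It seems the job is not picking up your CustomOutputFormat. You need to set the custom output
format class on the job using the org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends
OutputFormat> theClass) API. If you are using the newer org.apache.hadoop.mapreduce API, as the
stack trace below suggests, the corresponding call is Job.setOutputFormatClass(Class<? extends
OutputFormat> cls).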

If no OutputFormat is set for the job, it defaults to TextOutputFormat, which extends
FileOutputFormat. That is why the exception below is still thrown from FileOutputFormat.
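
The dispatch described above can be sketched with a toy model in plain Java. All four classes
below (ToyJob, ToyFileOutputFormat, ToyTextOutputFormat, ToyCustomOutputFormat) are hypothetical
stand-ins, not Hadoop's real classes. They only illustrate the assumed mechanism: submission runs
checkOutputSpecs on whichever format is registered on the job, so recording the output path alone
leaves the default existence check in place.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy stand-in for FileOutputFormat (hypothetical; not Hadoop's real class). */
class ToyFileOutputFormat {
    /** Rejects an output directory that already exists. */
    void checkOutputSpecs(Path outDir) {
        if (Files.exists(outDir)) {
            // Hadoop throws FileAlreadyExistsException here; an unchecked
            // exception keeps the toy short.
            throw new IllegalStateException("Output directory " + outDir + " already exists");
        }
    }
}

/** Toy stand-in for the default format: it inherits the existence check. */
class ToyTextOutputFormat extends ToyFileOutputFormat { }

/** Toy stand-in for a custom format that tolerates an existing directory. */
class ToyCustomOutputFormat extends ToyFileOutputFormat {
    @Override
    void checkOutputSpecs(Path outDir) {
        // Deliberately no existence check, so new data can be added to outDir.
    }
}

/** Toy stand-in for a job: it validates output with whichever format is registered. */
class ToyJob {
    // Used when no format is registered, mirroring the default described above.
    private ToyFileOutputFormat outputFormat = new ToyTextOutputFormat();
    private Path outputPath;

    void setOutputFormat(ToyFileOutputFormat format) { this.outputFormat = format; }

    // Only records the path; it does not change which format performs the check.
    void setOutputPath(Path path) { this.outputPath = path; }

    void submit() { outputFormat.checkOutputSpecs(outputPath); }
}
```

In this model, a ToyJob that only has its output path set still fails on an existing directory,
while a ToyJob with ToyCustomOutputFormat registered through setOutputFormat does not.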


Thanks
Devaraj k

From: Max Lebedev [mailto:max.l@actionx.com]
Sent: 18 July 2013 01:03
To: user@hadoop.apache.org
Subject: Re: Incrementally adding to existing output directory

Hi Devaraj,

Thank you very much for your help. I've created a CustomOutputFormat which is almost identical
to FileOutputFormat as seen here<http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>
except I've removed line 125 which throws the FileAlreadyExistsException. However, when I
try to run my code, I get this error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    ...
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In my source code, I've changed "FileOutputFormat.setOutputPath" to "CustomOutputFormat.setOutputPath".

Is it the case that FileOutputFormat.checkOutputSpecs is happening somewhere else, or have
I done something wrong?
I also don't quite understand your suggestion about MultipleOutputs. Would you mind elaborating?

Thanks,
Max Lebedev

On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <devaraj.k@huawei.com> wrote:
Hi Max,

  It can be done by customizing the output format class for your job according to your needs.
You could refer to the OutputFormat.checkOutputSpecs(JobContext context) method, which checks
the output specification. You can override this in your custom OutputFormat. You can also look
at the MultipleOutputs class for implementation details on how this could be done.

Thanks
Devaraj k

From: Max Lebedev [mailto:max.l@actionx.com]
Sent: 16 July 2013 23:33
To: user@hadoop.apache.org
Subject: Incrementally adding to existing output directory

Hi,

I'm trying to figure out how to incrementally add to an existing output directory using MapReduce.
I cannot specify the exact output path, because the input data is sorted into categories and
then written to different directories based on its contents (in the examples below, token=AAAA
or token=BBBB).

As an example: when using MultipleOutputs, and provided that outDir does not exist yet, the
following will work:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
The result will be:
outDir/token=AAAA/dt=2013-05-03/
outDir/token=BBBB/dt=2013-05-03/
However, the following will fail with a FileAlreadyExistsException because outDir already
exists, even though I am copying new inputs:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
What I would expect is that it adds:
outDir/token=AAAA/dt=2013-05-04/
outDir/token=BBBB/dt=2013-05-04/
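
A layout like the one in these examples could be produced by a small path helper. The sketch
below is hypothetical (the PartitionPaths class and its method names are not from this thread).
It assumes records are written through Hadoop's MultipleOutputs.write(key, value, baseOutputPath),
where baseOutputPath is resolved relative to the job's output directory; the helper itself is
plain Java and only builds the string.

```java
/** Hypothetical helper that builds partition paths such as "token=AAAA/dt=2013-05-04". */
final class PartitionPaths {
    private PartitionPaths() { }

    /** Relative directory for one category and one date. */
    static String relativeDir(String token, String date) {
        requireSegment(token, "token");
        requireSegment(date, "date");
        return "token=" + token + "/dt=" + date;
    }

    /**
     * Base path that could be passed as the baseOutputPath argument of
     * MultipleOutputs.write (an assumption about how the job writes its data).
     */
    static String baseOutputPath(String token, String date) {
        return relativeDir(token, date) + "/part";
    }

    // A value containing '/' would silently create an extra directory level.
    private static void requireSegment(String value, String name) {
        if (value == null || value.isEmpty() || value.contains("/")) {
            throw new IllegalArgumentException(name + " must be a non-empty path segment");
        }
    }
}
```

For example, PartitionPaths.relativeDir("BBBB", "2013-05-04") yields "token=BBBB/dt=2013-05-04",
which matches the second expected directory above once it is resolved under outDir.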
Another possibility would be the following hack, but it does not seem very elegant:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir
and then copy from tempOutDir to outDir.

Is there a better way to incrementally add to an existing Hadoop output directory?
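
The temporary-directory workaround mentioned above can be sketched as a merge step. This is a
minimal sketch that uses java.nio.file on a local filesystem as a stand-in; on HDFS the same
idea would go through Hadoop's FileSystem API or the hadoop fs shell instead. The OutputMerger
class is hypothetical, and skipping files whose names start with "_" (such as a _SUCCESS marker)
is an assumption about what should not be carried over.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that merges a temporary output tree into an existing one. */
final class OutputMerger {
    private OutputMerger() { }

    /**
     * Moves every regular data file under tempOutDir to the same relative
     * location under outDir, creating partition directories as needed.
     * Returns the number of files moved.
     */
    static int mergeInto(Path tempOutDir, Path outDir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(tempOutDir)) {
            files = walk
                .filter(Files::isRegularFile)
                // Assumption: marker files such as _SUCCESS are not data and are skipped.
                .filter(p -> !p.getFileName().toString().startsWith("_"))
                .collect(Collectors.toList());
        }
        for (Path source : files) {
            Path target = outDir.resolve(tempOutDir.relativize(source));
            Files.createDirectories(target.getParent());
            // Without REPLACE_EXISTING, move throws FileAlreadyExistsException
            // instead of overwriting data from an earlier run.
            Files.move(source, target);
        }
        return files.size();
    }
}
```

Because each run writes under its own dt=... directory, the moved files should not collide with
earlier output; if they do, the move fails rather than overwriting existing data.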

