hadoop-mapreduce-user mailing list archives

From Max Lebedev <ma...@actionx.com>
Subject Incrementally adding to existing output directory
Date Tue, 16 Jul 2013 18:02:53 GMT
Hi

I'm trying to figure out how to incrementally add to an existing output
directory using MapReduce.

I cannot specify the exact output path, because the input data is sorted into
categories and then written to different directories based on its contents
(token=AAAA or token=BBBB in the examples below).

As an example:

When using MultipleOutputs, and provided that outDir does not exist yet, the
following works:

hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir

The result will be:

outDir/token=AAAA/dt=2013-05-03/

outDir/token=BBBB/dt=2013-05-03/
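
For illustration, here is a minimal sketch of how a layout like the one above could be produced. It assumes the job passes a relative base path to MultipleOutputs.write(key, value, baseOutputPath); the post does not show the job code, so the class and method names below are hypothetical.

```java
// Hypothetical helper, not taken from the original job.
final class PartitionPaths {
    private PartitionPaths() {}

    /**
     * Builds a base path relative to the job output directory,
     * e.g. "token=AAAA/dt=2013-05-03/part". Hadoop appends a suffix
     * such as "-r-00000" to the last path component.
     */
    static String baseOutputPath(String token, String date) {
        return "token=" + token + "/dt=" + date + "/part";
    }
}
```

Inside a reducer this would be used roughly as multipleOutputs.write(key, value, PartitionPaths.baseOutputPath(token, date)), where token and date are derived from the record.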

However, the following fails because outDir already exists, even though
I am only adding new input:

hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir

This throws FileAlreadyExistsException.

What I would expect is for it to add:

outDir/token=AAAA/dt=2013-05-04/

outDir/token=BBBB/dt=2013-05-04/

Another possibility would be the following hack, but it does not seem
very elegant:

hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir

and then copy the results from tempOutDir to outDir.
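
The second step of this workaround could be sketched as follows. This is only an illustration on a local filesystem using java.nio; the class and method names are hypothetical. On HDFS the analogous calls would presumably be FileSystem.mkdirs and FileSystem.rename. The sketch moves rather than copies, and it assumes both directories are on the same filesystem so that moving a non-empty directory is a plain rename.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "write to temp, then merge" workaround.
final class IncrementalMerge {
    private IncrementalMerge() {}

    /**
     * Moves every tempOut/token=X/dt=Y directory to out/token=X/dt=Y.
     * Entries without the expected prefix (for example _SUCCESS) are skipped.
     * Throws FileAlreadyExistsException if a target partition already exists,
     * so existing data is never overwritten.
     */
    static void mergePartitions(Path tempOut, Path out) throws IOException {
        for (Path tokenDir : listDirs(tempOut, "token=")) {
            Path targetTokenDir = out.resolve(tokenDir.getFileName().toString());
            Files.createDirectories(targetTokenDir); // no-op if it already exists
            for (Path dateDir : listDirs(tokenDir, "dt=")) {
                Path target = targetTokenDir.resolve(dateDir.getFileName().toString());
                Files.move(dateDir, target); // no REPLACE_EXISTING: fail on conflict
            }
        }
    }

    /** Lists the subdirectories of dir whose names start with the given prefix. */
    private static List<Path> listDirs(Path dir, String prefix) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isDirectory(p) && p.getFileName().toString().startsWith(prefix)) {
                    result.add(p);
                }
            }
        }
        return result;
    }
}
```

Calling IncrementalMerge.mergePartitions(tempOutDir, outDir) after the job finishes would leave outDir/token=AAAA/dt=2013-05-03/ untouched and add the dt=2013-05-04 partitions next to it.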

Is there a better way to incrementally add to an existing Hadoop
output directory?
