hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Running parallel jobs having the same output directory
Date Tue, 21 Jul 2009 00:02:45 GMT
There's likely another gotcha: various logs and job config files are written
to the _logs directory under the output directory, so you'd need to uniquify
that as well. There may be other traps, but I don't know them :)

This might be a bit of a frustrating endeavour, since you're trying to
override behaviour that's been baked into Hadoop for a while. Why in
particular do you need all your jobs to emit to a common directory? You could
probably save yourself some headache by writing to subdirectories of a common
dir.

e.g., rather than having jobs 0..n all write to /user/foo/commonoutput, have
them write to /user/foo/outputs/0, /user/foo/outputs/1, etc.

If you need to collect the various outputs together to use in a subsequent
MR job, you can use FileInputFormat.addInputPath() multiple times on the
various directories. Or you could modify other downstream logic of yours to
either recursively descend a level into a hierarchy, or use
FileSystem.rename() to move the files from the different directories into a
single aggregate directory after all the jobs have succeeded.
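The rename-based aggregation can be sketched in plain Java against a local
filesystem; this is a minimal, hypothetical analogue of doing
FileSystem.rename() on HDFS, assuming a layout like /user/foo/outputs/0,
/user/foo/outputs/1, etc. (the class and method names here are illustrative,
not part of Hadoop):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AggregateOutputs {

    // Move every part file from the per-job output dirs (outputs/0,
    // outputs/1, ...) into a single aggregate directory, mirroring what
    // FileSystem.rename() would do on HDFS after all jobs have succeeded.
    public static void aggregate(Path outputsRoot, Path aggregateDir)
            throws IOException {
        Files.createDirectories(aggregateDir);
        try (DirectoryStream<Path> jobDirs =
                 Files.newDirectoryStream(outputsRoot)) {
            for (Path jobDir : jobDirs) {
                if (!Files.isDirectory(jobDir)) continue;
                String jobName = jobDir.getFileName().toString();
                try (DirectoryStream<Path> parts =
                         Files.newDirectoryStream(jobDir)) {
                    for (Path part : parts) {
                        // Prefix with the job name so part-00000 files from
                        // different jobs don't collide in the aggregate dir.
                        Files.move(part, aggregateDir.resolve(
                            jobName + "-" + part.getFileName()));
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        aggregate(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Note the per-job prefix on each moved file: every reducer writes part-00000,
part-00001, etc., so files from different jobs would otherwise collide once
they land in one directory.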

- Aaron

On Mon, Jul 20, 2009 at 11:51 AM, Thibaut_ <tbritz@blue.lu> wrote:

> Hi,
> I'm trying to run a few parallel jobs which have the same input directory
> and the same output directory.
> I modified the FileInputClass to check for non-zero files, and also
> modified the output class to allow non-empty directories (the input
> directory = output directory in my case). I made sure that each job's
> output is unique, so there are no file conflicts there.
> Everything runs fine for a while, but I'm having problems with the
> temporary directory:
> java.io.IOException: The temporary job-output directory
> hdfs://internal1:50010/user/root/0/_temporary doesn't exist!
> I could go further down and try to make the _temporary directory
> job-dependent. But before I do that, I would like to know: are there other
> traps/errors I could run into when running parallel jobs that share the
> same output/input directory?
> (Btw this is hadoop-0.20.0)
> Thanks,
> Thibaut
> --
> View this message in context:
> http://www.nabble.com/Running-parallel-jobs-having-the-same-output-directory-tp24575402p24575402.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
