hadoop-mapreduce-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: using output from one job as input to another
Date Thu, 03 Mar 2011 04:19:05 GMT

On Thu, Mar 3, 2011 at 7:51 AM, John Sanda <john.sanda@gmail.com> wrote:
> The output path created from the first job is a directory, and it is the file
> in that directory with a name like part-r-0000 that I want to feed as
> input into the second job. I am running in pseudo-distributed mode, so I know
> that the file name is going to be the same every run. But in a true
> distributed mode that file name will be different for each node. Moreover,

The default filenames of many OutputFormats start with "part" and are
not node dependent. In out1 you will get one file per reduce task, named
part-r-00000 through part-r-{num. of reduce tasks for your job, minus 1}.
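
To make the naming concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that generates the names one would expect to see in the output directory. The class and method names are hypothetical, and the five-digit zero padding is inferred from the part-r-00000 example above rather than taken from the Hadoop source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: it only mimics the naming pattern.
public class PartFileNames {

    // Name of the output file for one reduce partition (0-based index).
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    // All expected file names for a job with the given number of reduce tasks.
    static List<String> expectedOutputFiles(int numReduceTasks) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            names.add(partFileName(i));
        }
        return names;
    }

    public static void main(String[] args) {
        // With 3 reduce tasks the last file is part-r-00002, not part-r-00003.
        for (String name : expectedOutputFiles(3)) {
            System.out.println(name);
        }
    }
}
```

The names depend only on the partition index, never on which node ran the task.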

> when in distributed mode don't I want a uniform view of that output file
> which will be spread across my cluster? Is there something wrong in my code?
> Or can someone point me to some examples that do this?

I do not understand what you mean by a uniform view. Using a directory
as the input for a job is perfectly acceptable and a normal thing to do
in file-based MR. The directory forms the whole input, with each file
in it containing a small "part" of it. I do not see anything grossly
wrong in the code you provided.
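
As a rough local-filesystem analogy of "the directory is the input", the sketch below (plain Java, hypothetical class and method names) collects every regular file in a directory and treats their concatenated lines as one input. It skips names starting with "_" or ".", which is how Hadoop's FileInputFormat treats hidden files by default. In a real driver you would not write this yourself; you would simply pass the first job's output directory to FileInputFormat.addInputPath for the second job.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local analogy, not Hadoop code.
public class DirectoryAsInput {

    // Files such as _SUCCESS or .part-r-00000.crc are not treated as input.
    static boolean isHidden(String fileName) {
        return fileName.startsWith("_") || fileName.startsWith(".");
    }

    // Every visible regular file in the directory is one "part" of the input.
    static List<Path> inputFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && !isHidden(p.getFileName().toString())) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files); // deterministic order for the demo only
        return files;
    }

    // The whole input is the lines of all parts taken together.
    static List<String> readAll(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path p : inputFiles(dir)) {
            lines.addAll(Files.readAllLines(p));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "out1" is the first job's output directory name used in this thread.
        Path dir = Paths.get(args.length > 0 ? args[0] : "out1");
        System.out.println(readAll(dir).size() + " lines of input in " + dir);
    }
}
```

The second job never needs to know how many part files exist or what they are called.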

Harsh J
