hadoop-common-user mailing list archives

From Xavier Stevens <Xavier.Stev...@fox.com>
Subject RE: Chaining M/R Jobs
Date Mon, 26 Apr 2010 18:39:27 GMT
I don't usually bother renaming the files.  If you know you want all of
the files, you can just iterate over the files in the output directory of
the previous job and add each one to the distributed cache.  If the data
is fairly small, you can also set the number of reducers to 1 on the
previous step so that it produces a single output file.
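
A minimal sketch of that approach, assuming the old
org.apache.hadoop.filecache.DistributedCache API and that the previous
job wrote ordinary part-* files.  The class and method names
(CacheJobOutput, addOutputToCache) are made up for illustration and are
not part of Hadoop.

```java
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Hadoop.
public class CacheJobOutput {

    /**
     * Adds every part-* file found directly under outputDir to the
     * distributed cache settings of conf (the configuration of the NEXT job).
     * Returns the URIs that were added.
     */
    public static List<URI> addOutputToCache(Configuration conf, Path outputDir)
            throws IOException {
        FileSystem fs = outputDir.getFileSystem(conf);
        List<URI> added = new ArrayList<URI>();

        // Matching on part-* skips entries such as _logs or _SUCCESS.
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-*"));
        if (parts == null) {
            return added; // nothing matched, e.g. the directory is missing
        }
        for (FileStatus status : parts) {
            // isDir() is the older accessor; newer releases also offer isDirectory().
            if (status.isDir()) {
                continue;
            }
            URI uri = status.getPath().toUri();
            DistributedCache.addCacheFile(uri, conf);
            added.add(uri);
        }
        return added;
    }
}
```

If the previous job was run with a single reducer (for example via
JobConf.setNumReduceTasks(1)), the loop above would simply find one file.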


-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com] 
Sent: Monday, April 26, 2010 11:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Chaining M/R Jobs

The easiest way to do this is to write your job outputs to a known
place and then use the FileSystem APIs to rename the part-* files to
what you want them to be.
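
A rough sketch of the rename step, assuming the job output consists of
plain part-* files.  RenameJobOutput, renameParts and the prefix argument
are hypothetical names chosen for illustration; only FileSystem.globStatus
and FileSystem.rename are actual Hadoop calls.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Hadoop.
public class RenameJobOutput {

    /**
     * Renames each part-* file under outputDir so that the leading "part"
     * is replaced by prefix, e.g. part-00000 becomes lookup-00000.
     * Returns the number of files renamed.
     */
    public static int renameParts(Configuration conf, Path outputDir, String prefix)
            throws IOException {
        FileSystem fs = outputDir.getFileSystem(conf);
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-*"));
        if (parts == null) {
            return 0; // nothing matched
        }
        int renamed = 0;
        for (FileStatus status : parts) {
            Path src = status.getPath();
            // Keep everything after "part", such as "-00000".
            String suffix = src.getName().substring("part".length());
            Path dst = new Path(outputDir, prefix + suffix);
            // rename() reports failure through its return value.
            if (!fs.rename(src, dst)) {
                throw new IOException("Could not rename " + src + " to " + dst);
            }
            renamed++;
        }
        return renamed;
    }
}
```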

On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <ti.veloso@gmail.com> wrote:
> Hi,
> I'm trying to find a way to control the output file names. I need this
> because I have a situation where I need to run a Job and then use its
> output in the DistributedCache.
> So far the only way I've seen that makes it possible is rewriting the
> OutputFormat class, but that seems like a lot of work for such a simple
> task. Is there any way to do what I'm looking for?
> Tiago Veloso
> ti.veloso@gmail.com

Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
