hadoop-common-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: Chaining M/R Jobs
Date Mon, 26 Apr 2010 19:00:31 GMT
You can use MultipleOutputs for this purpose, even though it was not
designed for this and a few people on this list are going to raise an
eyebrow.

Alex K
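
A minimal sketch of that route, using the newer mapreduce API's
MultipleOutputs (the old mapred API has an equivalent class with a
slightly different interface). The reducer wiring is standard; the
"custom" base name and the Text types are just illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class CustomNameReducer extends Reducer<Text, Text, Text, Text> {
      private MultipleOutputs<Text, Text> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          // Lands in the job output dir as custom-r-00000 (etc.)
          // instead of the default part-r-00000.
          mos.write(key, value, "custom");
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        mos.close();  // flush and close the side outputs
      }
    }

Empty default part-* files may still appear next to the named ones;
setting the job's output format through LazyOutputFormat in the driver
suppresses them.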

On Mon, Apr 26, 2010 at 11:39 AM, Xavier Stevens <Xavier.Stevens@fox.com> wrote:

> I don't usually bother renaming the files.  If you know you want all of
> the files, you just iterate over the files in the output directory from
> the previous job and then add those to the distributed cache.  If the
> data is fairly small, you can set the number of reducers to 1 on the
> previous step as well.
>
>
> -Xavier
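
A minimal sketch of the approach Xavier describes, assuming the first
job wrote its output to prevOutput (the class name and paths here are
illustrative; the FileSystem and DistributedCache calls are standard):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CachePreviousOutput {
      // Adds every part-* file from the previous job's output directory
      // to the distributed cache read by the follow-up job.
      public static void addPartFiles(Configuration conf, Path prevOutput)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(prevOutput)) {
          Path p = status.getPath();
          // Skip _SUCCESS, _logs, and other non-data entries.
          if (p.getName().startsWith("part-")) {
            DistributedCache.addCacheFile(p.toUri(), conf);
          }
        }
      }
    }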
>
>
> -----Original Message-----
> From: Eric Sammer [mailto:esammer@cloudera.com]
> Sent: Monday, April 26, 2010 11:33 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Chaining M/R Jobs
>
> The easiest way to do this is to write your job outputs to a known
> place and then use the FileSystem APIs to rename the part-* files to
> what you want them to be.
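
A minimal sketch of the rename Eric describes, assuming a single-reducer
job (the old mapred API writes part-00000; the new API writes
part-r-00000). The class name and newName parameter are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenamePartFile {
      // Renames part-00000 in outDir so a follow-up job can refer to
      // the file by a predictable name.
      public static void rename(Configuration conf, Path outDir,
          String newName) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path(outDir, "part-00000");
        Path dst = new Path(outDir, newName);
        if (!fs.rename(src, dst)) {
          throw new IOException("Failed to rename " + src + " to " + dst);
        }
      }
    }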
>
> On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <ti.veloso@gmail.com> wrote:
> > Hi,
> >
> > I'm trying to find a way to control the output file names. I need this
> > because I have a situation where I need to run a Job and then use its
> > output in the DistributedCache.
> >
> > So far the only way I've seen that makes it possible is rewriting the
> > OutputFormat class, but that seems like a lot of work for such a simple
> > task. Is there any way to do what I'm looking for?
> >
> > Tiago Veloso
> > ti.veloso@gmail.com
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
