From: Aaron Kimball
To: mapreduce-user@hadoop.apache.org
Reply-To: mapreduce-user@hadoop.apache.org
Date: Tue, 8 Dec 2009 13:39:02 -0800
Subject: Re: Does Using MultipleTextOutputFormat Require the Deprecated API?
Geoffry,

There are two MultipleOutputs implementations: one for the new API, one for the old.

The new-API version (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) does not have a getCollector() method. It is intended to work with org.apache.hadoop.mapreduce.Mapper and its associated Context object.

The old-API implementation of MultipleOutputs (org.apache.hadoop.mapred.lib.MultipleOutputs) is intended to work with org.apache.hadoop.mapred.Mapper, Reporter, and friends.

If you're going to use the new org.apache.hadoop.mapreduce-based code, you should not need to import anything in the mapred package. That having been said -- I just realized that the new-API-compatible MultipleOutputs implementation is not in Hadoop 0.20; it's only in the unreleased 0.21. If you're using 0.20, you should probably stick with the old API for your process.

Cheers,
- Aaron

On Tue, Dec 8, 2009 at 12:40 PM, Geoffry Roberts <geoffry.roberts@gmail.com> wrote:
> All,
>
> This one has me stumped.
>
> What I want to do is output multiple files from my reducer, one for each
> key value. I also want to avoid any deprecated parts of the API.
>
> As suggested, I switched from using MultipleTextOutputFormat to
> MultipleOutputs but have run into an impasse. MultipleOutputs' getCollector
> method requires a Reporter as a parameter, but as far as I can tell, the API
> doesn't support this. The only Reporter I can find is in the context
> object, but it is declared protected.
>
> Am I stuck, or just missing something?
>
> My code:
>
>     @Override
>     public void reduce(Text key, Iterable<Text> values, Context context)
>             throws IOException {
>         String fileName = key.toString();
>         MultipleOutputs.addNamedOutput((JobConf) context.getConfiguration(),
>                 fileName, OutputFormat.class, Text.class, Text.class);
>         mos = new MultipleOutputs((JobConf) context.getConfiguration());
>         for (Text line : values) {
>             // This is the problem line:
>             mos.getCollector(fileName, <reporter goes here>).collect(key, line);
>         }
>         mos.close();
>     }
>
> On Mon, Oct 5, 2009 at 11:17 AM, Aaron Kimball <aaron@cloudera.com> wrote:
>
>> Geoffry,
>>
>> The new API comes with a related OutputFormat, called MultipleOutputs
>> (o.a.h.mapreduce.lib.output.MultipleOutputs). You may want to look into
>> using this instead.
>>
>> - Aaron
>>
>> On Tue, Sep 29, 2009 at 4:44 PM, Geoffry Roberts <
>> geoffry.roberts@gmail.com> wrote:
>>
>>> All,
>>>
>>> What I want to do is output multiple files from my reducer, one for each
>>> key value.
>>>
>>> Can this still be done in the current API?
>>>
>>> It seems that using MultipleTextOutputFormat requires one to use
>>> deprecated parts of the API.
>>>
>>> Is this correct?
>>>
>>> I would like to use the class or its equivalent and stay off anything
>>> deprecated.
>>>
>>> Is there a workaround?
>>>
>>> In the current API one uses Job and a class derived from the class
>>> org.apache.hadoop.mapreduce.OutputFormat. MultipleTextOutputFormat does
>>> not derive from this class.
>>>
>>> Job.setOutputFormatClass(Class<? extends org.apache.hadoop.mapreduce.OutputFormat>);
>>>
>>> In the old, deprecated API, one uses JobConf and an implementation of the
>>> interface org.apache.hadoop.mapred.OutputFormat. MultipleTextOutputFormat
>>> is just such an implementation.
>>>
>>> JobConf.setOutputFormat(Class<? extends org.apache.hadoop.mapred.OutputFormat>);
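[Editor's note: Aaron's advice -- stick with the old org.apache.hadoop.mapred API on Hadoop 0.20 -- can be sketched roughly as below. This is a hypothetical reconstruction, not code from the thread: the class name NamedOutputReducer and the output name "perkey" are illustrative. It also corrects two issues in the quoted code: named outputs must be registered in the driver before the job is submitted (not inside reduce()), and the Reporter that getCollector() needs is simply the one the old-API reduce() signature already passes in.]

```java
// Sketch of per-named-output writing with the old (org.apache.hadoop.mapred)
// API, as suggested for Hadoop 0.20. Class and output names are illustrative.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class NamedOutputReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        // The named output must already have been declared in the driver,
        // before job submission, e.g.:
        //   MultipleOutputs.addNamedOutput(conf, "perkey",
        //       TextOutputFormat.class, Text.class, Text.class);
        // Named-output names are restricted to alphanumeric characters, so
        // arbitrary key strings cannot be used as names directly.
        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The old-API reduce() receives the Reporter directly -- exactly
        // what getCollector() wants.
        while (values.hasNext()) {
            mos.getCollector("perkey", reporter).collect(key, values.next());
        }
    }

    @Override
    public void close() throws IOException {
        // Flushes and closes all the underlying collectors.
        mos.close();
    }
}
```

For genuinely per-key file names (the original goal), the old-API MultipleOutputs also offers multi-named outputs -- declared with addMultiNamedOutput() and written via the getCollector(name, part, reporter) overload -- which may be a closer fit than one pre-declared named output per key.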