Subject: Re: MultipleOutputs or Partitioner
From: Sonal Goyal
To: mapreduce-user@hadoop.apache.org
Date: Mon, 10 May 2010 20:59:53 +0530

Hi Alan,

You can use MultipleOutputFormat. You can override the generateFileName... methods to get the functionality you want.

A partitioner controls how data moves from the mappers to the reducers, so if you take that approach you will have to set the number of reducers to the number of files you want, which is not the best option if some days have more data than others. You also don't have control over the file names.

See Tom White's Hadoop: The Definitive Guide for an excellent example and usage.
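For reference, a minimal sketch of the kind of subclass described above (old "mapred" API, which is where MultipleTextOutputFormat lives in 0.20.x). It assumes Text keys shaped like the ones in your mail (hostA_VarX_2010-05-01_morning); the class name and the field positions are illustrative, not something from this thread:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Hypothetical subclass: routes each record to a file named after the
    // day/period portion of the key, e.g. "2010-05-01_morning".
    public class DayPeriodTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // Key layout assumed from the question: host_Var_date_period
            String[] parts = key.toString().split("_");
            String day = parts[2];      // e.g. 2010-05-01
            String period = parts[3];   // e.g. morning or afternoon
            return day + "_" + period;  // all hosts/vars for that slot share one file
        }
    }

You would then wire it into the job with something like conf.setOutputFormat(DayPeriodTextOutputFormat.class) on the JobConf.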
Thanks and Regards,
Sonal
www.meghsoft.com

On Mon, May 10, 2010 at 5:38 PM, Some Body wrote:
> Hi,
>
> I'm trying to understand how to generate multiple outputs in my reducer
> (using 0.20.2+228).
> Do I need MultipleOutput or should I partition my output in the mapper?
>
> My reducer currently gets key/val input pairs like this which all end up in
> my part_r_0000 file.
>
> hostA_VarX_2010-05-01_morning
> hostA_VarY_2010-05-01_morning
> hostA_VarX_2010-05-01_afternoon
> hostA_VarY_2010-05-01_afternoon
> .....
> hostB_VarX_2010-05-01_morning
> hostB_VarY_2010-05-01_morning
> hostB_VarX_2010-05-01_afternoon
> hostB_VarY_2010-05-01_afternoon
> .....
> hostA_VarX_2010-05-02_morning
> hostA_VarY_2010-05-02_morning
> hostA_VarX_2010-05-02_afternoon
> hostA_VarY_2010-05-02_afternoon
> .....
> hostB_VarX_2010-05-02_morning
> hostB_VarY_2010-05-02_morning
> hostB_VarX_2010-05-02_afternoon
> hostB_VarY_2010-05-02_afternoon
> .....
>
> But instead of 1 output file I want one output file per day/group, e.g.
> 2010-05-01_morning.txt
> 2010-05-01_afternoon.txt
>
> Each _