Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of rkothari_iit@hotmail.com
 designates 65.54.190.14 as permitted sender)
Message-ID: <BAY116-W143747E1D7E0961A5EC7F9F4190@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_41e1e4bf-42ae-42d6-a3c0-8500df5d3778_"
From: rakesh kothari <rkothari_iit@hotmail.com>
To: <mapreduce-user@hadoop.apache.org>
Subject: RE: Partitioning Reducer Output
Date: Mon, 5 Apr 2010 16:04:12 -0700
Importance: Normal
In-Reply-To: <102398.14212.qm@web38102.mail.mud.yahoo.com>
References: 
 <4BB9F518.6070003@darose.net>,<102398.14212.qm@web38102.mail.mud.yahoo.com>
MIME-Version: 1.0

--_41e1e4bf-42ae-42d6-a3c0-8500df5d3778_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Thanks for the insights.

My use case is more about sending the reducer output to subdirectories
that represent date partitions.

For example, if the base reducer output directory is /hdfs/root/reducer/
and the reducer encounters two records, one timestamped 2010/01/01 and
the other 2010/01/02, then the records are written to files in the
directories "/hdfs/root/reducer/2010/01/01" and
"/hdfs/root/reducer/2010/01/02" respectively.
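
The mapping described above can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical and not part of Hadoop, and interpreting timestamps in UTC is an assumption, since no time zone is given here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Hadoop): maps a record's timestamp to a
// date-partitioned subdirectory under a base output directory.
class DatePartitionPath {
    // yyyy/MM/dd in UTC is an assumption; adjust the zone to match the data.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // Returns baseDir followed by the yyyy/MM/dd path for the timestamp.
    static String subdirFor(String baseDir, long epochMillis) {
        String base = baseDir.endsWith("/") ? baseDir : baseDir + "/";
        return base + DAY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long jan1 = Instant.parse("2010-01-01T12:00:00Z").toEpochMilli();
        long jan2 = Instant.parse("2010-01-02T12:00:00Z").toEpochMilli();
        System.out.println(subdirFor("/hdfs/root/reducer/", jan1));
        System.out.println(subdirFor("/hdfs/root/reducer/", jan2));
    }
}
```

A reducer could call something like subdirFor for each record to decide which file the record belongs in.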

MultipleTextOutputFormat was designed to support such use cases, but
it's not ported to 0.20.1. I was hoping there is a workaround.

Thanks,
-Rakesh

Date: Mon, 5 Apr 2010 08:45:13 -0700
From: erez_katz@yahoo.com
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org

A partitioner can be used to control how keys are distributed across
reducers (overriding the default hash(key) % num_of_reducers behavior).
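
As a rough illustration of that default behavior, here is a standalone Java sketch. It does not use the Hadoop Partitioner class, and the class and method names are made up for the example; masking the sign bit is there so that a negative hashCode() still gives a valid reducer index.

```java
// Plain-Java sketch of hash partitioning; it does not depend on Hadoop.
class HashPartitionSketch {
    // Clear the sign bit, then take the remainder, so the result is always
    // in the range [0, numReducers).
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"2010/01/01", "2010/01/02"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, 3));
        }
    }
}
```

A custom partitioner replaces this function with its own rule, but it still only decides which reducer sees a key, not where that reducer writes its output.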

I think Rakesh is asking about having multiple "types" of output from a
single map-reduce application.

Each reducer has a temporary work directory on HDFS (pointed to by
mapred.work.output.dir in the JobConf, or by the environment variable
mapred_work_output_dir if it is a streaming app).
The contents of that folder, for a reducer that completes successfully,
are moved to the actual output folder of the job.

A reducer can create other files in that folder, and provided that there
are no name collisions between reducers (for example, if the reducer
number is appended to the file name), the output folder can contain
multiple types of output, something like:

part-00000
part-00001
part-00002
otherType-00000
otherType-00001
otherType-00002

and later on these files can be moved around to other folders...
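
One way to get collision-free names like the ones listed above is sketched below in plain Java. The helper class and its methods are hypothetical, and workDir simply stands in for whatever mapred.work.output.dir points to in the running task.

```java
import java.util.Locale;

// Hypothetical helper: builds per-reducer side-file names in the style of
// part-00000, so files written by different reducers never collide.
class SideFileName {
    // Appends the zero-padded reducer number to the output type.
    static String nameFor(String type, int reducerNumber) {
        return String.format(Locale.ROOT, "%s-%05d", type, reducerNumber);
    }

    // workDir stands in for the value of mapred.work.output.dir.
    static String pathFor(String workDir, String type, int reducerNumber) {
        String dir = workDir.endsWith("/") ? workDir : workDir + "/";
        return dir + nameFor(type, reducerNumber);
    }

    public static void main(String[] args) {
        for (int r = 0; r < 3; r++) {
            System.out.println(nameFor("otherType", r));
        }
    }
}
```

Using a date such as 2010-01-01 as the type would give one file per date per reducer, which can then be moved into the matching date directory after the job finishes.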

Hope it helps,

  Erez Katz


--- On Mon, 4/5/10, David Rosenstrauch <darose@darose.net> wrote:

From: David Rosenstrauch <darose@darose.net>
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org
Date: Monday, April 5, 2010, 7:35 AM

On 04/02/2010 08:32 PM, rakesh kothari wrote:
>
> Hi,
>
> What's the best way to partition data generated from Reducer into multiple
> directories in Hadoop 0.20.1. I was thinking of using MultipleTextOutputFormat
> but that's not backward compatible with other API's in this version of
> hadoop.
>
> Thanks,
> -Rakesh

Use a partitioner?

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Job.html#setPartitionerClass%28java.lang.Class%29

HTH,

DR


--_41e1e4bf-42ae-42d6-a3c0-8500df5d3778_--