From: rakesh kothari
Reply-To: mapreduce-user@hadoop.apache.org
Subject: RE: Partitioning Reducer Output
Date: Mon, 5 Apr 2010 16:04:12 -0700
In-Reply-To: <102398.14212.qm@web38102.mail.mud.yahoo.com>
References: <4BB9F518.6070003@darose.net>,<102398.14212.qm@web38102.mail.mud.yahoo.com>

Thanks for the insights.

My use case is more about sending the reducer output to subdirectories representing date partitions.

For example, if the base reducer output directory is /hdfs/root/reducer/ and the reducer encounters two records, one timestamped with date 2010/01/01 and the other with date 2010/01/02, then the records are written to files in the directories "/hdfs/root/reducer/2010/01/01" and "/hdfs/root/reducer/2010/01/02" respectively.
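
To make the intended layout concrete, here is a minimal sketch in plain Java (no Hadoop classes) of how a record timestamp could be mapped to its date subdirectory. The class and method names are hypothetical, and treating the timestamp as epoch milliseconds in UTC is an assumption; the actual records may carry dates in another form.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DatePartitionPath {
    // Hypothetical helper: maps a record timestamp (assumed to be epoch
    // milliseconds, interpreted in UTC) to "<baseDir>/yyyy/MM/dd".
    public static String directoryFor(String baseDir, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        // Drop a trailing slash so the result never contains "//".
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/" + fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long jan1 = 1262304000000L;              // 2010-01-01T00:00:00Z
        long jan2 = jan1 + 24L * 60 * 60 * 1000; // one day later
        System.out.println(directoryFor("/hdfs/root/reducer/", jan1));
        System.out.println(directoryFor("/hdfs/root/reducer/", jan2));
    }
}
```

The sketch only computes the target directory; how the reducer actually writes into it is the open question of this thread.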

MultipleTextOutputFormat was designed to support such use cases, but it has not been ported to 0.20.1. I was hoping there is a workaround.

Thanks,
-Rakesh


Date: Mon, 5 Apr 2010 08:45:13 -0700
From: erez_katz@yahoo.com
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org

A partitioner can be used to control how keys are distributed across reducers (overriding the default hash(key) % num_of_reducers behavior).
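
As a rough illustration of that default and of what an override might look like, here is a sketch in plain Java with no Hadoop dependency. The class and method names are hypothetical. Masking the hash with Integer.MAX_VALUE keeps the result non-negative, which as far as I know is also what the stock hash partitioner does. The date-prefixed key layout ("yyyy/MM/dd" followed by the rest of the key) is purely an assumption for the example.

```java
public class PartitionSketch {
    // Mirrors the default hash(key) % num_of_reducers behavior. The mask
    // clears the sign bit so a negative hashCode cannot yield a negative index.
    public static int defaultPartition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical override: partition only on the leading "yyyy/MM/dd"
    // part of the key, so every record for one date reaches the same reducer.
    public static int datePartition(String key, int numReducers) {
        String date = key.length() >= 10 ? key.substring(0, 10) : key;
        return defaultPartition(date, numReducers);
    }

    public static void main(String[] args) {
        int reducers = 4;
        System.out.println(datePartition("2010/01/01#recordA", reducers));
        System.out.println(datePartition("2010/01/01#recordB", reducers));
        System.out.println(datePartition("2010/01/02#recordC", reducers));
    }
}
```

Note that this only decides which reducer receives a key; it does not by itself control which directory the reducer writes to.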

I think Rakesh is asking about having multiple "types" of output from a single map-reduce application.

Each reducer has a temporary work directory on HDFS (pointed to in the JobConf by mapred.work.output.dir, or by the environment variable mapred_work_output_dir if it is a streaming app).
When a reducer completes successfully, the content of that folder is moved to the actual output folder.

A reducer can create other files in that folder, and provided that there are no name collisions between reducers (for example, if the reducer number is appended to the file name), the output folder can contain multiple types of outputs, something like:

part-00000
part-00001
part-00002
otherType-00000
otherType-00001
otherType-00002

and later on these files can be moved around to other folders...
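
The naming scheme above could be sketched like this in plain Java. The class and method names are hypothetical, and the work directory passed in main is a made-up placeholder; in a real job the value would come from mapred.work.output.dir.

```java
import java.util.Locale;

public class SideFileNames {
    // Builds "<type>-NNNNN" with the reducer number zero-padded to five
    // digits, matching the part-00000 style so reducers never collide.
    public static String sideFileName(String outputType, int reducerNumber) {
        return String.format(Locale.ROOT, "%s-%05d", outputType, reducerNumber);
    }

    // Joins the reducer's work directory with the side-file name.
    public static String sideFilePath(String workOutputDir, String outputType, int reducerNumber) {
        String dir = workOutputDir.endsWith("/")
                ? workOutputDir.substring(0, workOutputDir.length() - 1)
                : workOutputDir;
        return dir + "/" + sideFileName(outputType, reducerNumber);
    }

    public static void main(String[] args) {
        // Placeholder path for illustration only.
        String workDir = "/tmp/example-work-output-dir";
        for (int reducer = 0; reducer < 3; reducer++) {
            System.out.println(sideFilePath(workDir, "otherType", reducer));
        }
    }
}
```

For the date-partition case, the output type could just as well be a date string, with the files moved into their per-date directories after the job finishes.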

Hope it helps,

Erez Katz


--- On Mon, 4/5/10, David Rosenstrauch <darose@darose.net> wrote:

From: David Rosenstrauch <darose@darose.net>
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org
Date: Monday, April 5, 2010, 7:35 AM

On 04/02/2010 08:32 PM, rakesh kothari wrote:
>
> Hi,
>
> What's the best way to partition data generated from Reducer into multiple
> directories in Hadoop 0.20.1. I was thinking of using MultipleTextOutputFormat
> but that's not backward compatible with other API's in this version of
> hadoop.
>
> Thanks,
> -Rakesh

Use a partitioner?

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Job.html#setPartitionerClass%28java.lang.Class%29

HTH,

DR


