From: Amogh Vasekar
To: mapreduce-user@hadoop.apache.org
Date: Thu, 21 Jan 2010 17:46:40 +0530
Subject: Re: chained mappers & reducers

Unless you can somehow guarantee that a certain output key K1 comes only from reducer R1 (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no "in-built" mechanism for reducers to exchange data :)
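For illustration, a minimal sketch of what that subsequent job could look like with the 0.20 "mapreduce" API, assuming the first (M1-R1) job wrote its (K2, count) pairs as Text/LongWritable in a SequenceFile; SumByK2Job and SumReducer are made-up names for this sketch, not Hadoop classes:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumByK2Job {

  // R2: sum the per-group counts produced by R1.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text k2, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(k2, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sum counts by K2");
    job.setJarByClass(SumByK2Job.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(Mapper.class);        // M2: the base Mapper is already a pass-through
    job.setReducerClass(SumReducer.class);   // R2
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of the M1-R1 job
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}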

Amogh


On 1/21/10 12:30 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:

The use case is this: M1-R1-R2
 
M1: generate K1-V1 pairs from input
R1: group by K1, generate new keys K2 from the group, with value V2, a count
M2: identity pass-through
R2: sum counts by K2
 
In short, R1 does this:
groups data by the K1 defined by M1
emits new keys K2, derived from the group it built
each key K2 has a count
 
R2 sums the counts for each K2
 
The output of R1 could be fed directly into R2. But I can't find a way to do that in Hadoop. So I have to create a second job, which has to have a Map phase, so I create a pass-through mapper. This works but it has a lot of overhead. It would be faster & cleaner to run R1 directly into R2 within the same job, if possible.
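Spelled out as code, R1 in this description might look roughly like the sketch below; the String/Text key types and the deriveK2 helper are made-up placeholders, since the actual derivation of K2 from the group isn't shown in the thread:

// Sketch of R1: consumes the (K1, V1) groups produced by M1 and, for each group,
// emits one or more derived keys K2 together with a count.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupToCountReducer
    extends Reducer<Text, Text, Text, LongWritable> {

  @Override
  protected void reduce(Text k1, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // Materialize the group, then derive K2 keys and count occurrences per K2.
    List<String> group = new ArrayList<String>();
    for (Text v : values) group.add(v.toString());

    Map<String, Long> countsByK2 = new HashMap<String, Long>();
    for (String v1 : group) {
      String k2 = deriveK2(k1.toString(), v1, group);
      Long prev = countsByK2.get(k2);
      countsByK2.put(k2, prev == null ? 1L : prev + 1L);
    }
    for (Map.Entry<String, Long> e : countsByK2.entrySet()) {
      ctx.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }

  // Placeholder: whatever "new keys derived from the group" means goes here.
  private String deriveK2(String k1, String v1, List<String> group) {
    return k1 + ":" + v1;
  }
}

Because the same K2 can come out of many different K1 groups, and those groups land on different R1 tasks, summing by K2 needs another shuffle, which is exactly the point made above.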
 
 

From: mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org [mailto:mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org] On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Hi,
Can you elaborate on your case a little?
If you need sort and shuffle (i.e. outputs of different reducer tasks of R1 need to be aggregated in some way), you have to write another map-red job. If you need to process only local reducer data (i.e. your reducer output key is the same as its input key), your job would be M1-R1-M2. Essentially, in Hadoop you can have only one sort and shuffle phase per job.
Note that the chain APIs are for jobs of the form (M+ R M*), i.e. one or more mappers, a single reducer, then zero or more mappers.
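For reference, with the old-API chain classes (org.apache.hadoop.mapred.lib.ChainMapper / ChainReducer) an (M+ R M*) job is wired up roughly as below, patterned on their Javadoc example; AMap, BMap, TheReduce and CMap are placeholders for your own Mapper/Reducer implementations, not classes shipped with Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(new Configuration(), ChainJobDriver.class);
    conf.setJobName("chain (M+ R M*) example");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The "M+" part: one or more chained map steps before the shuffle.
    ChainMapper.addMapper(conf, AMap.class,
        LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(conf, BMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    // The "R" part: exactly one reduce step per job.
    ChainReducer.setReducer(conf, TheReduce.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    // The "M*" part: zero or more map steps after the reduce. There is no way to
    // append a second reduce here, which is why M1-R1-R2 needs a second job.
    ChainReducer.addMapper(conf, CMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    JobClient.runJob(conf);
  }
}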

Amogh


On 1/20/10 2:29 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:
These two classes are not really symmetric as the name suggests.
ChainMapper does what I expected: chains multiple map steps. But
ChainReducer does not chain reduce steps. It chains map steps to
follow a reduce step. At least, that is my understanding given the API
docs & examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that
needs an M1-R1-R2. It currently has 2 phases: M1-R1 followed by M2-R2,
where M2 is an identity pass-through mapper. If there were a way to
chain 2 reduce steps the way ChainedMapper chains map steps, I could
make this into a one-pass job, eliminating the overhead of a second job
and all the unnecessary I/O.
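For what it's worth, the pass-through M2 does not have to be hand-written: the old (mapred) API ships a stock identity mapper, and in the new (mapreduce) API the base Mapper class already passes (key, value) through unchanged. A fragment, where conf2 / job2 stand for whatever the second job's JobConf / Job objects are called in the actual driver:

// Old (mapred) API: use the stock identity mapper for the M2 pass-through step.
conf2.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

// New (mapreduce) API: the base Mapper class is already an identity pass-through.
job2.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);

Either way the second job still pays the job startup, shuffle and HDFS round-trip costs, which is the overhead in question.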

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile


