From: Amogh Vasekar
To: mapreduce-user@hadoop.apache.org
Date: Tue, 27 Jul 2010 17:04:27 +0530
Subject: Re: Pipelining Mappers and Reducers

Hi,
>>What would really be great for me is if I could have the Reducer start processing the map outputs as they are ready, and not after all Mappers finish

Check the property mapred.reduce.slowstart.completed.maps - it sets the fraction of map tasks that must complete before the framework schedules the reduce tasks, so the reducer can start fetching map outputs while the remaining maps are still running. (The reduce() calls themselves still begin only after all map output has arrived.)
>>I've read about chaining mappers, but to the best of my understanding the second line of Mappers will only start after the first ones finished. Am I correct?

Not exactly. With chained Mappers all the transformations run in one go inside the same map task, record by record, until the reduce barrier is reached - there is no second wave of map tasks waiting on the first.
>>Someone also hinted to me that I could write a Combiner that Hadoop might invoke on the Reducer's side when Mappers finish,

Combiners can be run on both the map side and the reduce side, as soon as the in-memory buffer fills up and spills (many configuration properties control this). They work well when your reduce operation is associative and commutative - a sum or a max, say - but not when it is something like an average, since an average of partial averages is not the overall average.
HTH,
Amogh

On 7/27/10 4:36 PM, "Shai Erera" <serera@gmail.com> wrote:

Hi

I have a scenario for which I'd like to write a MR job in which Mappers do = some work and eventually the output of all mappers need to be combined by a= single Reducer. Each Mapper outputs <key,value> that is distinct fro= m all other Mappers, meaning the Reducer.reduce() method always receives a = single element in the values argument of a specific key. Really - the Mappe= rs are independent of each others in their output.

What would really be great for me is if I could have the Reducer start proc= essing the map outputs as they are ready, and not after all Mappers finish.= For example, I'm processing a very large data set and the MR framework spa= wns hundreds of Mappers for the task. The output of all Mappers though is r= equired to be processed by 1 Reducer. It so happens to be that the Reducer = job is very heavy, compared to the Mappers, and while all Mappers finish in= about 7 minutes (total wall clock time), the Reducer takes ~30 minutes.
In my cluster I can run 96 Mappers in parallel, so I'm pretty sure that if = I could streamline the outputs of the Mappers to the Reducer, I could gain = some cycles back - I can easily limit the number of Mappers to say 95 and h= ave the Reducer constantly doing some job.

I've read about chaining mappers, but to the best of my understanding the s= econd line of Mappers will only start after the first ones finished. Am I c= orrect?

Someone also hinted to me that I could write a Combiner that Hadoop might i= nvoke on the Reducer's side when Mappers finish, if say the data of the Map= pers is very large and cannot be kept in RAM. I haven't tried it yet, so if= anyone can confirm this will indeed work, I'm willing to give it a try. Th= e output of the Mappers is very large, and therefore they already write it = directly to disk. So I'd like to avoid doing this serialization twice (once= when the Mapper works, and the second time when Hadoop will *flush* the Re= ducer's buffer - or whatever the right terminology is).

I apologize if this has been raised before - if it has, could you please po= int me at the relevant discussion/issue?

Shai
