From: fan wei fang <eagleeye83dp@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Thu, 27 Aug 2009 11:23:49 +0800
Subject: Re: Location reduce task running.

Hi Amogh,
Thank you for your constructive suggestion.
The problem is that the data behind these filters is huge, and performance would likely suffer badly if it were moved around.
I am thinking of another workaround.

Instead of starting a new reduce task for each incoming email, I will keep the reduce task alive as long as possible and use a locally cached filter to process several emails.
But there's another problem with this method. As far as I have read, reduce tasks won't start until all map tasks finish. I think Hadoop works in a batch-processing fashion: it gathers a large amount of data and processes it all at once. What I want is more of a stream-processing fashion, where each map output (k,v) pair is immediately transferred to and processed by a reduce node. In other words, the reduce node starts right after it receives the first intermediate (k,v) pair and stays alive waiting for subsequent (k,v) pairs.
Is there any way to force Hadoop to work in this stream-processing fashion?
One way I can think of is to modify the Hadoop code at the shuffling stage, but that should be the last resort.
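
The closest knob I have found so far is the reduce "slowstart" setting, but as far as I can tell it only launches reduce tasks early so they can begin copying map output; reduce() itself still waits for every map to finish. A minimal sketch, assuming the 0.20 property name:

    import org.apache.hadoop.mapred.JobConf;

    public class EarlyShuffleConfig {
      public static JobConf withEarlyShuffle(JobConf conf) {
        // Launch reduce tasks once 5% of the maps have finished
        // (0.05 is the default; lower it to start shuffling sooner).
        // Note: this starts the copy phase early, not reduce() itself.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
        return conf;
      }
    }

So it helps with copy latency, but it is not true streaming.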

Regards.<= br> Frank.


On Mon, Aug 24, 2009 at 5:22 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>> In order to achieve efficiency, I don't want these pieces of spam filters moving around the nodes in the cluster.

If you are flexible on this, you can pass both mails and config data to mappers, do the common processing for mails, transform the (K,V) pair for each user/mailbox, and use a custom partitioner and comparator to pass user-specific mails and filters to a single reducer and process them as needed. If the config file is much smaller than the mail data (maybe naïve, but it should hold good), it's not much of an inefficiency. This *should* be better than two MapReduce jobs, where you would be writing twice to HDFS.
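
A rough sketch of such a partitioner against the 0.20 old API (the composite "userId#type" key, with type marking a record as a mail or a filter, is only an assumption for illustration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Route records by userId only, so one user's mails and filters
    // always land on the same reducer regardless of the type suffix.
    public class UserPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf job) {}   // nothing to configure

      public int getPartition(Text key, Text value, int numPartitions) {
        String userId = key.toString().split("#", 2)[0];
        return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

A comparator that sorts on the full key but groups on the userId prefix would then hand each reducer a user's filters ahead of that user's mails.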

Hope this helps, just the first thing that came to my mind.


Thanks,

Amogh



From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 12:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Location reduce task running.


Hi Amogh,

I appreciate your quick response.
Please correct me if I'm wrong: if the workload of the reducers is transferred to combiners, does that mean every map node must hold a copy of my config data? If that is the case, it is completely unacceptable for my app.

Let me further explain the situation.
I am trying to build an anti-spam system using MapReduce. In this system, users are allowed to have their own spam filters. The whole set of these filters is so huge that it shouldn't be put on any single node. Therefore, I have to split the filters across nodes, with each node responsible for only a small number of email boxes.
In order to achieve efficiency, I don't want these pieces of spam filters moving around the nodes in the cluster.

This is the data flow of my app.

Mails ---> Map (do common processing for emails) ---> Reduce (do user-specific processing) ---> Store mails to designated boxes.
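
To make the map step concrete, here is roughly what I have in mind (just a sketch; MailMapper and extractRecipient are hypothetical placeholders for the real header parsing):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Common processing: normalize each raw mail and emit it keyed by
    // recipient, so user-specific reduce work can be routed per user.
    public class MailMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text rawMail,
          OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        String mail = rawMail.toString().trim();  // stand-in normalization
        String userId = extractRecipient(mail);   // hypothetical helper
        out.collect(new Text(userId), new Text(mail));
      }

      private String extractRecipient(String mail) {
        // Placeholder: real code would parse the To: header properly.
        return mail.split("\\s+", 2)[0];
      }
    }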

Do you have any suggestions? I am thinking about the JVM reuse feature of Hadoop, or I could set up a chain of two MapReduce jobs.
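
If I understand the 0.20 API correctly, JVM reuse is enabled per job roughly like this (a sketch, not tested):

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseConfig {
      public static JobConf withJvmReuse(JobConf conf) {
        // -1 = reuse each task JVM for an unlimited number of this
        // job's tasks, so per-task setup (e.g. loading cached filters
        // into memory) is paid once per node instead of once per task.
        conf.setNumTasksToExecutePerJvm(-1);
        // Equivalently: conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        return conf;
      }
    }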

Best regards.
Fang.


On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

No, but if you want "reducer-like" functionality on the same node, have a look at combiners. To get the exact functionality you might need to tweak a little with respect to buffers, flushes, etc.
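
For example, something along these lines (a minimal old-API sketch; MailCombiner and the merge logic are purely illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Pre-merge a key's values map-side so less data crosses the
    // network; only safe if the reduce step tolerates pre-merged input.
    public class MailCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        StringBuilder merged = new StringBuilder();
        while (values.hasNext()) {
          if (merged.length() > 0) merged.append('\n');
          merged.append(values.next().toString());
        }
        out.collect(key, new Text(merged.toString()));
      }
    }

Wire it in with conf.setCombinerClass(MailCombiner.class). Note that Hadoop may run a combiner zero or more times per key during spills, which is where the buffer/flush tweaking comes in.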


Cheers!

Amogh



From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 9:17 AM
To: mapreduce-user@hadoop.apache.org
Subject: Location reduce task running.


Hello guys,

I am a Hadoop newbie and am running an experiment with it.
My situation is:
 + My job is expected to run continuously/frequently.
 + My reduce tasks require a large amount of configuration data. This config data is specific to the map output's key.
 --> That's why I want to avoid moving this config data around.
As far as I have read, the nodes where reduce tasks are assigned are picked without consideration of data locality.

My question is: is there any way to force the reduce tasks for a specific key to run on the same node?

Thnx.


