From: fan wei fang <eagleeye83dp@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Thu, 27 Aug 2009 11:23:49 +0800
Subject: Re: Location reduce task running.

Hi Amogh,
Thank you for your constructive suggestion.
The problem is that the data behind these filters is huge, and performance would likely suffer badly if it were moved around.
I am thinking of another workaround.

Instead of starting a new reduce task for each incoming email, I will keep the reduce task alive as long as possible and use a locally cached filter to process several emails.
But there's another problem with this method. As far as I have read, reduce tasks won't start until all map tasks finish. I think Hadoop works in a batch-processing fashion: it gathers a large amount of data and processes it all at once. What I want is more of a stream-processing fashion, where each map output (k,v) pair is immediately transferred to and processed by a reduce node. In other words, the reduce node starts right after it receives the first intermediate (k,v) pair and stays alive waiting for subsequent (k,v) pairs.
Is there any way to force Hadoop to work in this stream-processing fashion?
One way I can think of is to modify the Hadoop code at the shuffling stage, but that should be the last resort.
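
The closest knob I have found so far is the reduce "slowstart" setting, but as far as I can tell it only launches reduce tasks early so they can begin copying map output; reduce() itself still waits for every map to finish. A minimal sketch, assuming the 0.20 property name:

    import org.apache.hadoop.mapred.JobConf;

    public class EarlyShuffleConfig {
      public static JobConf withEarlyShuffle(JobConf conf) {
        // Launch reduce tasks once 5% of the maps have finished
        // (0.05 is the default; lower it to start shuffling sooner).
        // Note: this starts the copy phase early, not reduce() itself.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
        return conf;
      }
    }

So it helps with copy latency, but it is not true streaming.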

Regards.<= br> Frank.


On Mon, Aug 24, 2009 at 5:22 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>> In order to achieve efficiency, I don't want these pieces of spam filters moving around the nodes in the cluster.

If you are flexible on this, you can pass both mails and config data to mappers, do the common processing for mails, transform the (K,V) pair for each user/mailbox, and use a custom partitioner and comparator to pass user-specific mails and filters to a single reducer and process them as needed. If the config file is much smaller than the mail data (maybe naïve, but it should hold good), it's not much of an inefficiency. This *should* be better than two MapReduce jobs, where you would be writing twice to HDFS.
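
A rough sketch of such a partitioner against the 0.20 old API (the composite "userId#type" key, with type marking a record as a mail or a filter, is only an assumption for illustration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Route records by userId only, so one user's mails and filters
    // always land on the same reducer regardless of the type suffix.
    public class UserPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf job) {}   // nothing to configure

      public int getPartition(Text key, Text value, int numPartitions) {
        String userId = key.toString().split("#", 2)[0];
        return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

A comparator that sorts on the full key but groups on the userId prefix would then hand each reducer a user's filters ahead of that user's mails.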

Hope this helps, just the first thing that came to my mind.


Thanks,

Amogh



From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 12:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Location reduce task running.


Hi Amogh,

I appreciate your quick response.
Please correct me if I'm wrong: if the workload of the reducers is transferred to combiners, does that mean every map node must hold a copy of my config data? If that is the case, it is completely unacceptable for my app.

Let me further explain the situation.
I am trying to build an anti-spam system using MapReduce. In this system, users are allowed to have their own spam filters. The whole set of these filters is so huge that it shouldn't be put on any single node. Therefore, I have to split the filters across nodes, with each node responsible for only a small number of email boxes.
In order to achieve efficiency, I don't want these pieces of spam filters moving around the nodes in the cluster.

This is the data flow of my app.

Mails ---> Map (do common processing for emails) ---> Reduce (do user-specific processing) ---> Store mails to designated boxes.
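
To make the map step concrete, here is roughly what I have in mind (just a sketch; MailMapper and extractRecipient are hypothetical placeholders for the real header parsing):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Common processing: normalize each raw mail and emit it keyed by
    // recipient, so user-specific reduce work can be routed per user.
    public class MailMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text rawMail,
          OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        String mail = rawMail.toString().trim();  // stand-in normalization
        String userId = extractRecipient(mail);   // hypothetical helper
        out.collect(new Text(userId), new Text(mail));
      }

      private String extractRecipient(String mail) {
        // Placeholder: real code would parse the To: header properly.
        return mail.split("\\s+", 2)[0];
      }
    }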

Do you have any suggestions? I am thinking about the JVM reuse feature of Hadoop, or I could set up a chain of two MapReduce jobs.
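
If I understand the 0.20 API correctly, JVM reuse is enabled per job roughly like this (a sketch, not tested):

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseConfig {
      public static JobConf withJvmReuse(JobConf conf) {
        // -1 = reuse each task JVM for an unlimited number of this
        // job's tasks, so per-task setup (e.g. loading cached filters
        // into memory) is paid once per node instead of once per task.
        conf.setNumTasksToExecutePerJvm(-1);
        // Equivalently: conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        return conf;
      }
    }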

Best regards.
Fang.


On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

No, but if you want "reducer-like" functionality on the same node, have a look at combiners. To get the exact functionality you might need to tweak a little with respect to buffers, flushes, etc.
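
For example, something along these lines (a minimal old-API sketch; MailCombiner and the merge logic are purely illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Pre-merge a key's values map-side so less data crosses the
    // network; only safe if the reduce step tolerates pre-merged input.
    public class MailCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> out, Reporter rep) throws IOException {
        StringBuilder merged = new StringBuilder();
        while (values.hasNext()) {
          if (merged.length() > 0) merged.append('\n');
          merged.append(values.next().toString());
        }
        out.collect(key, new Text(merged.toString()));
      }
    }

Wire it in with conf.setCombinerClass(MailCombiner.class). Note that Hadoop may run a combiner zero or more times per key during spills, which is where the buffer/flush tweaking comes in.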


Cheers!

Amogh



From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 9:17 AM
To: mapreduce-user@hadoop.apache.org
Subject: Location reduce task running.


Hello guys,

I am a Hadoop newbie and am running an experiment with it.
My situation is:
 + My job is expected to run continuously/frequently.
 + My reduce tasks require a large amount of configuration data. This config data is specific to the map output's key.
 --> That's why I want to avoid moving this config data around.
As far as I have read, the nodes where reduce tasks are assigned are picked without consideration of data locality.

My question is: is there any way to force the reduce tasks for a specific key to run on the same node?

Thnx.


