hadoop-mapreduce-user mailing list archives

From fan wei fang <eagleeye8...@gmail.com>
Subject Re: Location reduce task running.
Date Mon, 24 Aug 2009 06:33:20 GMT
Hi Amogh,

I appreciate your quick response.
Please correct me if I'm wrong. If the reducers' workload is moved to
combiners, does that mean every map node must hold a copy of my config
data? If so, that is completely unacceptable for my app.

Let me further explain the situation for you.
I am trying to build an anti-spam system using Map-Reduce. In this system,
users are allowed to have their own spam filters. The whole set of these
filters is so large that it shouldn't be put on any single node. Therefore,
I have to split them across nodes. Each node will be responsible for only a
small number of mailboxes.
To keep this efficient, I don't want these pieces of the filter set
moving around between nodes in the cluster.
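
To make the "each node owns a small number of mailboxes" idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `MailboxPartitioner` and the method `partitionFor` are hypothetical, and the mailbox addresses are made up. It applies the same hash-and-modulo rule that Hadoop's default HashPartitioner uses. Note that this only fixes which reduce partition a mailbox lands in; it does not by itself decide which physical node runs that partition.

```java
class MailboxPartitioner {
    // Hypothetical helper: maps a mailbox key to a reduce partition index.
    // Uses the hash-and-modulo rule of Hadoop's default HashPartitioner.
    public static int partitionFor(String mailbox, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (mailbox.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumed number of reduce tasks
        String[] mailboxes = {"alice@example.com", "bob@example.com", "alice@example.com"};
        for (String m : mailboxes) {
            // The same mailbox always maps to the same partition index.
            System.out.println(m + " -> partition " + partitionFor(m, numPartitions));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, every mail for one mailbox reaches the same reduce partition as long as the number of reduce tasks stays fixed.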

This is the data flow of my app.

Mails ---> Map (do common processing for emails) ---> Reduce (do
user-specific processing) ---> Store mails to designated boxes.
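
The data flow above can be pictured with a small in-memory sketch in plain Java. This is not Hadoop code: the class `MailFlowSketch`, the `Mail` type, and the "blocked word" filter are all assumptions made up for illustration, standing in for the common processing and the per-user spam filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class MailFlowSketch {
    // Hypothetical mail record: recipient mailbox and body text.
    static final class Mail {
        final String recipient;
        final String body;
        Mail(String recipient, String body) { this.recipient = recipient; this.body = body; }
    }

    // "Map" step: common processing for every mail (here: trim and lower-case).
    static Mail map(Mail in) {
        return new Mail(in.recipient, in.body.trim().toLowerCase(Locale.ROOT));
    }

    // "Shuffle" step: group mails by recipient so one mailbox is handled in one place.
    static Map<String, List<Mail>> shuffle(List<Mail> mapped) {
        Map<String, List<Mail>> grouped = new TreeMap<>();
        for (Mail m : mapped) {
            grouped.computeIfAbsent(m.recipient, k -> new ArrayList<>()).add(m);
        }
        return grouped;
    }

    // "Reduce" step: user-specific processing with that user's filter only.
    // A null filter means the user has no filter and keeps every mail.
    static List<String> reduce(List<Mail> mails, String blockedWord) {
        List<String> kept = new ArrayList<>();
        for (Mail m : mails) {
            if (blockedWord == null || !m.body.contains(blockedWord)) {
                kept.add(m.body);
            }
        }
        return kept;
    }

    // Runs the whole flow and returns the contents of each mailbox.
    static Map<String, List<String>> run(List<Mail> input, Map<String, String> filters) {
        List<Mail> mapped = new ArrayList<>();
        for (Mail m : input) {
            mapped.add(map(m));
        }
        Map<String, List<String>> boxes = new TreeMap<>();
        for (Map.Entry<String, List<Mail>> e : shuffle(mapped).entrySet()) {
            boxes.put(e.getKey(), reduce(e.getValue(), filters.get(e.getKey())));
        }
        return boxes;
    }

    public static void main(String[] args) {
        List<Mail> input = new ArrayList<>();
        input.add(new Mail("alice", "  Cheap PILLS today "));
        input.add(new Mail("bob", "Cheap pills today"));
        input.add(new Mail("alice", "Lunch on Friday?"));
        Map<String, String> filters = new TreeMap<>();
        filters.put("alice", "pills"); // only alice blocks this word
        System.out.println(run(input, filters));
    }
}
```

The point of the sketch is that the reduce step only ever looks up the filter for the key it is currently handling, which is why the filter data would ideally stay wherever that key's reduce work runs.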

Do you have any suggestions? I am thinking about using Hadoop's JVM reuse
feature, or setting up a chain of two map-reduce pairs.

Best regards.

On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>  No, but if you want a “reducer like” functionality on the same node, have
> a look at combiners. To get exact functionality you might need to tweak
> around a little wrt buffers, flush etc.
> Cheers!
> Amogh
>  ------------------------------
> *From:* fan wei fang [mailto:eagleeye83dp@gmail.com]
> *Sent:* Monday, August 24, 2009 9:17 AM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Location reduce task running.
> Hello guys,
> I am new to Hadoop and am running an experiment with it.
> My situation is:
>  +My job is expected to run continuously/frequently
>  +My reduce task requires a large amount of configuration data. This config
> data is specific to the map output's key.
> -->That's why, I want to avoid moving this config data around.
> As far as I read, nodes where reduce tasks are assigned are picked without
> consideration of data locality.
> My question is: Is there any way to force the reduce tasks for a specific
> key to run on the same node?
> Thnx.
