hadoop-mapreduce-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject RE: Location reduce task running.
Date Mon, 24 Aug 2009 09:22:45 GMT
>> In order to achieve efficiency, I don't want these pieces of the spam filters moving around the nodes in the cluster.
If you are flexible on this, you can pass both mails and config data to the mappers, do the common processing for mails, transform the (K,V) pair for each user/mailbox, and use a custom partitioner and comparator to deliver the user-specific mails and filters to a single reducer, which processes them as needed.
If the config file is much smaller than the mails (maybe a naïve assumption, but it should hold good), it's not much of an inefficiency. This *should* be better than two MapReduce jobs, where you would be writing to HDFS twice.
Hope this helps; it's just the first thing that came to mind.

Thanks,
Amogh

________________________________
From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 12:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Location reduce task running.

Hi Amogh,

I appreciate your quick response.
Please correct me if I'm wrong: if the workload of the reducers is transferred to combiners, does that mean every map node must hold a copy of my config data? If so, that is completely unacceptable for my app.

Let me further explain the situation for you.
I am trying to build an anti-spam system using MapReduce. In this system, users are allowed to have their own spam filters. The whole set of these filters is so huge that it can't be put on any single node, so I have to split it across nodes, with each node responsible for only a small number of email boxes.
In order to achieve efficiency, I don't want these pieces of the spam filters moving around the nodes in the cluster.

This is the data flow of my app.

Mails ---> Map (common processing for emails) ---> Reduce (user-specific processing) ---> Store mails in the designated boxes.
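In Hadoop's Java API (0.20-style), I imagine that flow wired up roughly like this (all class names and paths below are placeholders of mine, not working code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SpamFilterDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "spam-filter");
        job.setJarByClass(SpamFilterDriver.class);
        job.setMapperClass(MailMapper.class);     // common processing
        job.setReducerClass(UserReducer.class);   // user-specific filtering
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("mails/incoming"));
        FileOutputFormat.setOutputPath(job, new Path("mails/filtered"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}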

Do you have any suggestions? I am thinking about the JVM reuse feature of Hadoop, or I could set up a chain of two MapReduce jobs.
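For the JVM reuse idea, this is the sketch I have in mind with the old mapred API (the class name is made up; the setter corresponds to the mapred.job.reuse.jvm.num.tasks property):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseSetup {
    static JobConf configure() {
        JobConf conf = new JobConf(JvmReuseSetup.class);
        // -1 = unlimited tasks of this job per JVM on a node, so state
        // cached in static fields can survive across those tasks.
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}

Though as far as I understand, JVM reuse only keeps a JVM alive for tasks of the same job on the same node; it doesn't control which node a reduce task is scheduled on, so by itself it wouldn't pin a user's filters to one node.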

Best regards.
Fang.


On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

No, but if you want reducer-like functionality on the same node, have a look at combiners. To get the exact functionality you might need to tweak things a little with respect to buffers, flushes, etc.
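As a rough illustration (the class and types here are placeholders, e.g. counting mails per user), a combiner is just a Reducer whose output types match the map output types, registered via job.setCombinerClass(...):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner: pre-aggregates per-user counts on the map node
// before the shuffle.
public class MailCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text user, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(user, new IntWritable(sum));
    }
}

Keep in mind the framework may run a combiner zero, one, or many times per map, so it only suits associative, commutative operations; that's part of the buffer/flush tweaking caveat above.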



Cheers!

Amogh



________________________________

From: fan wei fang [mailto:eagleeye83dp@gmail.com]
Sent: Monday, August 24, 2009 9:17 AM
To: mapreduce-user@hadoop.apache.org
Subject: Location reduce task running.



Hello guys,

I am a Hadoop newbie and am doing an experiment with it.
My situation is:
 + My job is expected to run continuously/frequently.
 + My reduce tasks require a large amount of configuration data, and this config data is specific to the map output key.
--> That's why I want to avoid moving this config data around.
As far as I have read, the nodes where reduce tasks run are picked without regard to data locality.

My question is: is there any way to force the reduce tasks for a specific key to run on the same node?

Thnx.

