hadoop-mapreduce-user mailing list archives

From fan wei fang <eagleeye8...@gmail.com>
Subject Re: Location reduce task running.
Date Thu, 27 Aug 2009 03:23:49 GMT
Hi Amogh,
Thank you for your constructive suggestion.
The problem is that the data behind these filters is very large, and performance
would likely be poor if it were moved around.
I am thinking of another workaround.

Instead of starting a new reduce task for each incoming email, I will let the
reduce task live as long as possible and use the locally cached filters to
process several emails.
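
To make the idea concrete, here is a rough sketch of such a reducer (new
mapreduce API; SpamFilterSet and its loadFromLocalCache() are placeholders of
mine, not real classes):

  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class FilterReducer extends Reducer<Text, Text, Text, Text> {

    // Loaded once per task and reused for every mailbox this task sees.
    private SpamFilterSet filters;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // Load only the filters this reducer is responsible for, from the
      // node's local cache, instead of shipping them with every email.
      filters = SpamFilterSet.loadFromLocalCache(context.getConfiguration());
    }

    @Override
    protected void reduce(Text mailbox, Iterable<Text> mails, Context context)
        throws IOException, InterruptedException {
      for (Text mail : mails) {
        // Apply the per-user filter; keep only the mails classified as ham.
        if (!filters.isSpam(mailbox.toString(), mail.toString())) {
          context.write(mailbox, mail);
        }
      }
    }
  }

The longer such a task stays alive, the more emails get processed against the
same cached filters.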
But there's another problem with this approach. As far as I have read, reduce
tasks won't start until all map tasks finish. Hadoop seems to work in a
batch-processing fashion: it gathers a large amount of data and processes it at
once. What I want is closer to stream processing, where each map output (k,v)
is immediately transferred to and processed by a reduce node. In other words,
the reduce node starts right after it receives the first intermediate (k,v) and
stays alive waiting for subsequent (k,v) pairs.

Is there any way to force Hadoop to work in this stream-processing fashion?
One option would be to modify the Hadoop code at the shuffle stage, but I think
that should be the last resort.

Regards.
Frank.


On Mon, Aug 24, 2009 at 5:22 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>  >> In order to achieve efficiency, I don't want these pieces of spam
> filters moving around the nodes in the cluster.
>
> If you are flexible on this, you can pass both mails and config data to
> mappers, do the common processing for mails, transform the K,V pair for each
> user/mailbox, and use a custom partitioner and comparator to send the
> user-specific mails and filters to a single reducer, processing them as needed.
> If the size of the config file is << the mail sizes (maybe naïve, but should
> hold good), it's not much of an inefficiency. This **should** be better than
> 2 mapred jobs, where you would be writing to HDFS twice.
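>
> A minimal sketch of that partitioner (just an illustration, assuming the map
> output key starts with the mailbox id as its first tab-separated field):
>
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.Partitioner;
>
>   public class MailboxPartitioner implements Partitioner<Text, Text> {
>
>     public void configure(JobConf job) {}
>
>     public int getPartition(Text key, Text value, int numPartitions) {
>       // Route every record for one mailbox -- mails and filter entries
>       // alike -- to the same reducer, so each filter meets its mails.
>       String mailbox = key.toString().split("\t", 2)[0];
>       return (mailbox.hashCode() & Integer.MAX_VALUE) % numPartitions;
>     }
>   }
>
> Set it on the job with conf.setPartitionerClass(MailboxPartitioner.class).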
>
> Hope this helps, just the first thing that came to my mind.
>
>
>
> Thanks,
>
> Amogh
>
>
>  ------------------------------
>
> *From:* fan wei fang [mailto:eagleeye83dp@gmail.com]
> *Sent:* Monday, August 24, 2009 12:03 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: Location reduce task running.
>
>
>
> Hi Amogh,
>
> I appreciate your quick response.
> Please correct me if I'm wrong: if the workload of the reducers is transferred
> to the combiners, does that mean every map node must hold a copy of my config
> data? If so, that is completely unacceptable for my app.
>
> Let me further explain the situation.
> I am trying to build an anti-spam system using MapReduce. In this system,
> users are allowed to have their own spam filters. The whole set of these
> filters is so large that it shouldn't be placed on any single node, so I have
> to split it across nodes. Each node will be responsible for only a small
> number of mailboxes.
> In order to achieve efficiency, I don't want these pieces of spam filters
> moving around the nodes in the cluster.
>
> This is the data flow of my app.
>
> Mails ---> Map (common processing for emails) ---> Reduce (user-specific
> processing) ---> Store mails in their designated mailboxes.
>
> Do you have any suggestions? I am thinking about Hadoop's JVM re-use feature,
> or I could set up a chain of two map-reduce jobs.
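>
> For the JVM re-use idea, this is roughly what I have in mind (MyAntiSpamJob is
> just a placeholder; if I read the docs right, the setter corresponds to
> mapred.job.reuse.jvm.num.tasks, and -1 means unlimited re-use):
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   JobConf conf = new JobConf(MyAntiSpamJob.class);
>   // Let one JVM run many tasks of this job on a node, so anything the
>   // task caches (e.g. in static fields) can survive across tasks.
>   conf.setNumTasksToExecutePerJvm(-1);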
>
> Best regards.
> Fang.
>
>
>  On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar <amogh@yahoo-inc.com>
> wrote:
>
> No, but if you want “reducer-like” functionality on the same node, have a
> look at combiners. To get the exact functionality you might need to tweak a
> little w.r.t. buffers, flushing, etc.
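>
> For reference, a combiner is wired in with one extra line on the job; just
> keep in mind it runs on the map side over map output, so MailCombiner (a name
> of mine) must read and write the map output types:
>
>   conf.setMapOutputKeyClass(Text.class);
>   conf.setMapOutputValueClass(Text.class);
>   conf.setCombinerClass(MailCombiner.class);  // implements Reducer<Text, Text, Text, Text>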
>
>
>
> Cheers!
>
> Amogh
>
>
>  ------------------------------
>
> *From:* fan wei fang [mailto:eagleeye83dp@gmail.com]
> *Sent:* Monday, August 24, 2009 9:17 AM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Location reduce task running.
>
>
>
> Hello guys,
>
> I am a newbie to Hadoop and am doing an experiment with it.
> My situation is:
>  + My job is expected to run continuously/frequently.
>  + My reduce tasks require a large amount of configuration data. This config
> data is specific to the map output's key.
> --> That's why I want to avoid moving this config data around.
> As far as I have read, the nodes where reduce tasks are assigned are picked
> without consideration of data locality.
>
> My question is: is there any way to force the reduce tasks for a specific
> key to run on the same node?
>
> Thnx.
>
>
>
