hadoop-mapreduce-user mailing list archives

From Stanley Xu <wenhao...@gmail.com>
Subject Re: Is there any way I could use to reduce the cost of Mapper and Reducer setup and cleanup in an iterative MapReduce chain?
Date Thu, 05 May 2011 15:29:33 GMT
Thanks a lot, Ted. Checking HaLoop and Plume now. I can always get the
answer from you. :-)

On Thu, May 5, 2011 at 10:42 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Stanley,
>
> The short answer is that this is a real problem.
>
> Try this:
>
> *Spark: Cluster Computing with Working Sets.* Matei Zaharia, Mosharaf
> Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, in HotCloud 2010,
> June 2010.
>
> Or this http://www.iterativemapreduce.org/
>
> http://code.google.com/p/haloop/
>
> You may be interested in experimenting with MapReduce 2.0.  That allows
> more flexibility in the execution model:
>
>
> http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/
>
> Systems like FlumeJava (and my open source, incomplete clone Plume) may
> help with flexibility:
>
>
> http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-xtUvap2t1I
>
>
> https://github.com/tdunning/Plume/commit/a5a10feaa068b33b1d929c332e4614aba50dd39a
>
>
> On Thu, May 5, 2011 at 2:16 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
>> Dear All,
>>
>> Our team is trying to implement a parallelized LDA with Gibbs Sampling. We
>> are using the algorithm mentioned by plda, http://code.google.com/p/plda/
>>
>> The problem is that, with the MapReduce method the paper describes, we
>> need to run a MapReduce job for every Gibbs sampling iteration, and in
>> our tests it normally takes 1000 - 2000 iterations on our data to
>> converge. But as we know, there is a cost to creating and cleaning up the
>> mapper/reducer in every iteration. It is about 40 seconds on our cluster
>> per our tests, so 1000 iterations means almost 12 hours.
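The overhead figure quoted here checks out as a quick back-of-the-envelope calculation (a plain-Python sketch using only the numbers from this message):

```python
# Back-of-the-envelope cost of per-iteration job setup/cleanup, using the
# figures quoted in this thread; nothing else here is measured.
setup_cost_s = 40        # observed setup + cleanup per MapReduce job
iterations = 1000        # lower bound of the quoted 1000-2000 range

overhead_s = setup_cost_s * iterations
overhead_h = overhead_s / 3600.0
print(f"{overhead_h:.1f} hours of pure job-launch overhead")  # ~11.1 hours
```

So even before any actual sampling work, job launches alone account for most of the "almost 12 hours".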
>>
>> I am wondering if there is a way to reduce the cost of mapper/reducer
>> setup/cleanup, since I would prefer to have every mapper read its local
>> data once and update it in place within the mapper process. The only
>> other update it needs comes from the reducer, which is a pretty small
>> amount of data compared to the whole dataset.
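The access pattern described here can be sketched outside Hadoop. The following is a minimal plain-Python simulation (the shard data, topic-draw rule, and all names are illustrative stand-ins, not plda's actual code) of workers that keep their shard in memory across iterations and exchange only small count deltas:

```python
# Minimal sketch of the pattern Stanley describes: each worker holds its
# document shard and topic assignments in memory for the whole run, and only
# a small per-topic count delta travels between iterations -- the piece a
# reducer would aggregate.
import random

random.seed(0)
NUM_TOPICS = 4

def local_gibbs_pass(shard, global_counts):
    """One sampling pass over a worker's in-memory shard.

    Returns this worker's delta to the global topic counts. In real LDA the
    new topic would be drawn from a conditional distribution built from
    global_counts; a uniform draw stands in for it here.
    """
    delta = [0] * NUM_TOPICS
    for doc in shard:
        for i, topic in enumerate(doc):
            new_topic = random.randrange(NUM_TOPICS)  # stand-in draw
            delta[topic] -= 1
            delta[new_topic] += 1
            doc[i] = new_topic  # shard state persists across iterations
    return delta

# Two workers, each holding its shard for the whole run -- no reload per iteration.
shards = [[[0, 1], [2, 3]], [[1, 1], [3, 0]]]
global_counts = [2, 3, 1, 2]  # initial topic totals across both shards

for _ in range(5):  # iterations
    deltas = [local_gibbs_pass(s, global_counts) for s in shards]  # "map"
    for d in deltas:                                               # "reduce"
        global_counts = [g + x for g, x in zip(global_counts, d)]

assert sum(global_counts) == 8  # total token count is conserved
```

Only the deltas (a handful of integers per worker) cross iteration boundaries, which is exactly why restarting mapper processes each iteration is pure waste for this workload.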
>>
>> Is there any approach I could try (including changing part of Hadoop's
>> source code)?
>>
>>
>> Best wishes,
>> Stanley Xu
>>
>
>
