hadoop-common-user mailing list archives

From Rahul Jain <rja...@gmail.com>
Subject Re: Can MapReduce run simultaneous producer/consumer processes?
Date Fri, 07 Jan 2011 00:42:22 GMT
If the producer and consumer stages don't need a sort between them, take a
look at ChainMapper:


If you do want the second stage to run after sorting, take a look at:


More esoteric cases will require separate MapReduce jobs, at least
with the current Hadoop framework.
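
To make the ChainMapper idea concrete, here is a rough plain-Java sketch of what
chaining two map stages without an intervening sort amounts to: each record
emitted by stage A is handed straight to stage B, so B starts consuming before A
has finished its input. This is not Hadoop code. Stage, TOKENIZE, UPPER, chain
and runChain are hypothetical names made up for illustration, and the real
ChainMapper API looks different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

class ChainSketch {
    /** Hypothetical stand-in for a map stage: takes one record, emits zero or more. */
    interface Stage<I, O> {
        void map(I record, Consumer<O> emit);
    }

    /** Stage A (the producer): split a line into whitespace-separated tokens. */
    static final Stage<String, String> TOKENIZE = (line, emit) -> {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emit.accept(token);
            }
        }
    };

    /** Stage B (the consumer): upper-case each token it receives. */
    static final Stage<String, String> UPPER =
            (token, emit) -> emit.accept(token.toUpperCase(Locale.ROOT));

    /** Compose two stages so every record emitted by the first feeds the second immediately. */
    static <I, M, O> Stage<I, O> chain(Stage<I, M> first, Stage<M, O> second) {
        return (record, emit) -> first.map(record, mid -> second.map(mid, emit));
    }

    /** Push each input line through the chained stages, collecting the final output. */
    static List<String> runChain(List<String> lines) {
        List<String> out = new ArrayList<>();
        Stage<String, String> pipeline = chain(TOKENIZE, UPPER);
        for (String line : lines) {
            pipeline.map(line, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("a b", "c")));
    }
}
```

Nothing is sorted or written out between the two stages here, which is why this
only covers the case where B can work on A's records one at a time.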

On Thu, Jan 6, 2011 at 2:27 PM, W.P. McNeill <billmcn@gmail.com> wrote:

> Say I have two MapReduce processes, A and B.  The two are algorithmically
> dissimilar, so they have to be implemented as separate MapReduce processes.
> The output of A is used as the input of B, so A has to run first.
> However, B doesn't need to take all of A's output as input, only a
> partition of it.
>  So in theory A and B could run at the same time in a producer/consumer
> arrangement, where B would start to work as soon as A had produced some
> output but before A had completed.  Obviously, this could be a big
> parallelization win.
> Is this possible in MapReduce?  I know at the most basic level it is
> not–there is no synchronization mechanism that allows the same HDFS
> directory to be used for both input and output–but is there some
> abstraction layer on top that allows it?  I've been digging around, and I
> think the answer is "No" but I want to be sure.
> More specifically, the only abstraction layer I'm aware of that chains
> together MapReduce processes is Cascade, and I think it requires the reduce
> steps to be serialized, but again I'm not sure because I've only read the
> documentation and haven't actually played with it.
