hadoop-mapreduce-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: avoiding data redistribution in iterative mapreduce
Date Wed, 03 Feb 2010 14:59:14 GMT
If each of your sequential iterations is a map+reduce job, then no.
The lifetime of a split is confined to a single MapReduce job. A split is actually a reference
to the data, used to schedule tasks as close to that data as possible. The record reader then
uses the same object to produce the <k,v> pairs in the split.
What you need, I believe, is to "just run on whatever the map node already has". If you are
using an exclusive private cluster, you could probably localize the <k,v> pairs from the first
iteration and feed dummy input data in later rounds (to ensure the same number of map tasks
as in the first round, using custom MapRunner and RecordReader classes that do not read the
supplied input). But on a shared cluster, how can you ensure that you always get the same
nodes to run your MapReduce job?
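The dummy-input trick described above can be sketched roughly as follows. This is a toy illustration in plain Java with no Hadoop dependency, not real Hadoop API: `DummyRecordReader`, `IterativeMapper`, and `localCache` are all hypothetical names. The idea is that the reader hands each map task a single placeholder record, so the map function fires once per task and processes data cached locally from the first iteration instead of the supplied input split.

```java
import java.util.*;

/** Toy reader that serves exactly one placeholder record, then ends. */
class DummyRecordReader {
    private boolean served = false;

    boolean next(StringBuilder key, StringBuilder value) {
        if (served) return false;
        key.append("dummy");
        value.append("");       // real payload comes from the local cache instead
        served = true;
        return true;
    }
}

class IterativeMapper {
    // Stands in for the <k,v> data localized on this node after iteration 1.
    static List<String> localCache = Arrays.asList("rec-1", "rec-2", "rec-3");

    static List<String> map() {
        List<String> out = new ArrayList<>();
        DummyRecordReader reader = new DummyRecordReader();
        StringBuilder k = new StringBuilder(), v = new StringBuilder();
        while (reader.next(k, v)) {
            // Ignore the dummy record; process the node-local cached data instead.
            for (String rec : localCache) {
                out.add(rec.toUpperCase());
            }
        }
        return out;
    }
}
```

In a real job the dummy input would only control how many map tasks are launched; each task's actual work would come from data the previous round left on that node, which is exactly why this scheme breaks down when the scheduler places tasks on different nodes.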
Please correct me if I misunderstood your question.


On 2/3/10 11:34 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:

Hi all,

      I need to run a MapReduce task repeatedly in order to achieve the desired result. Is it
possible to avoid redistributing the data set (dividing it into chunks and assigning them to
nodes) at the beginning of each iteration, i.e., once the distribution happens the first
time, each map node should work on the same chunk in every subsequent iteration. Can this be
done? I have only brief experience with MapReduce, and my understanding is that the input
data set is redistributed every time.

Thank you.

