hadoop-mapreduce-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: avoiding data redistribution in iterative mapreduce
Date Wed, 03 Feb 2010 19:36:38 GMT
Hi Amogh,

       Thank you for the reply.

>>> What you need, I believe, is “just run on whatever map has”.
            You got that right :). An example of a sequential program would be
Bubble Sort, which needs several iterations to reach the end result, and in
each iteration it works on the previous output (the partially sorted list)
rather than on the initial input. The same thing should happen in my case.
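
For reference, the usual way to chain such passes (which does re-split the
data every time) would look roughly like the driver fragment below; the paths,
the MyJob class, and the checkTermination() helper are placeholders, so treat
this as a sketch rather than working code:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Inside the driver's main(): each pass reads the previous pass's
    // output, the way bubble sort re-reads the partially sorted list.
    Path input = new Path("/user/raghava/initial");   // assumed path
    int pass = 0;
    boolean done = false;
    while (!done) {
        JobConf jobConf = new JobConf(MyJob.class);   // MyJob: placeholder
        Path output = new Path("/user/raghava/pass-" + pass);
        FileInputFormat.setInputPaths(jobConf, input);
        FileOutputFormat.setOutputPath(jobConf, output);
        JobClient.runJob(jobConf);          // blocks until the pass finishes
        input = output;          // next pass consumes this pass's output
        done = checkTermination();          // hypothetical helper
        pass++;
    }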

>>> If you are using an exclusive private cluster, you can probably localize
>>> <k,v> from first iteration and use dummy input data (to ensure same
>>> number of mapper tasks as first round, and use custom classes of
>>> MapRunner, RecordReader to not read data from supplied input)

          Yes, it would be a local cluster, the one at my university. If we
set the number of map tasks, would that not be honored in each iteration? As
mentioned in the documentation, I think I need to use JobClient to control
the iterations.
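
For what it's worth, JobConf.setNumMapTasks() is only a hint to the framework;
the actual number of map tasks comes from the InputFormat's getSplits(). So a
custom InputFormat along the lines Amogh describes gives firmer control. A
rough, untested sketch of the dummy-input idea with the old
org.apache.hadoop.mapred API might look like:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.*;

    // Sketch: fabricate a fixed number of empty splits so the framework
    // launches the same number of mappers every iteration, without reading
    // any real input. Mappers would load their localized data themselves.
    public class DummyInputFormat implements InputFormat<NullWritable, NullWritable> {

      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        InputSplit[] splits = new InputSplit[numSplits];
        for (int i = 0; i < numSplits; i++) {
          splits[i] = new DummySplit();
        }
        return splits;
      }

      public RecordReader<NullWritable, NullWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new RecordReader<NullWritable, NullWritable>() {
          private boolean used = false;
          public boolean next(NullWritable key, NullWritable value) {
            if (used) return false;
            used = true;          // emit exactly one dummy record per mapper
            return true;
          }
          public NullWritable createKey()   { return NullWritable.get(); }
          public NullWritable createValue() { return NullWritable.get(); }
          public long getPos()   { return 0; }
          public void close()    { }
          public float getProgress() { return used ? 1.0f : 0.0f; }
        };
      }

      // An empty split carrying no data and no location hints.
      public static class DummySplit implements InputSplit {
        public long getLength()        { return 0; }
        public String[] getLocations() { return new String[0]; }
        public void write(DataOutput out) throws IOException { }
        public void readFields(DataInput in) throws IOException { }
      }
    }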


>>> But how can you ensure that you get the same nodes always to run your
>>> map reduce job on a shared cluster?

           while (!done) {
               JobClient.runJob(jobConf);
               // ... check termination condition and set done ...
           }

If I write something like that in the driver, would the map tasks not run on
the same data chunks each time? Or will the Master re-assign the Map and
Reduce nodes on every iteration?
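
To make the termination-check step concrete, one hypothetical approach is job
counters: each reduce task would call
reporter.incrCounter(IterationCounter.UPDATES, 1) whenever it changes
something, and the driver stops when an iteration reports zero updates. The
IterationCounter enum and the loop below are a sketch, not tested code (in
practice jobConf would also be rebuilt each pass with a fresh output path):

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    // Hypothetical counter, incremented by tasks whenever anything changes.
    enum IterationCounter { UPDATES }

    boolean done = false;
    while (!done) {
        RunningJob job = JobClient.runJob(jobConf); // blocks until completion
        Counters counters = job.getCounters();
        // stop once a full iteration makes no updates
        done = (counters.getCounter(IterationCounter.UPDATES) == 0);
    }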


Regards,
Raghava.

On Wed, Feb 3, 2010 at 9:59 AM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>  Hi,
> If each of your sequential iterations is a map+reduce job, then no.
> The lifetime of a split is confined to a single map reduce job. The split
> is actually a reference to the data, which is used to schedule the job as
> close to the data as possible. The record reader then uses the same object
> to pass the <k,v> pairs in the split.
> What you need, I believe, is “just run on whatever map has”. If you are
> using an exclusive private cluster, you can probably localize <k,v> from the
> first iteration and use dummy input data (to ensure the same number of mapper
> tasks as the first round, and use custom classes of MapRunner, RecordReader
> to not read data from the supplied input). But how can you ensure that you
> always get the same nodes to run your map reduce job on a shared cluster?
> Please correct me if I misunderstood your question.
>
> Amogh
>
>
>
> On 2/3/10 11:34 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
>
> Hi all,
>
>       I need to run a map reduce task repeatedly in order to achieve the
> desired result. Is it possible that at the beginning of each iteration, the
> data set is not distributed (divided into chunks and distributed) again,
> i.e. once the distribution occurs for the first time, the map nodes work on
> the same chunks in every iteration? Can this be done? I only have brief
> experience with MapReduce, and I think that the input data set is
> redistributed every time.
>
> Thank you.
>
> Regards,
> Raghava.
>
>
