hadoop-mapreduce-user mailing list archives

From Raghava Mutharaju <m.vijayaraghava@gmail.com>
Subject Re: avoiding data redistribution in iterative mapreduce
Date Tue, 09 Feb 2010 18:44:11 GMT
Hi,

    No problem, I am thankful that someone has replied to my question.

Known location -- could it be HDFS or some distributed key-value store?

Regards,
Raghava.

On Tue, Feb 9, 2010 at 12:40 AM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

>  Hi,
> AFAIK no. I'm not sure how much of a task it is to write a HOD-like
> scheduler, or if it's even feasible given the new architecture of a single
> managing JT talking directly to the TTs. Probably someone more familiar with
> the scheduler architecture can help you better.
> What I was trying to suggest with serialization was to write the initial
> mapper data to a known location and, instead of streaming from the split,
> ignore it and read from there.
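> Roughly what I had in mind, as a sketch only (the /iter path, the file
> naming, and the mapred.task.id lookup are my assumptions, not anything
> Hadoop prescribes): the first-round mapper mirrors its <k,v> pairs into a
> SequenceFile at a fixed HDFS location, so later rounds can read from there.
>
>     import java.io.IOException;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapred.*;
>
>     // Sketch: first-round mapper that also stashes its records at a
>     // well-known HDFS location for later iterations to replay.
>     public class FirstRoundMapper extends MapReduceBase
>         implements Mapper<LongWritable, Text, Text, Text> {
>
>       private SequenceFile.Writer sideWriter;
>
>       public void configure(JobConf job) {
>         try {
>           FileSystem fs = FileSystem.get(job);
>           // mapred.task.id gives a stable per-task name for the stash.
>           String task = job.get("mapred.task.id");
>           Path stash = new Path("/iter/input-" + task);  // invented layout
>           sideWriter = SequenceFile.createWriter(fs, job, stash,
>               Text.class, Text.class);
>         } catch (IOException e) {
>           throw new RuntimeException(e);
>         }
>       }
>
>       public void map(LongWritable key, Text value,
>           OutputCollector<Text, Text> out, Reporter reporter)
>           throws IOException {
>         Text k = new Text(value.toString());  // derive the real key here
>         sideWriter.append(k, value);          // mirror record to the stash
>         out.collect(k, value);                // normal first-round output
>       }
>
>       public void close() throws IOException {
>         sideWriter.close();
>       }
>     }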
> Sorry for the delayed response,
>
> Amogh
>
>
>
>
> On 2/4/10 2:01 PM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
>
> Hi,
>
>      So is it not possible to avoid redistribution in this case? If that is
> the case, can a custom scheduler be written -- and would that be an easy task?
>
> Regards,
> Raghava.
>
> On Thu, Feb 4, 2010 at 2:52 AM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>
> Hi,
>
> >>Will there be a re-assignment of Map & Reduce nodes by the Master?
> In general, using the available schedulers, I believe so. Because if it
> weren't, and I submitted job 2 needing a different or additional set of
> inputs, the data locality considerations would be somewhat hampered, right?
> When we had HOD, this was certainly possible.
>
> Amogh
>
>
>
> On 2/4/10 1:06 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
>
> Hi Amogh,
>
>        Thank you for the reply.
>
> >>> What you need, I believe, is “just run on whatever map has”.
>             You got that right :). An example of a sequential program would
> be bubble sort, which needs several iterations to reach the end result, and
> in each iteration it works on the previous output (a partially sorted list)
> rather than the initial input. The same thing should happen in my case.
>
> >>> If you are using an exclusive private cluster, you can probably
> localize <k,v> from first iteration and >>> use dummy input data ( to ensure
> same number of mapper tasks as first round, and use custom >>> classes of
> MapRunner, RecordReader to not read data from supplied input )
>
>           Yes, it would be a local cluster, the one at my university. If we
> set the number of map tasks, would that not be honored in each iteration? As
> mentioned in the documentation, I think I need to use JobClient to control
> the number of iterations.
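> (For reference, I am thinking of the call below, though as far as I
> understand it is only a hint to the framework; the InputFormat's splits
> decide the actual number of maps.)
>
>     jobConf.setNumMapTasks(42);  // 42 is just an arbitrary example value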
>
>
> >>> But how can you ensure that you get the same nodes always to run your
> map reduce job on a
> >>> shared cluster?
>
>            while (!done) {
>                JobClient.runJob(jobConf);
>                // <<Do something to check termination condition>>
>            }
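> Fleshed out a bit, I mean something like the driver below (the CHANGED
> counter is made up purely to illustrate the termination check; the reducer
> would increment it whenever an iteration still modifies a record):
>
>     import org.apache.hadoop.mapred.JobClient;
>     import org.apache.hadoop.mapred.JobConf;
>     import org.apache.hadoop.mapred.RunningJob;
>
>     public class IterativeDriver {
>       // Hypothetical counter, incremented by the reducer on every change.
>       public enum IterCounter { CHANGED }
>
>       public static void main(String[] args) throws Exception {
>         boolean done = false;
>         while (!done) {
>           JobConf jobConf = new JobConf(IterativeDriver.class);
>           // ... set input/output paths, mapper, reducer (elided) ...
>           RunningJob job = JobClient.runJob(jobConf);  // blocks to completion
>           long changed = job.getCounters().getCounter(IterCounter.CHANGED);
>           done = (changed == 0);  // nothing changed => converged, stop
>         }
>       }
>     }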
>
> If I write something like that in the code, wouldn't each Map node run on
> the same data chunk it already has each time? Or will there be a
> re-assignment of Map & Reduce nodes by the Master?
>
>
> Regards,
> Raghava.
>
> On Wed, Feb 3, 2010 at 9:59 AM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>
> Hi,
> If each of your sequential iterations is a map+reduce job, then no.
> The lifetime of a split is confined to a single map reduce job. The split
> is actually a reference to the data, which is used to schedule the job as
> close to the data as possible. The record reader then uses the same object
> to pass the <k,v> pairs in the split.
> What you need, I believe, is “just run on whatever map has”. If you are
> using an exclusive private cluster, you can probably localize the <k,v> from
> the first iteration and use dummy input data (to ensure the same number of
> mapper tasks as the first round, and use custom MapRunner and RecordReader
> classes that do not read data from the supplied input). But how can you
> ensure that you always get the same nodes to run your map reduce job on a
> shared cluster?
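> To make the dummy-input idea concrete, a rough sketch in the old mapred API
> (the /iter stash path and the name mapping are invented; a custom
> InputFormat's getRecordReader() would hand this back while its getSplits()
> returns one dummy split per first-round task, which keeps the task count
> identical):
>
>     import java.io.IOException;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapred.FileSplit;
>     import org.apache.hadoop.mapred.JobConf;
>     import org.apache.hadoop.mapred.RecordReader;
>
>     // Sketch: reader that ignores the dummy split's bytes and replays
>     // <k,v> pairs stashed on HDFS by an earlier round.
>     public class LocalizedReader implements RecordReader<Text, Text> {
>       private final SequenceFile.Reader reader;
>
>       public LocalizedReader(JobConf job, FileSplit dummy) throws IOException {
>         FileSystem fs = FileSystem.get(job);
>         // Map the dummy file name back to the round-one stash (invented).
>         Path stash = new Path("/iter/input-" + dummy.getPath().getName());
>         reader = new SequenceFile.Reader(fs, stash, job);
>       }
>
>       public boolean next(Text key, Text value) throws IOException {
>         return reader.next(key, value);  // stream the stashed records
>       }
>       public Text createKey() { return new Text(); }
>       public Text createValue() { return new Text(); }
>       public long getPos() throws IOException { return reader.getPosition(); }
>       public float getProgress() { return 0f; }  // unknown; fine for a sketch
>       public void close() throws IOException { reader.close(); }
>     }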
> Please correct me if I misunderstood your question.
>
> Amogh
>
>
>
> On 2/3/10 11:34 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
>
> Hi all,
>
>       I need to run a map reduce task repeatedly in order to achieve the
> desired result. Is it possible to avoid redistributing the data set (dividing
> it into chunks and distributing them) at the beginning of each iteration,
> i.e., once the distribution occurs the first time, the map nodes should work
> on the same chunks in every iteration. Can this be done? I have only brief
> experience with MapReduce, and my impression is that the input data set is
> redistributed every time.
>
> Thank you.
>
> Regards,
> Raghava.
>
