hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: Map-Balance-Reduce draft
Date Tue, 09 Feb 2010 17:16:18 GMT
Hi Jian,

Are you essentially proposing "work stealing"? That is to say, if a map task
is taking longer than others, another map task will be able to grab some of
its work from it after the job is already under way?

-Todd

On Tue, Feb 9, 2010 at 1:23 AM, jian yi <eyjian@gmail.com> wrote:

> Just like cutting a cake, we will cut it again if the first is not
> balanced.
> If we can control the size of every task, system will become more simple
> and
> optimizing is possible.
>
> 2010/2/9 jian yi <eyjian@gmail.com>
>
> > Hi Alan,
> >
> > In my opinion, MBR solves the skew problem with minimum cost. It is
> differ
> > with combiner. My English is poor, but I do my best to express myself
> > clearly.
> >
> > In MBR, we can set the size of a task, both map task and reduce task. For
> > example, 120~150MB input for a task, the motive is that we can control a
> > task's run-time to keep the size of every task is almost equal, which is
> the
> > precondition that a task can be regarded as a timeslice switched.
> >
> > By hashing output of map to more splits, we can regroup smaller splits to
> a
> > new split which size is specified and hash bigger spits to more small
> > splits, until the size of all splits is within the specified range.
> >
> > Balance interface is like map and reduce interface. Balance interface
> shoud
> > be implemented when a single key has too many values, because the key
> will
> > be hashed to more than one splits. In the case, we can't get the final
> > results in a MBR session, it is necessary to start a next MBR session.
> For a
> > same key, we can hash it with key+value, the action will be telled to
> > balance interface.
> >
> > Regards
> > Jian Yi
> >
> > 2010/2/9 Alan Gates <gates@yahoo-inc.com>
> >
> > Jian,
> >>
> >> Sorry if any of my questions or comments would have been answered by the
> >> diagrams, but apache lists don't allow attachments, so I can't see your
> >> diagrams.
> >>
> >> If I understand correctly, your suggestion for balancing is to apply
> >> reduce on subsets of the hashed data, and then run reduce again on this
> >> reduced data set.  Is that correct?  If so, how does this differ from
> the
> >> combiner?  Second, some aggregation operations truly aren't algebraic
> (that
> >> is, they cannot be distributed across multiple iterations of reduce).
> An
> >> example of this is session analysis, where the algorithm truly needs to
> see
> >> all operations together to analyze the user session.  How do you propose
> to
> >> handle that case?
> >>
> >> Alan.
> >>
> >>
> >> On Feb 7, 2010, at 11:25 PM, jian yi wrote:
> >>
> >>  Two targets:
> >>> 1. Solving the skew problem
> >>> 2. Regarding a task as a timeslice to improve on scheduler, switching a
> >>> job to another job by timeslice.
> >>>
> >>> In MR (Map-Reduce) model, reducings are not balanced, because the scale
> >>> of partitiones are unbalanced. How to balance? We can control the size
> of
> >>> partition, rehash the bigger parition and combine to the specified
> size. If
> >>> a key has many values, it's necessary to execute mapreduce twice.The
> >>> following is the model digram:
> >>>
> >>> Scheduler can regard a task as a timeslice similarly OS scheduler.
> >>> If a split is bigger than a specified size, it will be splitted again.
> If
> >>> a split is smaller than a specified size, it will be combined with
> others,
> >>> we can name the combining procedure regroup. The combining is logic,
> it's
> >>> not necessay to combine these smaller splits to a disk file, which will
> not
> >>> affect the performance.The target is that every task spent same time
> >>> running.
> >>>
> >>>
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message