hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jian yi <eyj...@gmail.com>
Subject Re: Map-Balance-Reduce draft
Date Tue, 09 Feb 2010 07:45:16 GMT
Hi Alan,

In my opinion, MBR solves the skew problem with minimum cost. It is differ
with combiner. My English is poor, but I do my best to express myself

In MBR, we can set the size of a task, both map task and reduce task. For
example, 120~150MB input for a task, the motive is that we can control a
task's run-time to keep the size of every task is almost equal, which is the
precondition that a task can be regarded as a timeslice switched.

By hashing output of map to more splits, we can regroup smaller splits to a
new split which size is specified and hash bigger spits to more small
splits, until the size of all splits is within the specified range.

Balance interface is like map and reduce interface. Balance interface shoud
be implemented when a single key has too many values, because the key will
be hashed to more than one splits. In the case, we can't get the final
results in a MBR session, it is necessary to start a next MBR session. For a
same key, we can hash it with key+value, the action will be telled to
balance interface.

Jian Yi

2010/2/9 Alan Gates <gates@yahoo-inc.com>

> Jian,
> Sorry if any of my questions or comments would have been answered by the
> diagrams, but apache lists don't allow attachments, so I can't see your
> diagrams.
> If I understand correctly, your suggestion for balancing is to apply reduce
> on subsets of the hashed data, and then run reduce again on this reduced
> data set.  Is that correct?  If so, how does this differ from the combiner?
>  Second, some aggregation operations truly aren't algebraic (that is, they
> cannot be distributed across multiple iterations of reduce).   An example of
> this is session analysis, where the algorithm truly needs to see all
> operations together to analyze the user session.  How do you propose to
> handle that case?
> Alan.
> On Feb 7, 2010, at 11:25 PM, jian yi wrote:
>  Two targets:
>> 1. Solving the skew problem
>> 2. Regarding a task as a timeslice to improve on scheduler, switching a
>> job to another job by timeslice.
>> In MR (Map-Reduce) model, reducings are not balanced, because the scale of
>> partitiones are unbalanced. How to balance? We can control the size of
>> partition, rehash the bigger parition and combine to the specified size. If
>> a key has many values, it's necessary to execute mapreduce twice.The
>> following is the model digram:
>> Scheduler can regard a task as a timeslice similarly OS scheduler.
>> If a split is bigger than a specified size, it will be splitted again. If
>> a split is smaller than a specified size, it will be combined with others,
>> we can name the combining procedure regroup. The combining is logic, it's
>> not necessay to combine these smaller splits to a disk file, which will not
>> affect the performance.The target is that every task spent same time
>> running.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message