hadoop-mapreduce-dev mailing list archives

From jian yi <eyj...@gmail.com>
Subject Re: Map-Balance-Reduce draft
Date Tue, 09 Feb 2010 09:23:00 GMT
Just like cutting a cake, we cut again if the first cut is not balanced.
If we can control the size of every task, the system becomes simpler and
easier to optimize.
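The "cut again until balanced" idea above could be sketched as follows. This is an illustrative Python sketch, not Hadoop code; the function name, the size units, and the tolerance parameter are all hypothetical.

```python
# Hypothetical sketch of iterative balancing: any partition larger than
# the target range is cut again, and undersized leftovers are logically
# regrouped toward the target size. Names and parameters are illustrative.

def balance(partitions, target, tolerance=0.25):
    """partitions: list of partition sizes (e.g. in MB).
    Returns a new list of sizes, each near the target."""
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    pieces = []
    for size in partitions:
        while size > high:          # "cut the cake again"
            pieces.append(target)
            size -= target
        pieces.append(size)
    # logically regroup small leftovers up to roughly target size
    balanced, bucket = [], 0
    for size in sorted(pieces):
        bucket += size
        if bucket >= low:
            balanced.append(bucket)
            bucket = 0
    if bucket:
        balanced.append(bucket)
    return balanced
```

No data is moved here; the sketch only shows how sizes converge toward the target, which is the precondition for treating every task as a roughly equal timeslice.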

2010/2/9 jian yi <eyjian@gmail.com>

> Hi Alan,
> In my opinion, MBR solves the skew problem with minimal cost. It differs
> from the combiner. My English is poor, but I will do my best to express
> myself clearly.
> In MBR, we can set the size of a task, for both map tasks and reduce
> tasks, for example 120~150MB of input per task. The motivation is that
> controlling a task's input size keeps every task's run time almost
> equal, which is the precondition for treating a task as a switchable
> timeslice.
> By hashing the map output into more splits, we can regroup smaller
> splits into a new split of the specified size, and hash bigger splits
> into more small splits, until the size of every split is within the
> specified range.
> The Balance interface is like the map and reduce interfaces. It should
> be implemented when a single key has too many values, because that key
> will be hashed to more than one split. In that case, we cannot get the
> final result in a single MBR session, so a second MBR session must be
> started. For such a key, we can hash on key+value, and this action will
> be reported to the Balance interface.
> Regards
> Jian Yi
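The key+value hashing described in the message above could be sketched like this. This is an illustrative Python sketch, not part of Hadoop's Partitioner API; the function name and the `skewed_keys` parameter are assumptions.

```python
# Hypothetical sketch: a key with too many values is spread across
# partitions by hashing on (key, value) instead of key alone. The Balance
# step must then merge that key's partial results in a second MBR session.
from hashlib import md5

def choose_partition(key, value, num_partitions, skewed_keys):
    """skewed_keys: keys known (or detected) to have too many values."""
    if key in skewed_keys:
        # spread this key's values over several partitions
        digest = md5(f"{key}\x00{value}".encode()).hexdigest()
    else:
        # normal case: all values of a key go to one partition
        digest = md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_partitions
```

The trade-off is the one Alan raises below: once a key's values are spread across partitions, a single reduce pass can no longer see them all together, so a follow-up pass is required.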
> 2010/2/9 Alan Gates <gates@yahoo-inc.com>
> Jian,
>> Sorry if any of my questions or comments would have been answered by the
>> diagrams, but Apache lists don't allow attachments, so I can't see your
>> diagrams.
>> If I understand correctly, your suggestion for balancing is to apply
>> reduce on subsets of the hashed data, and then run reduce again on this
>> reduced data set.  Is that correct?  If so, how does this differ from the
>> combiner?  Second, some aggregation operations truly aren't algebraic (that
>> is, they cannot be distributed across multiple iterations of reduce).   An
>> example of this is session analysis, where the algorithm truly needs to see
>> all operations together to analyze the user session.  How do you propose to
>> handle that case?
>> Alan.
>> On Feb 7, 2010, at 11:25 PM, jian yi wrote:
>>  Two targets:
>>> 1. Solving the skew problem
>>> 2. Treating a task as a timeslice to improve the scheduler, switching
>>> from one job to another by timeslice.
>>> In the MR (Map-Reduce) model, reduces are not balanced, because the
>>> sizes of the partitions are unbalanced. How do we balance them? We can
>>> control the size of a partition: rehash the bigger partitions and
>>> combine the smaller ones to the specified size. If a key has many
>>> values, it is necessary to execute MapReduce twice. The following is
>>> the model diagram:
>>> The scheduler can regard a task as a timeslice, similar to an OS
>>> scheduler.
>>> If a split is bigger than a specified size, it will be split again. If
>>> a split is smaller than a specified size, it will be combined with
>>> others; we can call this combining procedure "regroup". The combining
>>> is logical: it is not necessary to write the smaller splits into a disk
>>> file, so it does not affect performance. The target is that every task
>>> spends the same time running.
