hadoop-mapreduce-dev mailing list archives

From jian yi <eyj...@gmail.com>
Subject Map-Balance-Reduce draft
Date Mon, 08 Feb 2010 07:25:52 GMT
Two targets:
1. Solve the skew problem.
2. Treat a task as a timeslice to improve the scheduler, so that it can switch
from one job to another at timeslice boundaries.

In the MR (Map-Reduce) model, the reduce tasks are not balanced because the
partitions differ in size. How can they be balanced? We can control the
partition size: rehash the bigger partitions and combine the pieces up to a
specified size. If a single key has many values, it is necessary to run
MapReduce twice. The following is the model diagram:
[image: model diagram]
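The balance step can be sketched roughly as follows. This is only an
illustration in Python, not Hadoop code: the names stable_hash,
rehash_partition, combine_partitions and balance are hypothetical, and each
partition is assumed to be a dict that maps a key to the number of values it
holds. One rehash pass does not guarantee that every piece fits the target
size, and a partition made of a single hot key is left as it is, which is the
case that needs the second MapReduce run.

```python
import hashlib
from collections import defaultdict


def stable_hash(key, salt):
    # Hypothetical helper: a deterministic hash that differs from the
    # original partitioner through the salt.
    digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def rehash_partition(key_sizes, target_size, salt=1):
    """Spread the keys of one oversized partition over several buckets."""
    total = sum(key_sizes.values())
    n_buckets = -(-total // target_size)  # ceiling division
    buckets = defaultdict(dict)
    for key, size in key_sizes.items():
        buckets[stable_hash(key, salt) % n_buckets][key] = size
    return list(buckets.values())


def combine_partitions(partitions, target_size):
    """Greedily merge partitions while the merged size stays within target."""
    groups, current, current_size = [], {}, 0
    for part in sorted(partitions, key=lambda p: sum(p.values())):
        size = sum(part.values())
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = {}, 0
        current.update(part)
        current_size += size
    if current:
        groups.append(current)
    return groups


def balance(partitions, target_size):
    """Rehash the bigger partitions, then combine pieces up to target_size."""
    pieces = []
    for part in partitions:
        # A single-key partition cannot be split by rehashing its keys.
        if sum(part.values()) > target_size and len(part) > 1:
            pieces.extend(rehash_partition(part, target_size))
        else:
            pieces.append(part)
    return combine_partitions(pieces, target_size)
```

For example, balance([{"a": 6, "b": 5, "c": 4}, {"d": 1}, {"hot": 100}], 10)
keeps every key exactly once and leaves "hot" in a group of its own.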
The scheduler can treat a task as a timeslice, similar to an OS scheduler.
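One possible reading of this, as a small Python sketch: the scheduler runs one
task of a job, then switches to the next job, the way a round-robin OS
scheduler hands out timeslices. The function name round_robin and the
dict-of-task-lists input are assumptions made for the example, not an existing
Hadoop interface.

```python
from collections import deque


def round_robin(jobs):
    """Yield (job_name, task) pairs, switching to the next job after each task."""
    # jobs: hypothetical mapping of job name -> list of pending tasks.
    queue = deque((name, deque(tasks)) for name, tasks in jobs.items() if tasks)
    while queue:
        name, tasks = queue.popleft()
        yield name, tasks.popleft()  # one task is one timeslice
        if tasks:
            queue.append((name, tasks))  # the job goes to the back of the queue
```

This only works as a timeslice if tasks take similar time to run, which is
what the split handling is meant to achieve.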
If a split is bigger than a specified size, it is split again. If a split is
smaller than the specified size, it is combined with others; we can call
this combining procedure regroup. The combining is logical: it is not
necessary to merge the smaller splits into one disk file, so it does not
affect performance. The goal is for every task to take about the same time
to run.
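A minimal Python sketch of regroup, under some assumptions: a split is
described only by a path, an offset and a length (the Split class here is
hypothetical, not Hadoop's InputSplit), and cutting at arbitrary byte offsets
is acceptable, which ignores record boundaries. A group is just a list of
references to splits, so nothing is copied or merged on disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    # Hypothetical description of an input split.
    path: str
    offset: int
    length: int


def resplit(split, target):
    """Cut one split into pieces no longer than target bytes."""
    pieces = []
    pos = split.offset
    end = split.offset + split.length
    while pos < end:
        length = min(target, end - pos)
        pieces.append(Split(split.path, pos, length))
        pos += length
    return pieces


def regroup(splits, target):
    """Return task groups; each group is a list of Split references."""
    pieces = [p for s in splits for p in resplit(s, target)]
    groups, current, size = [], [], 0
    for piece in pieces:
        # Close the current group when the next piece would exceed target.
        if current and size + piece.length > target:
            groups.append(current)
            current, size = [], 0
        current.append(piece)
        size += piece.length
    if current:
        groups.append(current)
    return groups
```

With a target of 100, splits of length 250, 30 and 40 become four groups of
100, 100, 80 and 40 bytes, so no task handles more than the target size.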
[image]
