hadoop-common-user mailing list archives

From Rares Vernica <rvern...@gmail.com>
Subject control map to split assignment
Date Thu, 27 Aug 2009 01:52:45 GMT

I wonder if there is a way to control how maps are assigned to splits
in order to balance the load across the cluster.

Here is a simplified example. I have two types of inputs: "long" and
"short". Each input is in a different file and will be processed by a
single map task. Suppose the "long" inputs take 10s to process while
the "short" inputs take 3s to process. I have two "long" inputs and
two "short" inputs. My cluster has 2 nodes and each node can execute
only one map task at a time. A possible schedule of the tasks could be
the following:

Node 1: "long map", "short map" -> 10s + 3s = 13s
Node 2: "long map", "short map" -> 10s + 3s = 13s

So, my job will be done in 13s. Another possible schedule is:

Node 1: "long map" -> 10s
Node 2: "short map", "short map", "long map" -> 3s + 3s + 10s = 16s

In this case, my job will be done in 16s. Clearly, the first schedule is better.
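
The arithmetic for the two schedules can be checked with a small
simulation. This is only a sketch, not Hadoop code: it assumes that a
node picks up the next pending map as soon as it becomes free, and the
function name makespan and the task orders are made up for
illustration.

```python
import heapq

def makespan(durations, nodes=2):
    """Job completion time when tasks are handed out in the given order.

    Assumption: each task goes to whichever node becomes free first,
    and each node runs one task at a time.
    """
    free_at = [0] * nodes              # time at which each node is idle again
    heapq.heapify(free_at)
    for seconds in durations:
        start = heapq.heappop(free_at)  # earliest available node
        heapq.heappush(free_at, start + seconds)
    return max(free_at)

LONG, SHORT = 10, 3

# Both "long" maps first: matches the first schedule.
print(makespan([LONG, LONG, SHORT, SHORT]))  # 13

# One "long" map left for last: matches the second schedule.
print(makespan([LONG, SHORT, SHORT, LONG]))  # 16
```

Under that assumption, the only thing that differs between the two
outcomes is the order in which the maps are handed out.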

Is there a way to control how the schedule is built? If I could control
which inputs are processed first, I could schedule the "long" inputs
to be processed first, so that they would be balanced across nodes and I
would end up with something similar to the first schedule.
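
The "long inputs first" idea can be sketched as follows. Again this is
not Hadoop code, just an illustration: it assumes that an estimate of
each input's processing time is known up front, and the names
longest_first, long-1, short-1, etc. are hypothetical.

```python
def longest_first(tasks, nodes=2):
    """Order tasks by estimated time, longest first, then assign greedily.

    tasks: list of (name, estimated_seconds) pairs.
    Returns (plan, finish): the task names per node and each node's
    finish time. Assumes each node runs one task at a time.
    """
    plan = [[] for _ in range(nodes)]
    finish = [0] * nodes
    for name, seconds in sorted(tasks, key=lambda t: t[1], reverse=True):
        node = finish.index(min(finish))  # node that becomes free first
        plan[node].append(name)
        finish[node] += seconds
    return plan, finish

tasks = [("short-1", 3), ("long-1", 10), ("short-2", 3), ("long-2", 10)]
plan, finish = longest_first(tasks)
for node, (names, done) in enumerate(zip(plan, finish), start=1):
    print(f"Node {node}: {names} -> {done}s")
```

With the four inputs from the example, each node gets one "long" and
one "short" map, which is the first schedule.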

I could configure the job so that a "long" input gets processed by
more than one map, and so end up balancing the work, but I noticed that
overall, this takes more time than a bad schedule with only one map
per input.


Rares Vernica
