hadoop-mapreduce-user mailing list archives

From Sandy Ryza <sandy.r...@cloudera.com>
Subject Re: Non data-local scheduling
Date Thu, 03 Oct 2013 17:03:55 GMT
Hi Andre,

Try setting yarn.scheduler.capacity.node-locality-delay to a number between
0 and 1.  This will turn on delay scheduling - here's the doc on how this
works:

For applications that request containers on particular nodes, the number of
scheduling opportunities since the last container assignment to wait before
accepting a placement on another node. Expressed as a float between 0 and
1, which, as a fraction of the cluster size, is the number of scheduling
opportunities to pass up. The default value of -1.0 means don't pass up any
scheduling opportunities.
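
As a rough illustration of the arithmetic in that description, here is a small sketch. It is not the scheduler's actual code: the helper name, the example value of 0.5, and the choice to return an unrounded number are all assumptions made for illustration. The 25 nodes come from the cluster described in the original question.

```python
def opportunities_to_pass_up(threshold: float, cluster_size: int) -> float:
    """Scheduling opportunities to skip, following the doc text quoted above.

    Hypothetical helper: the name and the unrounded return value are
    assumptions, not taken from the YARN source.
    """
    if threshold < 0:
        # A negative value such as the default -1.0 means: don't pass up any.
        return 0.0
    if threshold > 1:
        raise ValueError("threshold is expected to be between 0 and 1")
    # The threshold is read as a fraction of the cluster size.
    return threshold * cluster_size


# The 25-node cluster from the question, with an example threshold of 0.5.
print(opportunities_to_pass_up(0.5, 25))   # 12.5
print(opportunities_to_pass_up(-1.0, 25))  # 0.0
```

Read this way, a larger fraction makes an application wait through more scheduling opportunities before it accepts a placement on another node.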


On Thu, Oct 3, 2013 at 9:57 AM, André Hacker <andrephacker@gmail.com> wrote:

> Hi,
> I have a 25-node cluster running Hadoop 2.1.0-beta with the capacity
> scheduler (default scheduler settings) and replication factor 3.
> I have exclusive access to the cluster to run a benchmark job and I wonder
> why there are so few data-local and so many rack-local maps.
> The input format calculates 44 input splits and 44 map tasks; however, the
> number of them that are processed data-locally seems to be random. Here are
> the counters from my last runs:
> data-local / rack-local:
> Test 1: data-local:15 rack-local: 29
> Test 2: data-local:18 rack-local: 26
> I don't understand why the maps are not always 100% data-local. This should
> not be a problem since the blocks of my input file are distributed over all
> nodes.
> Maybe someone can give me a hint.
> Thanks,
> André Hacker, TU Berlin
