hadoop-mapreduce-user mailing list archives

From Virajith Jalaparti <virajit...@gmail.com>
Subject Re: Lack of data locality in Hadoop-0.20.2
Date Tue, 12 Jul 2011 22:21:05 GMT
Could the maps be non-data-local because each map reads more HDFS data than
the HDFS block size? In the job I was running, the HDFS block size
(dfs.block.size) was 134217728, while HDFS_BYTES_READ by the maps was
134678218 and FILE_BYTES_READ was 134698338.

So HDFS_BYTES_READ is greater than dfs.block.size. Does this imply that
most of the map tasks will be non-local? Further, would Hadoop ensure that
a map task is scheduled on the node that holds the larger chunk of the data
the task will read?
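
For scale, the gap between the two figures quoted above can be worked out directly. This is only a small arithmetic sketch using the numbers from this message; the variable names are mine, and reading the overshoot as a record that straddles a block boundary is an assumption that the counters alone do not confirm.

```python
# Values copied from the message above.
DFS_BLOCK_SIZE = 134217728    # dfs.block.size (128 MiB)
HDFS_BYTES_READ = 134678218   # HDFS_BYTES_READ reported for the maps

# How far the bytes read exceed one block, in bytes and as a fraction.
overshoot = HDFS_BYTES_READ - DFS_BLOCK_SIZE
fraction = overshoot / DFS_BLOCK_SIZE

print(overshoot)           # bytes read beyond one block
print(f"{fraction:.4%}")   # overshoot relative to the block size
```

The overshoot is well under 1% of a block, so nearly all of the data a map reads would still sit in a single block.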


On Tue, Jul 12, 2011 at 7:20 PM, Allen Wittenauer <aw@apache.org> wrote:

> On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:
> > I agree that the scheduler has less leeway when the replication factor
> > is 1. However, I would still expect the number of data-local tasks to be
> > more than 10% even when the replication factor is 1.
>
> How did you load your data? Did you load it from outside the grid or from
> one of the datanodes? If you loaded from one of the datanodes, you'll
> basically have no real locality, especially with a rep factor of 1.
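
The point in the quoted reply about where the data was loaded from can be illustrated with a toy model. The sketch below assumes the usual HDFS behaviour that the first replica of a block is written to the client's own node when the client is a datanode, and to some other chosen node otherwise. The function `place_blocks`, the node names, and the block count are all hypothetical; this is not Hadoop's placement code and it does not model the scheduler.

```python
import random
from collections import Counter

def place_blocks(num_blocks, nodes, writer=None, seed=0):
    """Toy model of block placement with a replication factor of 1.

    If the writer is itself a datanode, every block lands on that node;
    otherwise each block goes to a randomly chosen node.
    """
    rng = random.Random(seed)
    if writer is not None:
        return [writer] * num_blocks
    return [rng.choice(nodes) for _ in range(num_blocks)]

# Hypothetical 10-node cluster holding 100 blocks.
nodes = [f"node{i}" for i in range(10)]
from_datanode = Counter(place_blocks(100, nodes, writer="node0"))
from_outside = Counter(place_blocks(100, nodes))

print(len(from_datanode))  # number of nodes holding any block
print(len(from_outside))   # number of nodes holding any block
```

In this model, loading from a datanode leaves every block on one machine, so only map tasks that happen to run on that machine can be data-local.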
