hadoop-common-user mailing list archives

From Shevek <she...@karmasphere.com>
Subject Re: data locality
Date Tue, 25 Oct 2011 22:13:53 GMT
We pray to $deity that the MapReduce split size is about the same as (or
smaller than) the HDFS block size. We also pray that the file format's
synchronization points are frequent relative to the block size.

The JobClient finds the location of each block of each file. It splits the
job into FileSplit(s), with one per block.

Each FileSplit is processed by a task. The split carries the locations
(hosts) on which the task would best be run.
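
As a rough illustration of how those locations get used, here is a minimal
sketch of a locality-aware pick. It is not the Hadoop scheduler: `pick_split`
and the dict layout are made up for this example, and a real scheduler also
considers rack locality, which is ignored here.

```python
from typing import Dict, List, Optional, Tuple


def pick_split(pending: List[Dict], node: str) -> Tuple[Optional[Dict], bool]:
    """Hand a pending split to `node`, preferring one stored on that node.

    Returns (split, is_local). Hypothetical helper, not a Hadoop API.
    """
    # First choice: a split whose block has a replica on the asking node.
    for i, split in enumerate(pending):
        if node in split["hosts"]:
            return pending.pop(i), True
    # Otherwise fall back to any remaining split; the task reads remotely.
    if pending:
        return pending.pop(0), False
    return None, False


if __name__ == "__main__":
    pending = [
        {"start": 0, "hosts": ["n1", "n2"]},
        {"start": 100, "hosts": ["n3"]},
    ]
    print(pick_split(pending, "n3"))
    print(pick_split(pending, "n3"))
```

The first call finds a split with a replica on the asking node; the second
has nothing local left, so it hands out a remote split instead.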

The last block may be very short. It is then subsumed into the preceding split.
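
The splitting described so far can be sketched roughly as below. This is a
simplified model, not the Hadoop source: the `FileSplit` class here is a
stand-in for the real one, the split size is assumed equal to the block size,
and the 1.1 slop factor is what FileInputFormat uses as far as I recall, so
treat it as an assumption.

```python
from dataclasses import dataclass
from typing import List

SPLIT_SLOP = 1.1  # assumed: tail may be up to 10% over one split before it gets its own


@dataclass
class FileSplit:
    """Simplified stand-in for Hadoop's FileSplit: byte range plus preferred hosts."""
    start: int
    length: int
    hosts: List[str]


def compute_splits(file_length: int, block_size: int,
                   block_hosts: List[List[str]]) -> List[FileSplit]:
    """One split per block; a short tail is folded into the last split.

    block_hosts[i] lists the hosts holding block i (hypothetical input).
    """
    splits = []
    offset = 0
    remaining = file_length
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / block_size > SPLIT_SLOP:
        splits.append(FileSplit(offset, block_size, block_hosts[offset // block_size]))
        offset += block_size
        remaining -= block_size
    # Whatever is left (possibly slightly more than one block) is the last split.
    if remaining > 0:
        splits.append(FileSplit(offset, remaining, block_hosts[offset // block_size]))
    return splits
```

With a 100-byte block size, a 205-byte file yields two splits (100 and 105
bytes) rather than three; the 5 bytes in the third block are read remotely by
the task that handles the second split.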

Some data is transferred between nodes when the synchronization point for
the file format is not at a block boundary. (It basically never is, but we
hope it's close, or the purpose of MR locality is defeated.)
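
To make the boundary handling concrete, here is a minimal sketch for a
newline-delimited file, where the newline plays the role of the
synchronization point. It follows the convention I understand Hadoop's
line-oriented reader to use (skip the first partial line unless at offset 0,
and read past the end of the split to finish the last record), but
`read_records` itself is a hypothetical helper working on an in-memory byte
string, not a Hadoop API.

```python
from typing import List


def read_records(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline at or after `start`. The previous split reads that line.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Read every line that begins at or before `end`. The last one usually
    # runs past `end`, which is where the cross-node transfer comes from.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        records.append(data[pos:nl])
        pos = nl + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    block = 8
    for start in range(0, len(data), block):
        length = min(block, len(data) - start)
        print(start, read_records(data, start, length))
```

Every line is read by exactly one split, whatever the block size. The bytes a
split reads beyond its own `end` are the ones that may come from another node,
so short records relative to the block size keep that overhead small.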

Specifically to your questions: Most of the data should be read from the
local HDFS node under the above assumptions. The communication layer between
MapReduce and HDFS is not special.


On 25 October 2011 11:49, <Ivan.Novick@emc.com> wrote:

> Hello,
> I am trying to understand how data locality works in hadoop.
> If you run a map reduce job do the mappers only read data from the host on
> which they are running?
> Is there a communication protocol between the map reduce layer and HDFS
> layer so that the mapper gets optimized to read data locally?
> Any pointers on which layer of the stack handles this?
> Cheers,
> Ivan
