hadoop-common-user mailing list archives

From: <Ivan.Nov...@emc.com>
Subject: Re: data locality
Date: Wed, 26 Oct 2011 00:36:43 GMT
So I guess the job tracker is the one reading the HDFS meta-data and then
optimizing the scheduling of map jobs based on that?
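
Roughly, yes: the JobClient computes the splits, and the JobTracker hands each
map task to a TaskTracker, preferring one whose host appears in the split's
location list. Below is a minimal sketch of just that preference order,
against the mapreduce FileSplit API; pickSplitFor is a hypothetical helper,
and the real scheduler also considers rack topology, speculative execution,
and cluster load:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class LocalityPreference {
        // Hypothetical helper showing only the preference order: given
        // the host of a TaskTracker asking for work, hand out a split
        // with a replica on that host if one is pending.
        static FileSplit pickSplitFor(String trackerHost,
                List<FileSplit> pending)
                throws IOException, InterruptedException {
            for (FileSplit split : pending) {
                for (String host : split.getLocations()) {
                    if (host.equals(trackerHost)) {
                        return split;  // node-local: a replica lives here
                    }
                }
            }
            // No node-local candidate: accept a non-local read (the real
            // scheduler would try rack-local before giving up).
            return pending.isEmpty() ? null : pending.get(0);
        }
    }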

On 10/25/11 3:13 PM, "Shevek" <shevek@karmasphere.com> wrote:

>We pray to $deity that the mapreduce block size is about the same as (or
>smaller than) the hdfs block size. We also pray that file format
>synchronization points are frequent when compared to block boundaries.
>The JobClient finds the location of each block of each file. It splits the
>job into FileSplit(s), with one per block.
>Each FileSplit is processed by a task. The split carries the locations on
>which the task would best be run (see the first sketch after the quoted
>thread).
>The last block may be very short; it is then subsumed into the preceding
>split.
>Some data is transferred between nodes when the synchronization point for
>the file format is not at a block boundary. (It basically never is, but we
>hope it's close, or the purpose of MR locality is defeated.)
>Specifically to your questions: Most of the data should be read from the
>local hdfs node under the above assumptions. The communication layer
>between mapreduce and hdfs is not special.
>On 25 October 2011 11:49, <Ivan.Novick@emc.com> wrote:
>> Hello,
>> I am trying to understand how data locality works in hadoop.
>> If you run a map reduce job, do the mappers only read data from the host
>> on which they are running?
>> Is there a communication protocol between the map reduce layer and HDFS
>> layer so that the mapper gets optimized to read data locally?
>> Any pointers on which layer of the stack handles this?
>> Cheers,
>> Ivan
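
For concreteness, here is a minimal sketch of the split computation Shevek
describes, using the stable FileSystem and FileSplit APIs; the 1.1 slop
factor that folds a short last block into the preceding split mirrors
FileInputFormat, and the class and method names are illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitSketch {
        // If the tail of the file is shorter than 10% of a block, fold
        // it into the preceding split instead of making a tiny split of
        // its own.
        private static final double SPLIT_SLOP = 1.1;

        // One FileSplit per HDFS block, each carrying the hosts that
        // hold a replica of that block.
        static List<FileSplit> splitsFor(Path file, Configuration conf)
                throws IOException {
            FileSystem fs = file.getFileSystem(conf);
            FileStatus stat = fs.getFileStatus(file);
            long blockSize = stat.getBlockSize();
            long length = stat.getLen();
            // Ask the NameNode where every block's replicas live.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(stat, 0, length);

            List<FileSplit> splits = new ArrayList<FileSplit>();
            long remaining = length;
            while (((double) remaining) / blockSize > SPLIT_SLOP) {
                long offset = length - remaining;
                splits.add(new FileSplit(file, offset, blockSize,
                        hostsFor(blocks, offset)));
                remaining -= blockSize;
            }
            if (remaining > 0) {
                // Final piece; slightly longer than one block when a
                // short tail was folded in.
                long offset = length - remaining;
                splits.add(new FileSplit(file, offset, remaining,
                        hostsFor(blocks, offset)));
            }
            return splits;
        }

        // Hosts holding a replica of the block containing 'offset'.
        private static String[] hostsFor(BlockLocation[] blocks,
                long offset) throws IOException {
            for (BlockLocation block : blocks) {
                if (offset >= block.getOffset()
                        && offset < block.getOffset() + block.getLength()) {
                    return block.getHosts();
                }
            }
            return new String[0];
        }
    }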
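
And a sketch of where the cross-node reads come from, in the style of a
line-oriented reader such as TextInputFormat's: every split after the first
skips the partial record it starts in, and each split reads one record past
its own end, so a record straddling a block boundary is read exactly once,
partly from a remote DataNode. skipToRecordStart is a hypothetical name:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;

    public class BoundarySketch {
        // Position 'in' at the first newline-separated record owned by
        // a split beginning at 'start'. Records rarely end exactly at a
        // block boundary, so the bytes skipped here (and the bytes the
        // previous split read past its own end) may come from a remote
        // DataNode.
        static long skipToRecordStart(FSDataInputStream in, long start)
                throws IOException {
            if (start == 0) {
                return 0;  // the first split owns the first record
            }
            in.seek(start - 1);
            // The partial record we land in belongs to the previous
            // split; skip to the byte after the next '\n'.
            int b;
            do {
                b = in.read();
            } while (b != -1 && b != '\n');
            return in.getPos();
        }
    }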
