hadoop-common-user mailing list archives

From Jothi Padmanabhan <joth...@yahoo-inc.com>
Subject Re: FileInputFormat, FileSplit, and LineRecorder: where are they run?
Date Fri, 06 Feb 2009 03:47:45 GMT
The RecordReader code is executed on the node on which the map task runs.
The framework tries to run maps on nodes that contain the split. However,
there is no guarantee that maps will run only on nodes that contain the
split. If a split spans multiple blocks, an attempt is made to choose a
node that holds the most data (across the multiple blocks) in that
split for the map to run on.
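
To make the multi-block case concrete, here is a minimal sketch of the idea of picking the host that holds the most bytes of a split. It is not Hadoop's actual scheduling code: the class SplitLocality, the Block type, and the host names are all hypothetical, and the block layout is made up for illustration. It only shows the arithmetic of summing, per host, the overlap between the split's byte range and each block.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of split locality, not Hadoop code.
public class SplitLocality {

    // Hypothetical stand-in for one block of a file: byte range plus replica hosts.
    public static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        public Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // For the split [start, start + length), count how many of its bytes each host stores.
    public static Map<String, Long> bytesPerHost(long start, long length, List<Block> blocks) {
        Map<String, Long> local = new HashMap<>();
        long end = start + length;
        for (Block b : blocks) {
            long overlap = Math.min(end, b.offset + b.length) - Math.max(start, b.offset);
            if (overlap <= 0) {
                continue; // this block lies entirely outside the split
            }
            for (String host : b.hosts) {
                local.merge(host, overlap, Long::sum);
            }
        }
        return local;
    }

    // Host holding the most bytes of the split, or null if no block overlaps it.
    public static String bestHost(long start, long length, List<Block> blocks) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : bytesPerHost(start, length, blocks).entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed layout: three 64 MB blocks, each replicated on two hosts.
        List<Block> blocks = Arrays.asList(
            new Block(0, 64 * mb, Arrays.asList("nodeA", "nodeB")),
            new Block(64 * mb, 64 * mb, Arrays.asList("nodeB", "nodeC")),
            new Block(128 * mb, 64 * mb, Arrays.asList("nodeC", "nodeD")));
        // A 128 MB split starting at offset 16 MB overlaps all three blocks.
        System.out.println(bestHost(16 * mb, 128 * mb, blocks));
    }
}
```

In this made-up layout nodeB stores 48 MB + 64 MB of the split, more than any other host, so it would be the preferred location. As noted above, this is only a preference: the map may still end up on a node that holds none of the split.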


On 2/6/09 2:54 AM, "Saptarshi Guha" <saptarshi.guha@gmail.com> wrote:

> Hello All,
> In order to get a better understanding of Hadoop, I've started reading
> the source and have a question.
> The FileInputFormat reads in files, divides them into splits (which may
> be bigger than the block size), and creates FileSplits.
> The FileSplits contain the start, length *and* the locations of the split.
> The LineRecordReader receives a split and emits records.
> So far I think I'm correct (hopefully). Now, my questions:
> Does the LineRecordReader run on a machine, in some sense, closest to
> the location of the splits? i.e.:
> Q1: If the split is smaller than the block size, then the split is
> located on one machine (apart from replicas): does the
> LineRecordReader run on the machine which contains the split? Or at
> least attempt to?
> Q2: If a split is larger than the block size, it spans multiple
> blocks which could reside on more than one machine. In this case, on
> which machine does the LineRecordReader run? The machine 'closest' to
> them?
> Please correct me if I'm wrong.
> Thank you
> Saptarshi
