hadoop-common-user mailing list archives

From Saptarshi Guha <saptarshi.g...@gmail.com>
Subject FileInputFormat, FileSplit, and LineRecordReader: where are they run?
Date Thu, 05 Feb 2009 21:24:33 GMT
Hello All,
In order to get a better understanding of Hadoop, I've started reading
the source and have a question.
The FileInputFormat reads in files, divides them into splits of the split
size (which may be bigger than the block size), and creates FileSplits.
Each FileSplit contains the start, the length *and* the locations of the split.
The LineRecordReader receives a split and emits records.
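
A minimal plain-Java sketch of the splitting step described above. It models the Hadoop classes but does not use them: `SplitSketch`, `Split`, and `computeSplits` are hypothetical names, the split-size formula is my reading of the source, and taking the hosts from the block that holds the split's first byte is an assumption, not confirmed behaviour. It also ignores any slack the real code may allow on the last split.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of file splitting; these are NOT the Hadoop classes.
class SplitSketch {

    // Hypothetical stand-in for FileSplit: a byte range plus candidate hosts.
    static final class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Assumed formula: cap the goal size at the block size, then enforce the
    // minimum. A minimum above the block size yields splits larger than a block.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // blockHosts[i] lists the hosts holding a replica of block i of the file.
    static List<Split> computeSplits(long fileLength, long blockSize,
                                     long splitSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            long length = Math.min(splitSize, fileLength - offset);
            // Assumption: the split reports the hosts of the block that
            // contains its first byte, even if it spans several blocks.
            int blockIndex = (int) (offset / blockSize);
            splits.add(new Split(offset, length, blockHosts[blockIndex]));
            offset += length;
        }
        return splits;
    }
}
```

With a 250-byte file, 100-byte blocks, and a minimum split size of 150, this produces two splits: bytes 0 to 149 tagged with the hosts of block 0, and bytes 150 to 249 tagged with the hosts of block 1, although the first split also covers half of block 1.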

So far I think I'm correct (hopefully). Now, my questions:
Does the LineRecordReader run on a machine that is, in some sense, closest to
the location of the split? I.e.:
Q1: If the split is smaller than the block size, then the split is
located on one machine (apart from replicas). Does the
LineRecordReader run on the machine which contains the split? Or at
least attempt to?
Q2: If a split is larger than the block size, it spans multiple
blocks, which could reside on more than one machine. In this case, on
which machine does the LineRecordReader run? The machine 'closest' to

Please correct me if I'm wrong.
Thank you

Saptarshi Guha - saptarshi.guha@gmail.com
