hadoop-mapreduce-user mailing list archives

From harry lippy <harryli...@gmail.com>
Subject Question about how input data is presented to the map function
Date Fri, 16 Sep 2011 13:26:35 GMT
Hi from a total noob:

I'm working my way through 'Hadoop: The Definitive Guide' by Tom White.
In Chapter 2, he works through an example of taking weather data from the
NCDC and computing the maximum temperature for each year.  There is a
small sample test file for trying out the code, and it looks like:

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999

In the middle of page 19, he says:

"These lines are presented to the map function as the key-value pairs:

(0,
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999)
(106,
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999)
(212,
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999)
(318,
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999)
(424,
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999)"

The keys are file offsets into the input file.  My question: how does the
'presented to the map function as key-value pairs' step actually happen?  I've
run the example on the input file using the Java Mapper, the Reducer, and the
code that runs the job - none of which, to my novice eye, seems to massage the
input from the file into the (file offset, line of data from file) key-value
format - and yet the results are correct.
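For what it's worth, the offsets do line up with the record lengths: each
sample line is 105 characters, so with a newline every record occupies 106
bytes.  A quick plain-Java check reproduces the keys (no Hadoop classes here;
the class name OffsetDemo is just made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class OffsetDemo {
    // Byte offset at which each line starts, assuming every line is
    // terminated by a single '\n' (as in the NCDC sample file).
    static long[] offsets(String[] lines) {
        long[] result = new long[lines.length];
        long pos = 0;
        for (int i = 0; i < lines.length; i++) {
            result[i] = pos;
            pos += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Each sample record is 105 characters wide; use a stand-in string.
        String record = "0".repeat(105);
        String[] lines = {record, record, record, record, record};
        for (long off : offsets(lines)) {
            System.out.println(off);  // prints 0, 106, 212, 318, 424
        }
    }
}
```

That matches the keys the book shows, so the key appears to be the cumulative
byte position at which each line begins.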

Does Hadoop automagically create key-value pairs in this format (file
offset, line of data from file)?  If so, is there a way to get Hadoop to
present the data to the map function in a different format?

I should probably finish reading the book, since my question is likely
answered there, but I hate moving forward with the feeling that I'm missing
something.

Thanks, everybody!

Shaun
