hadoop-common-user mailing list archives

From "Juho Mäkinen" <juho.maki...@gmail.com>
Subject Implementing own InputFormat and RecordReader
Date Mon, 15 Sep 2008 13:13:38 GMT
I'm trying to implement my own InputFormat and RecordReader for my
data and I'm stuck as I can't find enough documentation about the

My input format is tightly packed binary data consisting of individual
event packets. Each event packet contains its length, and the packets
are simply appended to the end of a file. Thus the file must be read as
a stream and it cannot be split.
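
To make the format concrete, here is a minimal sketch of reading such a
file as a stream in plain Java, without any Hadoop classes. The header
layout is not specified above, so the 4-byte big-endian length prefix
(counting payload bytes only) is an assumption, and the names
PacketStreamSketch, readPacket and readAll are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class PacketStreamSketch {

    /**
     * Reads the next packet payload, or returns null when the stream ends
     * exactly on a packet boundary.
     * ASSUMPTION: each packet starts with a 4-byte big-endian length that
     * counts only the payload bytes following it.
     */
    public static byte[] readPacket(DataInputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            return null; // clean end of stream, no partial header
        }
        byte[] rest = new byte[3];
        in.readFully(rest); // throws EOFException if the header is truncated
        int length = (first << 24)
                | ((rest[0] & 0xFF) << 16)
                | ((rest[1] & 0xFF) << 8)
                | (rest[2] & 0xFF);
        if (length < 0) {
            throw new IOException("Negative packet length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload); // throws EOFException if the payload is truncated
        return payload;
    }

    /** Reads packets sequentially from the start of the stream to its end. */
    public static List<byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> packets = new ArrayList<byte[]>();
        byte[] packet;
        while ((packet = readPacket(in)) != null) {
            packets.add(packet);
        }
        return packets;
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory sample: two packets appended back to back.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(3);
        out.write(new byte[] {1, 2, 3});
        out.writeInt(2);
        out.write(new byte[] {9, 8});
        List<byte[]> packets = readAll(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println("packets read: " + packets.size());
    }
}
```

Because a packet boundary can only be found by walking the length
prefixes from the beginning, a reader like this cannot start at an
arbitrary byte offset, which is why the file cannot be split.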

FileInputFormat seems like a reasonable place to start but I
immediately ran into problems and unanswered questions:

1) FileInputFormat.getSplits() returns an InputSplit[] array. If my
input file is 128MB and my HDFS block size is 64MB, will it return one
InputSplit or two InputSplits?

2) If my file is split into two or more filesystem blocks, how will
Hadoop handle the reading of those blocks? As the file must be read in
sequence, will Hadoop first copy every block to one machine (if the
blocks aren't already there) and then start the mapper on that
machine? Do I need to handle opening and reading multiple blocks myself,
or will Hadoop provide me a simple stream interface which I can use to
read the entire file without worrying whether the file is larger than the
HDFS block size?

 - Juho Mäkinen
