hadoop-hdfs-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Custom InputFormat error
Date Wed, 29 Aug 2012 07:46:06 GMT
Hi Chen,

Do your record reader and mapper handle the case where a map split
may not contain a whole record? Your case is not very different
from the newline-handling logic presented here:

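The split-boundary rule that LineRecordReader applies to newlines can be adapted to #Header#/#Trailer# records: a split owns every record whose #Header# begins inside the split, may read past the split's end to reach that record's #Trailer#, and skips forward to the next #Header# if it starts mid-record. A rough illustration of that rule (plain Java rather than the actual Hadoop RecordReader API; the class and method names here are made up for the sketch, and line indices stand in for byte offsets):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-ownership rule, NOT a real Hadoop RecordReader.
// A split [start, end) owns every record whose #Header# line starts
// inside [start, end); it may read past `end` to finish that record,
// and it skips any partial record it begins in the middle of.
public class SplitRecordDemo {

    // Return the records owned by the split [start, end) over `lines`.
    static List<String> readSplit(String[] lines, int start, int end) {
        List<String> records = new ArrayList<>();
        int i = start;
        // Skip a partial record: advance to the first #Header# at or
        // after `start` (the previous split is responsible for it).
        while (i < end && !lines[i].equals("#Header#")) {
            i++;
        }
        // Consume records whose #Header# falls inside this split.
        while (i < end && lines[i].equals("#Header#")) {
            StringBuilder rec = new StringBuilder();
            // Read past `end` if needed, up to the matching #Trailer#.
            while (i < lines.length && !lines[i].equals("#Trailer#")) {
                rec.append(lines[i++]).append('\n');
            }
            if (i < lines.length) {
                rec.append(lines[i++]).append('\n'); // include #Trailer#
            }
            records.add(rec.toString());
        }
        return records;
    }
}
```

With this rule, two splits over one file never duplicate or drop a record, even when a record straddles the boundary; that is why the real fix is in the record reader, not in rewriting InputSplit.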
On Wed, Aug 29, 2012 at 11:13 AM, Chen He <airbots@gmail.com> wrote:
> Hi guys
> I met an interesting problem when implementing my own custom InputFormat, which
> extends FileInputFormat. (I rewrote the RecordReader class but not the
> InputSplit class.)
> My record reader treats the following format as one basic record. (My
> record reader extends LineRecordReader; it returns a record once it meets
> #Trailer# and the record contains #Header#. I have only one input file, composed
> of many of the following basic records.)
> #Header#
> .....(many lines, may be 0 lines or 1000 lines, it varies)
> #Trailer#
> Everything works fine when the number of basic records in the input file is an
> integer multiple of the number of mappers. For example, I use 2 mappers and there
> are two basic records in my input file, or I use 3 mappers and there are 6 basic
> records in the file.
> However, if I use 4 mappers but there are only 3 basic records in the input
> file (not an integer multiple), the final output is incorrect. The "Map Input
> Bytes" job counter is also less than the input file size. How can I
> fix it? Do I need to rewrite the InputSplit?
> Any reply will be appreciated!
> Regards!
> Chen

Harsh J
