hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Custom InputFormat error
Date Thu, 30 Aug 2012 02:49:34 GMT
No, what I mean is that your RecordReader should be able to handle the
case where its split starts in the middle of a record, so that it may
not be able to read any record at all (i.e. it returns false right up
front).
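
As a rough illustration of the case described above, here is a minimal
sketch in plain Java. It does not use the Hadoop API: the class
SplitRecordSketch and its methods readSplit, readAll and sample are
hypothetical names, and the ownership rule (a record belongs to the
split in which its #Header# line begins) is an assumption modeled on the
newline handling of line-based readers, not the actual LineRecordReader
code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone simulation of split handling; not Hadoop code.
final class SplitRecordSketch {
    static final String HEADER = "#Header#";
    static final String TRAILER = "#Trailer#";

    // Returns the records owned by the byte range [start, end) of data.
    // Assumes 0 <= start <= end <= data.length and '\n' line endings.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // If the split begins mid-line, skip ahead to the next line start.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step over the newline itself
        }
        StringBuilder current = null; // non-null while inside a record
        while (pos < data.length) {
            // Outside a record, never start a line at or past the split end.
            if (current == null && pos >= end) {
                break;
            }
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            String line = new String(data, pos, eol - pos, StandardCharsets.UTF_8);
            pos = eol + 1;
            if (current == null) {
                // Lines before the first header belong to the previous split.
                if (line.equals(HEADER)) {
                    current = new StringBuilder(line).append('\n');
                }
            } else {
                // Inside a record, keep reading past 'end' until the trailer.
                current.append(line).append('\n');
                if (line.equals(TRAILER)) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        return records;
    }

    // Cuts data into numSplits equal byte ranges and reads each in turn.
    static List<String> readAll(byte[] data, int numSplits) {
        List<String> all = new ArrayList<>();
        int splitSize = (data.length + numSplits - 1) / numSplits;
        for (int i = 0; i < numSplits; i++) {
            int start = Math.min(i * splitSize, data.length);
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    // Three records of different lengths, as in the 4-mapper example.
    static byte[] sample() {
        String text = "#Header#\na\n#Trailer#\n"
                + "#Header#\nb\nc\n#Trailer#\n"
                + "#Header#\n#Trailer#\n";
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sample();
        int numSplits = 4;
        int splitSize = (data.length + numSplits - 1) / numSplits;
        for (int i = 0; i < numSplits; i++) {
            int start = Math.min(i * splitSize, data.length);
            int end = Math.min(start + splitSize, data.length);
            System.out.println("split " + i + " [" + start + ", " + end + "): "
                    + readSplit(data, start, end).size() + " record(s)");
        }
    }
}
```

With this sample and 4 splits, the last split begins inside the third
record and so returns no records, while the split before it reads past
its own end to finish that record. In a real RecordReader the same idea
would mean returning false from the first call that asks for a record.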

On Wed, Aug 29, 2012 at 1:27 PM, Chen He <airbots@gmail.com> wrote:
> Hi Harsh
>
> Thank you for your reply. Do you mean I need to change the FileSplit to
> avoid the errors I mentioned?
>
> Regards!
>
> Chen
>
> On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Hi Chen,
>>
>> Do your record reader and mapper handle the case where one map split
>> may not contain a whole record? Your case is not very different
>> from the newlines logic presented here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <airbots@gmail.com> wrote:
>> > Hi guys
>> >
>> > I ran into an interesting problem while implementing my own custom
>> > InputFormat, which extends FileInputFormat. (I rewrote the RecordReader
>> > class but not the InputSplit class.)
>> >
>> > My RecordReader extends LineRecordReader and treats the following
>> > format as a basic record. It returns a record once it reaches
>> > #Trailer#, provided the record also contains #Header#. I have only one
>> > input file, which is composed of many of these basic records:
>> >
>> > #Header#
>> > .....(many lines, may be 0 lines or 1000 lines, it varies)
>> > #Trailer#
>> >
>> > Everything works fine if the number of basic records in the file is an
>> > integer multiple of the number of mappers. For example, I use 2 mappers
>> > and there are 2 basic records in my input file, or I use 3 mappers and
>> > there are 6 basic records in the input file.
>> >
>> > However, if I use 4 mappers and there are 3 basic records in the input
>> > file (not an integer multiple), the final output is incorrect. The "Map
>> > Input Bytes" job counter is also less than the input file size. How can
>> > I fix it? Do I need to rewrite the InputSplit?
>> >
>> > Any reply will be appreciated!
>> >
>> > Regards!
>> >
>> > Chen
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
