hadoop-common-user mailing list archives

From Nick Cen <cenyo...@gmail.com>
Subject Re: multi-line records and file splits
Date Wed, 06 May 2009 01:06:22 GMT
I think your SDFInputFormat should extend MultiFileInputFormat instead of
TextInputFormat; MultiFileInputFormat will not split the file into chunks.
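Another way to get the same one-file-one-split effect, if you would rather
stay with a plain FileInputFormat, is to override isSplitable() so each file
goes to a single mapper in one piece. This is only a rough sketch against the
old org.apache.hadoop.mapred API, and the SDFRecordReader constructor is just
a guess at what the code on the blog looks like:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class SDFInputFormat extends FileInputFormat<LongWritable, Text> {

    // One file becomes one split, so no record is ever cut in half.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        // Hypothetical constructor: whatever the reader from the blog post takes.
        return new SDFRecordReader((FileSplit) split, job);
    }
}

The trade-off is that a single large file is no longer processed in parallel,
which only starts to matter once the files get much bigger than one block.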

2009/5/6 Rajarshi Guha <rguha@indiana.edu>

> Hi, I have implemented a subclass of RecordReader to handle a plain text
> file format where a record is multi-line and of variable length.
> Schematically each record is of the form
>
> some_title
> foo
> bar
> $$$$
> another_title
> foo
> foo
> bar
> $$$$
>
> where $$$$ is the marker for the end of the record. My code is at
> http://blog.rguha.net/?p=293 and it seems to work fine on my input data.
>
> However, I realized that when I run the program, Hadoop will 'chunk' the
> input file. As a result, the SDFRecordReader might get a chunk of input
> text, such that the last record is actually incomplete (a missing $$$$). Is
> this correct?
>
> If so, how would the RecordReader implementation recover from this
> situation? Or is there a way to tell Hadoop that the input file should be
> chunked with the end-of-record delimiters in mind?
>
> Thanks
>
> -------------------------------------------------------------------
> Rajarshi Guha  <rguha@indiana.edu>
> GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
> -------------------------------------------------------------------
> Q:  What's polite and works for the phone company?
> A:  A deferential operator.
>
>
>
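On the recovery question above: the other option is to leave splitting on and
make the reader itself sort out the boundaries, which is essentially what
LineRecordReader does for plain lines. A reader whose split does not start at
byte 0 first throws away everything up to and including the first $$$$ (the
previous split's reader owns that record), and every reader finishes the
record it has started even if that takes it past the end of its split. The
sketch below is only illustrative, again against the old mapred API; the class
name is made up, and org.apache.hadoop.util.LineReader assumes 0.20 (older
releases have it as an inner class of LineRecordReader):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

public class DelimitedRecordReader implements RecordReader<LongWritable, Text> {
    private static final String DELIM = "$$$$";

    private final long start;   // first byte of this split
    private final long end;     // first byte past this split
    private long pos;           // current byte offset in the file
    private final LineReader in;

    public DelimitedRecordReader(FileSplit split, JobConf job) throws IOException {
        start = split.getStart();
        end = start + split.getLength();
        FSDataInputStream file =
                split.getPath().getFileSystem(job).open(split.getPath());
        file.seek(start);
        in = new LineReader(file, job);
        pos = start;
        if (start != 0) {
            // We probably landed mid-record: discard lines up to and including
            // the first $$$$; the previous split's reader finishes that record.
            Text skip = new Text();
            int len;
            while ((len = in.readLine(skip)) > 0) {
                pos += len;
                if (DELIM.equals(skip.toString().trim())) break;
            }
        }
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        // A record may be started as long as it begins at or before the split
        // end; once started it is read through to $$$$, even past the boundary.
        if (pos > end) return false;
        key.set(pos);
        StringBuilder record = new StringBuilder();
        Text line = new Text();
        long bytesInRecord = 0;
        int len;
        while ((len = in.readLine(line)) > 0) {
            pos += len;
            bytesInRecord += len;
            if (DELIM.equals(line.toString().trim())) break;
            record.append(line.toString()).append('\n');
        }
        if (bytesInRecord == 0) return false;   // end of file, nothing more to read
        value.set(record.toString());
        return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return pos; }
    public void close() throws IOException { in.close(); }

    public float getProgress() {
        if (end == start) return 0.0f;
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }
}

The one corner case this simple version does not handle is a split boundary
that lands inside the $$$$ line itself, which can make the skip loop throw
away a whole record; that is one more reason the no-splitting route above is
attractive.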


-- 
http://daily.appspot.com/food/
