hadoop-mapreduce-user mailing list archives

From jay vyas <jayunit100.apa...@gmail.com>
Subject Re: Need some help with RecordReader
Date Wed, 29 Oct 2014 01:15:27 GMT
Great question. I like the idea of using the existing FASTA record reader,
if it works for you.
In general, you should know that this isn't too hard. If you want to
implement your own, here is how:

Yes, you're right that a file typically has delimiters at the end of records,
and so it makes sense that FASTA is problematic for this.

The record reader is created by your InputFormat, through a method whose
signature is something like this:

public RecordReader<Text, Text> createRecordReader(InputSplit arg0,
    TaskAttemptContext arg1) throws IOException, InterruptedException {

Thus, a record reader has the WHOLE split as its input.

So, the record reader can easily start reading the file; when it sees the
+++++ demarcation, it can break off a new record, remembering where it is,
and then begin reading again.
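
One caveat about "the whole split": a split is a range of the file, and it
need not begin or end on a +++++ line. Below is a rough plain-Java sketch of
one convention for handling that, similar in spirit to how line-based readers
treat partial lines. It is not Hadoop code: SplitOwnershipSketch and
readSplit are hypothetical names, the "file" is an in-memory String, and the
split is a character range [start, end). The assumed rule is that a reader
owns every record whose header line starts inside its range, and reads past
the end of the range if needed to finish its last record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch; not Hadoop code. The "file" is an in-memory
// String and a split is the character range [start, end).
class SplitOwnershipSketch {
    static final String MARKER = "+++++ InvoiceNo=";

    // Returns invoice number -> text for every record whose header line
    // starts inside [start, end). Assumes 0 <= start <= end <= file.length().
    static Map<String, String> readSplit(String file, int start, int end) {
        Map<String, String> records = new LinkedHashMap<>();
        int pos = start;
        // A split may begin mid-line: skip forward to the next line start.
        if (pos > 0 && file.charAt(pos - 1) != '\n') {
            int nl = file.indexOf('\n', pos);
            pos = (nl < 0) ? file.length() : nl + 1;
        }
        String key = null;                 // null until this split owns a header
        List<String> body = new ArrayList<>();
        while (pos < file.length()) {
            int nl = file.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? file.length() : nl;
            String line = file.substring(pos, lineEnd);
            if (line.startsWith(MARKER)) {
                if (key != null) {
                    // A new header closes the record we were assembling.
                    records.put(key, String.join("\n", body));
                    key = null;
                }
                if (pos >= end) {
                    break; // this header belongs to the next split
                }
                key = line.substring(MARKER.length()).trim();
                body.clear();
            } else if (key != null) {
                // May run past 'end' to finish the last owned record.
                body.add(line);
            }
            pos = lineEnd + 1;
        }
        if (key != null) {
            records.put(key, String.join("\n", body)); // record closed by end of file
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "+++++ InvoiceNo=1\nsome\ntext1\n"
                    + "+++++ InvoiceNo=2\nsome\nmore\ntext2\n";
        // Cut the file in the middle of invoice 1's text.
        System.out.println(readSplit(file, 0, 25).keySet());
        System.out.println(readSplit(file, 25, file.length()).keySet());
    }
}
```

With this rule each invoice is emitted by exactly one split, even when the
cut falls in the middle of a record.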

You unfortunately won't be able to extend KeyValueLineRecordReader; instead,
you'll have to write a record reader that is somewhat similar to
LineRecordReader, except that you'll have to replace the "readLine" call with
something a little more intelligent (i.e. you'll have to keep reading until
you see the start of the next record, return the finished sequence, and then
start assembling the next sequence, until the file is exhausted).
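
To make the "keep reading until you see the next record" part concrete, here
is a minimal plain-Java sketch of the look-ahead logic. It is only an
illustration under some assumptions: InvoiceRecordParser is a hypothetical
class that does not extend Hadoop's RecordReader, it takes an Iterator of
lines as a stand-in for whatever line reader you wrap, and it assumes headers
look exactly like "+++++ InvoiceNo=N". The method names mirror RecordReader's
nextKeyValue / getCurrentKey / getCurrentValue so the logic should be easy to
move over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the look-ahead parsing; not a Hadoop RecordReader.
class InvoiceRecordParser {
    private static final String MARKER = "+++++ InvoiceNo=";

    private final Iterator<String> lines; // stand-in for the underlying line reader
    private String pendingHeader;         // header of the NEXT record, already consumed
    private String currentKey;
    private String currentValue;

    InvoiceRecordParser(Iterator<String> lines) {
        this.lines = lines;
    }

    // Advances to the next record; returns false once the input is exhausted.
    boolean nextKeyValue() {
        // Start from the header remembered by the previous call, if any;
        // otherwise scan forward to the first header line.
        String header = pendingHeader;
        pendingHeader = null;
        while (header == null && lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(MARKER)) {
                header = line;
            }
        }
        if (header == null) {
            return false;
        }
        currentKey = header.substring(MARKER.length()).trim();

        // Collect text lines until the next header or the end of input.
        List<String> body = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(MARKER)) {
                pendingHeader = line; // belongs to the next record, so remember it
                break;
            }
            body.add(line);
        }
        currentValue = String.join("\n", body);
        return true;
    }

    String getCurrentKey() { return currentKey; }

    String getCurrentValue() { return currentValue; }

    public static void main(String[] args) {
        InvoiceRecordParser parser = new InvoiceRecordParser(Arrays.asList(
                "+++++ InvoiceNo=1", "some", "text1",
                "+++++ InvoiceNo=2", "some", "more", "text2").iterator());
        while (parser.nextKeyValue()) {
            System.out.println(parser.getCurrentKey() + " -> "
                    + parser.getCurrentValue().replace("\n", " | "));
        }
    }
}
```

The one piece of state that matters is pendingHeader: the only way to know a
record has ended is to read the first line of the next one, so that line has
to be kept for the following call instead of being thrown away.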

So as a start, you will want to copy LineRecordReader and compile it to
ensure that it's working in your Java setup, and then get it working with
the FASTA files.





On Tue, Oct 28, 2014 at 5:08 PM, John Dison <jdison16@yahoo.com> wrote:

> Hello!
>
> I have a file in the following format:
> +++++ InvoiceNo=1
> some
> text1
> +++++ InvoiceNo=2
> some
> more
> text2
> <...>
>
> Each record starts with a line beginning with five "+", then number of
> invoice.
> Then several lines of text.
> I want the invoice number to become a key for Map operation, and the text
> to become a value.
>
> As far as I understand, I need to implement some kind of custom
> RecordReader class to parse that format.  But all examples I found on the
> Internet deal with formats where there is some mark at the end of the
> record, but in my case I only can see that records ended after reading the
> first line of the next record.
>
> I would be very thankful for any help with implementing such a
> RecordReader.
>
> Thanks in advance,
> John.
>



-- 
jay vyas
