hadoop-mapreduce-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject Re: Need some help with RecordReader
Date Tue, 28 Oct 2014 21:35:17 GMT
This InputFormat reads a FASTA file (see the sample below).
Each record is a header line starting with ">",
followed by N lines of data.

The projects at
https://code.google.com/p/distributed-tools/

have other samples of more complex input formats.


>YDR356W SPC110 SGDID:S000002764, Chr IV from 1186099-1188933, Verified
ORF, "Inner plaque spindle pole body (SPB) component, ortholog of human
kendrin; involved in connecting nuclear microtubules to SPB; interacts with
Tub4p-complex and calmodulin; phosphorylated by Mps1p in cell
cycle-dependent manner"
MDEASHLPNGSLKNMEFTPVGFIKSKRNTTQTQVVSPTKVPNANNGDENEGPVKKRQRRS
IDDTIDSTRLFSEASQFDDSFPEIKANIPPSPRSGNVDKSRKRNLIDDLKKDVPMSQPLK
EQEVREHQMKKERFDRALESKLLGKRHITYANSDISNKELYINEIKSLKHEIKELRKEKN
DTLNNYDTLEEETDDLKNRLQALEKELDAKNKIVNSRKVDDHSGCIEEREQMERKLAELE
RRLRLDTRKGEHSLNISLPDDDELDRDYYNSHVYTRYHDYEYPLRFNLNRRGPYFERRLS
FKTVALLVLACVRMKRIAFYRRSDDNRLRILRDRIESSSGRISW
>YLR244C MAP1 SGDID:S000004234, Chr XII from 626333-625170, reverse
complement, Verified ORF, "Methionine aminopeptidase, catalyzes the
cotranslational removal of N-terminal methionine from nascent polypeptides;
function is partially redundant with that of Map2p"
MSTATTTVTTSDQASHPTKIYCSGLQCGRETSSQMKCPVCLKQGIVSIFCDTSCYENNYK
AHKALHNAKDGLEGAYDPFPKFKYSGKVKASYPLTPRRYVPEDIPKPDWAANGLPVSEQR
NDRLNNIPIYKKDQIKKIRKACMLGREVLDIAAAHVRPGITTDELDEIVHNETIKRGAYP
SPLNYYNFPKSLCTSVNEVICHGVPDKTVLKEGDIVNLDVSLYYQGYHADLNETYYVGEN
ISKEALNTTETSRECLKLAIKMCKPGTTFQELGDHIEKHATENKCSVVRTYCGHGVGEFF
HCSPNIPHYAKNRTPGVMKPGMVFTIEPMINEGTWKDMTWPDDWTSTTQDGKLSAQFEHT
LLVTEHGVEILTARNKKSPGGPRQRIK
>REV1_YJL076W NET1 SGDID:S000003612, Chr X from 295162-298731, Verified
ORF, "Core subunit of the RENT complex, which is a complex involved in
nucleolar silencing and telophase exit; stimulates transcription by RNA
polymerase I and regulates nucleolar structure"
MYKNPLLQSSEAITPGYGFQIPMTAQLSPPVLVVQLRLNAYQLSADGASQAMNTRSQNFYSPTFSVNASRFRKTFLLFKPDIIEDSLNLLTNTKECKVLFDPDLDCGSNDQLSLIEIDEQLSPYMKVINNVNFVDRLIVKYLSVPASDDLDIENKVSKRSKLVGSSSPIQQQPQVSQPSGNNLRAIKKRPITTTTTTGTPRMSGNTASRALPTSVRSSPPPYIQKEGIDEDEDDSNNSVIRIPPSQPQTPPPLFSRGADIGSSIKKIKSVIDEEVISSRDPDVTASKTKQQRNPTMTSMIPTGSLLRQGTLTVRHAHESVVKNIDQATVAATGGNAFSSSSASASFVLENRKPVPTVPRLMGSTIKIPIPREIESIKL
SSDSVSDSSSNSDSDSSSEDDSSSPAKGDDSSDGSDDSDSESKASIFSKGLAASASKKKKPILSAFGGSKFDKKK
>YJL077W-A YJL077W-A SGDID:S000028661, Chr X from 294716-294802, Dubious
ORF, "Identified by gene-trapping, microarray-based expression analysis,
and genome-wide homology searching"
MPGIAFKGKDMVKAIQFLEIVVPCHCTT
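
In a format like the one above, a record ends only when the next header line shows up, so the reader has to keep one line of look-ahead. Below is a minimal sketch of that idea in plain Java, not the actual InputFormat code. The class name MarkerRecordReader and its next() method are hypothetical, and it reads from a BufferedReader rather than from a Hadoop input split. In a real RecordReader the same state would presumably live inside nextKeyValue(), and records that straddle split boundaries would need separate handling (or the file would have to be treated as non-splittable).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical helper: groups lines into (header, body) records using one line of look-ahead. */
public class MarkerRecordReader {
    private final BufferedReader in;
    private final String marker;   // e.g. ">" for FASTA, "+++++" for the invoice file
    private String pendingHeader;  // header line already consumed while finishing the previous record

    public MarkerRecordReader(BufferedReader in, String marker) throws IOException {
        this.in = in;
        this.marker = marker;
        // Skip anything that precedes the first header line.
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(marker)) {
                pendingHeader = line;
                break;
            }
        }
    }

    /** Returns the next (key, value) record, or null at end of input. */
    public Map.Entry<String, String> next() throws IOException {
        if (pendingHeader == null) {
            return null;
        }
        // The key is the header line with the marker stripped.
        String key = pendingHeader.substring(marker.length()).trim();
        pendingHeader = null;
        StringBuilder value = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(marker)) {
                // This line belongs to the NEXT record: remember it and stop.
                pendingHeader = line;
                break;
            }
            if (value.length() > 0) {
                value.append('\n');
            }
            value.append(line);
        }
        return new SimpleEntry<>(key, value.toString());
    }

    public static void main(String[] args) throws IOException {
        String text = "+++++ InvoiceNo=1\nsome\ntext1\n+++++ InvoiceNo=2\nsome\nmore\ntext2\n";
        MarkerRecordReader reader =
                new MarkerRecordReader(new BufferedReader(new StringReader(text)), "+++++");
        Map.Entry<String, String> rec;
        while ((rec = reader.next()) != null) {
            System.out.println(rec.getKey() + " -> " + rec.getValue().replace("\n", " "));
        }
    }
}
```

With ">" as the marker the same class groups the FASTA sample above; with "+++++" it yields the invoice header as the key and the following lines as the value.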




> Some Comment


On Tue, Oct 28, 2014 at 2:08 PM, John Dison <jdison16@yahoo.com> wrote:

> Hello!
>
> I have a file in the following format:
> +++++ InvoiceNo=1
> some
> text1
> +++++ InvoiceNo=2
> some
> more
> text2
> <...>
>
> Each record starts with a line beginning with five "+", then number of
> invoice.
> Then several lines of text.
> I want the invoice number to become a key for Map operation, and the text
> to become a value.
>
> As far as I understand, I need to implement some kind of custom
> RecordReader class to parse that format.  But all examples I found on the
> Internet deal with formats where there is some mark at the end of the
> record, but in my case I only can see that records ended after reading the
> first line of the next record.
>
> I would be very thankful for any help with implementing such a
> RecordReader.
>
> Thanks in advance,
> John.
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
