hadoop-common-user mailing list archives

From "Jingkei Ly" <Jingkei...@detica.com>
Subject RE: Custom InputFormat/OutputFormat
Date Thu, 10 Jul 2008 10:05:17 GMT
I think you need to strip out the newline characters in the value you
return, as the TextOutputFormat will treat each newline character as the
start of a new record.
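
A minimal sketch of that stripping step, assuming it is acceptable to replace each line break inside a block with a single space. The class and method names (NewlineFlattener, flatten) are hypothetical and not part of Hadoop. Note that a streaming mapper such as "cat" also reads and writes one record per line, so the value has to be a single line by the time it reaches the mapper.

```java
// Hypothetical helper, not part of Hadoop: collapses line breaks so that
// a multi-line text block can be emitted as a single-line value.
class NewlineFlattener {

    /** Replaces each run of CR/LF characters with a single space. */
    static String flatten(String block) {
        return block.replaceAll("[\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        String block = "first_line\nsecond_line\r\nthird_line";
        // Prints: first_line second_line third_line
        System.out.println(flatten(block));
    }
}
```

In the next() method quoted below, something like this would be applied to the block's text before it is passed to value.set(...), so that each record reaches the mapper as one line.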

-----Original Message-----
From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it] 
Sent: 09 July 2008 11:27
To: core-user@hadoop.apache.org
Subject: Custom InputFormat/OutputFormat

Hi all,
I want to use hadoop for some streaming text processing on text
documents like:

<doc id=... ... ... >
text text

This is just XML-like notation, not real XML files.

I have to work on the text included between <doc> tags, so I implemented an
InputFormat (extending FileInputFormat) with a RecordReader that returns
the file position as Key and the needed text as Value.
This is the next() method, and I'm pretty sure that it works as expected:

/** Read a text block. */
        public synchronized boolean next(LongWritable key, Text value)
                throws IOException {
            if (pos >= end)
                return false;

            key.set(pos); // key is position
            // put needed text in buffer
            long bytesRead = readBlock(startTag, endTag);
            if (bytesRead == 0)
                return false;
            pos += bytesRead;
            value.set(buffer.getData(), 0, buffer.getLength());
            return true;
        }

But when I test it, using "cat" as the mapper function and TextOutputFormat
as the OutputFormat, I get one key/value pair per line:
for every text block, the first tuple has the file position as key and text
as value, while the remaining ones have text as key and no value, i.e.:

file_pos / first_line
second_line /
third_line /

Where am I wrong?

Thank you in advance,

