hadoop-common-user mailing list archives

From Francesco Tamberi <tam...@cli.di.unipi.it>
Subject Custom InputFormat/OutputFormat
Date Wed, 09 Jul 2008 10:26:43 GMT
Hi all,
I want to use Hadoop for some streaming text processing on text
documents like:

<doc id=... ... ... >
text text
text
...
</doc>


These use XML-like notation but are not real XML files.

I have to work on the text enclosed between <doc> tags, so I implemented an
InputFormat (extending FileInputFormat) with a RecordReader that returns
the file position as the key and the needed text as the value.
This is its next() method, and I'm pretty sure it works as expected:

/** Read a text block. */
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    if (pos >= end)
        return false;

    key.set(pos); // key is the file position
    buffer.reset();
    long bytesRead = readBlock(startTag, endTag); // put the needed text in buffer
    if (bytesRead == 0)
        return false;

    pos += bytesRead;
    value.set(buffer.getData(), 0, buffer.getLength());
    return true;
}
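(The readBlock(startTag, endTag) helper isn't shown above; for context, here is a minimal, self-contained sketch of what such a tag-delimited block reader might look like. This is an assumption on my part, written with plain java.io rather than the actual RecordReader's buffer, and the tag matching is a simplified backtracking scan, not full KMP.)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockReader {
    /**
     * Reads the next block delimited by startTag/endTag from the stream,
     * writing the block (tags included) into out.
     * Returns the number of bytes consumed, or 0 if no complete block is found.
     * Hypothetical helper, not the poster's actual implementation.
     */
    static long readBlock(InputStream in, byte[] startTag, byte[] endTag,
                          ByteArrayOutputStream out) throws IOException {
        long consumed = 0;
        int i = 0;
        int b;
        // 1. Skip ahead until the start tag is matched.
        //    Simplified backtracking: fine for tags without repeated prefixes.
        while (i < startTag.length) {
            b = in.read();
            if (b == -1) return 0;               // EOF before a start tag
            consumed++;
            i = (b == startTag[i]) ? i + 1 : (b == startTag[0] ? 1 : 0);
        }
        out.write(startTag);
        // 2. Copy bytes (including the end tag) until the end tag is matched.
        i = 0;
        while (i < endTag.length) {
            b = in.read();
            if (b == -1) return 0;               // EOF before the end tag
            consumed++;
            out.write(b);
            i = (b == endTag[i]) ? i + 1 : (b == endTag[0] ? 1 : 0);
        }
        return consumed;
    }

    public static void main(String[] args) throws IOException {
        String input = "junk<doc id=1>\nline one\nline two\n</doc>junk";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        long n = readBlock(new ByteArrayInputStream(input.getBytes()),
                           "<doc".getBytes(), "</doc>".getBytes(), buf);
        System.out.println(n);               // bytes consumed from the stream
        System.out.println(buf.toString()); // the extracted <doc>...</doc> block
    }
}
```

Note that the value produced this way spans multiple lines, which matters for how TextOutputFormat renders it below.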

But when I test it, using "cat" as the mapper and TextOutputFormat
as the OutputFormat, I get one key/value pair per line:
for every text block, the first tuple has the file position as key and the
first line of text as value, and the remaining lines appear as keys with empty values, i.e.:

file_pos / first_line
second_line /
third_line /
...

Where am I wrong?

Thank you in advance,
Francesco
