hadoop-common-user mailing list archives

From "Jingkei Ly" <Jingkei...@detica.com>
Subject RE: Custom InputFormat/OutputFormat
Date Thu, 10 Jul 2008 16:58:48 GMT
I think I see now. Just to recap... you are right that TextOutputFormat
outputs Key\tValue\n, which in your case should give:

file_position	<the whole text block, newlines included>

But as your Text_block contains '\n' characters, your output actually
comes out as:

Key					Value
-------				-------------
file_position			first_line_in_text_block
second_line_in_text_block	NOVALUE
third_line_in_text_block	NOVALUE ...

As I mentioned in my other reply, I think you need to write your own
OutputFormat to get the output file exactly how you want (perhaps
something like LineRecordWriter which doesn't write the key out and
outputs a separator of your choosing between each record).
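
Something along these lines might do it (a rough, untested sketch against
the old org.apache.hadoop.mapred API; the class name and the
"block.output.separator" property are placeholders I made up):

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

/** Writes only the value of each record, followed by a configurable separator. */
public class BlockTextOutputFormat<K extends WritableComparable, V extends Writable>
        extends FileOutputFormat<K, V> {

    protected static class BlockRecordWriter<K, V> implements RecordWriter<K, V> {
        private final DataOutputStream out;
        private final byte[] separator;

        public BlockRecordWriter(DataOutputStream out, String separator) {
            this.out = out;
            this.separator = separator.getBytes();
        }

        public synchronized void write(K key, V value) throws IOException {
            // The key is ignored completely; only the value is written, unchanged.
            if (value == null) {
                return;
            }
            if (value instanceof Text) {
                Text text = (Text) value;
                out.write(text.getBytes(), 0, text.getLength());
            } else {
                out.write(value.toString().getBytes());
            }
            out.write(separator);
        }

        public synchronized void close(Reporter reporter) throws IOException {
            out.close();
        }
    }

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        // "block.output.separator" is a made-up property name; default to a blank line.
        String separator = job.get("block.output.separator", "\n\n");
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        DataOutputStream fileOut = fs.create(file, progress);
        return new BlockRecordWriter<K, V>(fileOut, separator);
    }
}

You should then be able to point streaming at it with -outputformat (using
the fully-qualified class name, with the jar on the job's classpath).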

-----Original Message-----
From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it] 
Sent: 10 July 2008 17:15
To: core-user@hadoop.apache.org
Subject: Re: Custom InputFormat/OutputFormat

Ok, I don't want to be a bother, but I think I'm missing something...
I have to:
- extract the relevant text blocks from really big documents (<doc id= .....>)
- apply some Python/C/C++ functions as mappers to the text blocks (called
via a shell script)
- output the processed text back to a text file

In order to do that I:
- wrote a CustomInputFormat that creates [File_position / Text_block]
tuples as key/values, and
- invoked hadoop without a reduce phase (-jobconf mapred.reduce.tasks=0)
'cause I don't want my output to be sorted/grouped.
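
The invocation is roughly the following (the streaming jar path, the HDFS
paths and the package name are only placeholders here, and the jar
containing CustomInputFormat is assumed to be on the job's classpath):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input /user/francesco/docs \
    -output /user/francesco/out \
    -mapper cat \
    -inputformat my.pkg.CustomInputFormat \
    -jobconf mapred.reduce.tasks=0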

As far as I can see, the write method of the LineRecordWriter class in
TextOutputFormat just writes Key\tValue (if they are not null), so I
thought that, using "cat" as the mapper to test the CustomInputFormat,
the result would be:

file_position / the_whole_text_block

Instead, as you already know, I got one tuple for every line, like this:

file_position / first_line_in_text_block
second_line_in_text_block / NOVALUE
third_line_in_text_block / NOVALUE
...

What am I missing?
Thank you for your patience..

Jingkei Ly wrote:
> I think I need to understand what you are trying to achieve better, so
> apologies if these two options don't answer your question fully!
> 1) If you want to operate on the text in the reducer, then you won't
> need to make any changes, as the data between mapper and reducer is
> stored as SequenceFiles and so won't suffer from records being delimited
> by newline characters. So the reducer will see records in the form:
> Key: file_pos
> Value: all your text with newlines preserved
> (see the small reducer sketch after these two options)
> 2) If, however, you are more interested in outputting human-readable
> plain-text files with the specifications you want at the end of your
> MapReduce program, you will probably need to implement your own
> OutputFormat which does not output the key and does not use newline
> characters to separate records. I would suggest looking at
> TextOutputFormat to start.
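> To illustrate option 1: a pass-through reducer along these lines (a
> rough, untested sketch against the old mapred API; imports from
> java.util, org.apache.hadoop.io and org.apache.hadoop.mapred omitted)
> would receive each text block intact:
>     public static class PassThroughReducer extends MapReduceBase
>         implements Reducer<LongWritable, Text, LongWritable, Text> {
>         public void reduce(LongWritable key, Iterator<Text> values,
>             OutputCollector<LongWritable, Text> output, Reporter reporter)
>             throws IOException {
>             while (values.hasNext()) {
>                 // each Text value still carries the newlines the mapper emitted
>                 output.collect(key, values.next());
>             }
>         }
>     }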
> HTH,
> Jingkei
> -----Original Message-----
> From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it]
> Sent: 10 July 2008 14:17
> To: core-user@hadoop.apache.org
> Subject: Re: Custom InputFormat/OutputFormat
> Thank you so much.
> The problem is that I need to operate on the text as is, without
> modification, and I don't want the filepos to be output.
> Is there no way in Hadoop to map and output a block of text
> containing newline characters?
> Thank you again,
> Francesco
> Jingkei Ly wrote:
>> I think you need to strip out the newline characters in the value you
>> return, as the TextOutputFormat will treat each newline character as
>> the start of a new record.
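>> For instance, in the map function something along these lines (just an
>> illustrative fragment; the variable names are made up):
>>     String flattened = value.toString().replaceAll("\n", " ");
>>     output.collect(key, new Text(flattened));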
>> -----Original Message-----
>> From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it]
>> Sent: 09 July 2008 11:27
>> To: core-user@hadoop.apache.org
>> Subject: Custom InputFormat/OutputFormat
>> Hi all,
>> I want to use hadoop for some streaming text processing on text 
>> documents like:
>> <doc id=... ... ... >
>> text text
>> text
>> ...
>> </doc>
>> It's just XML-like notation, not real XML files.
>> I have to work on the text included between the <doc> tags, so I
>> implemented an InputFormat (extending FileInputFormat) with a
>> RecordReader that returns the file position as Key and the needed text
>> as Value.
>> This is the next() method, and I'm pretty sure it works as expected:
>>     /** Read a text block. */
>>     public synchronized boolean next(LongWritable key, Text value)
>>         throws IOException
>>     {
>>         if (pos >= end)
>>             return false;
>>         key.set(pos); // key is the file position
>>         buffer.reset();
>>         // put the needed text into the buffer
>>         long bytesRead = readBlock(startTag, endTag);
>>         if (bytesRead == 0)
>>             return false;
>>         pos += bytesRead;
>>         value.set(buffer.getData(), 0, buffer.getLength());
>>         return true;
>>     }
>> But when I test it, using "cat" as the mapper function and
>> TextOutputFormat as the OutputFormat, I get one key/value pair per line:
>> for every text block, the first tuple has the file position as key and
>> the text as value, and the remaining ones have text as key and no value,
>> i.e.:
>> file_pos / first_line
>> second_line /
>> third_line /
>> ...
>> Where am I wrong?
>> Thank you in advance,
>> Francesco
