hadoop-common-user mailing list archives

From Runping Qi <runp...@yahoo-inc.com>
Subject Re: Custom InputFormat/OutputFormat
Date Thu, 10 Jul 2008 18:10:23 GMT
All this is because you were using streaming.
Streaming treats each line in the stream as one "record" and then breaks it
into a key/value pair (using '\t' as the separator by default).
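To make the splitting rule concrete, here is a small illustration in plain Java of how streaming derives a key/value pair from one line of mapper output (this mimics the default behavior, it is not Hadoop's actual code):

```java
// Illustration only: how streaming turns one line of mapper output into
// a key/value pair, splitting at the first '\t' (the default separator).
public class StreamingSplit {
    public static String[] toKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator: the whole line becomes the key, value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("filepos\tfirst line of block");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

A multi-line text block emitted through streaming becomes several such lines, each split independently, which is why every line after the first shows up as a key on its own.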
If you write your mapper class in Java, the values passed to the calls to
your map function should be the whole text blocks your input record reader
extracts. Your map function should have the logic to process the text blocks
and output the appropriate key/value pairs.
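As a rough sketch of that map-side logic (processBlock here is a made-up stand-in for your real processing; in a Java Mapper the same code would sit inside map(), which receives the whole block as the value):

```java
// Sketch: processing a whole multi-line text block on the map side.
// processBlock() is hypothetical; in a real Java mapper this logic would
// live inside map(key, value, output, reporter).
import java.util.ArrayList;
import java.util.List;

public class BlockMapperSketch {
    // Placeholder processing: number each line of the block.
    public static List<String> processBlock(String block) {
        List<String> out = new ArrayList<>();
        String[] lines = block.split("\n");
        for (int i = 0; i < lines.length; i++) {
            out.add(i + ":" + lines[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // The record reader hands the mapper the whole block, newlines included.
        String block = "first line\nsecond line\nthird line";
        for (String s : processBlock(block)) {
            System.out.println(s);
        }
    }
}
```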



Runping


On 7/10/08 10:41 AM, "Francesco Tamberi" <tamber@cli.di.unipi.it> wrote:

> Really thanks,
> but I still cannot understand why the lines after the first one become a
> key... why does that happen? Shouldn't they still be part of the Value?
> 
> I implemented a CustomOutputFormat that writes out only the Values, and I got:
> 
> first_line_in_text_block
> EOF
> 
> I tried outputting Key only and I got:
> 
> second_line_in_text_block
> third_line_in_text_block
> ...
> last_line_in_text_block
> EOF
> 
> So it seems there's no way to go on... and it seems impossible to me..
> Any hints?
> 
> Thank you again,
> Francesco
> 
> 
> Jingkei Ly wrote:
>> I think I see now. Just to recap... you are right that TextOutputFormat
>> outputs Key\tValue\n, which in your case gives:
>> File_position\tText_block\n.
>> 
>> But as your Text_block contains '\n' your output actually comes out as:
>> 
>> Key                        Value
>> -------------------------  -------------------------
>> file_position              first_line_in_text_block
>> second_line_in_text_block  NOVALUE
>> third_line_in_text_block   NOVALUE
>> ...
>> 
>> As I mentioned in my other reply, I think you need to write your own
>> OutputFormat to get the output file exactly how you want (perhaps
>> something like LineRecordWriter which doesn't write the key out and
>> outputs a separator of your choosing between each record).
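>> For instance, a simplified sketch against plain java.io, mirroring the
>> shape of LineRecordWriter.write but dropping the key (the class name and
>> the record separator shown are my own choices, not anything in Hadoop):

```java
// Simplified sketch of a value-only record writer, modeled on the shape of
// Hadoop's LineRecordWriter but with no Hadoop dependencies. It writes only
// the value, followed by a record separator of your choosing.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ValueOnlyWriterSketch {
    private final DataOutputStream out;
    private final String recordSeparator; // e.g. "\n---\n"; your choice

    public ValueOnlyWriterSketch(DataOutputStream out, String recordSeparator) {
        this.out = out;
        this.recordSeparator = recordSeparator;
    }

    // In a real RecordWriter this would be write(K key, V value);
    // here the key is simply ignored.
    public void write(String value) throws IOException {
        out.write(value.getBytes("UTF-8"));
        out.write(recordSeparator.getBytes("UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ValueOnlyWriterSketch w =
            new ValueOnlyWriterSketch(new DataOutputStream(buf), "\n---\n");
        w.write("first block\nwith newlines");
        w.write("second block");
        System.out.print(buf.toString("UTF-8"));
    }
}
```

>> Because the separator is under your control, embedded newlines in the
>> value no longer get mistaken for record boundaries when you read the
>> file back.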
>> 
>> 
>> -----Original Message-----
>> From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it]
>> Sent: 10 July 2008 17:15
>> To: core-user@hadoop.apache.org
>> Subject: Re: Custom InputFormat/OutputFormat
>> 
>> Ok, I don't want to be a nuisance, but I think I'm missing something..
>> I have to:
>> - extract relevant text blocks from a really big document (<doc id= .....>
>> TEXTBLOCK </doc>)
>> - apply some python/c/c++ functions as mappers to text blocks (called
>> via shell script)
>> - output processed text back to text file
>> 
>> In order to do that I:
>> - wrote a CustomInputFormat that creates [File_position / Text_block]
>> tuples as key/values and
>> - invoked hadoop without reduce phase (-jobconf mapred.reduce.tasks=0)
>> 'cause I don't want my output to be sorted/grouped.
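>> Roughly like this (paths, jar name and class name here are placeholders,
>> just to show the shape of the invocation):

```shell
# Illustrative only: placeholder paths, jar and class names.
hadoop jar hadoop-streaming.jar \
  -input /user/me/docs \
  -output /user/me/out \
  -mapper cat \
  -inputformat my.pkg.CustomInputFormat \
  -jobconf mapred.reduce.tasks=0
```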
>> 
>> As far as I can see, the write method of the LineRecordWriter class in
>> TextOutputFormat just writes Key\tValue (when neither is null), so I thought
>> that, using "cat" as the mapper to test the CustomInputFormat, the
>> result should be:
>> File_position\tText_block\n
>> 
>> Instead, as you already know, I got a tuple for every line, like this:
>> 
>> file_position / first_line_in_text_block
>> second_line_in_text_block / NOVALUE
>> third_line_in_text_block / NOVALUE
>> ...
>> 
>> What am I missing?
>> Thank you for your patience..
>> Francesco
>> 
>> Jingkei Ly wrote:
>>> I think I need to understand what you are trying to achieve better, so
>>> apologies if these two options don't answer your question fully!
>>> 
>>> 1) If you want to operate on the text in the reducer, then you won't
>>> need to make any changes, as the data between mapper and reducer is
>>> stored as SequenceFiles and so won't suffer from records being delimited
>>> by newline characters. So the input to the reducer will see records in the
>>> form:
>>>  
>>> Key: file_pos
>>> Value: all your text with newlines preserved
>>> 
>>> 2) If, however, you are more interested in outputting human-readable
>>> plain-text files in exactly the format you want at the end of your
>>> MapReduce job, you will probably need to implement your own
>>> OutputFormat which does not output the key and does not use newline
>>> characters to separate records. I would suggest looking at
>>> TextOutputFormat as a starting point.
>>> 
>>> HTH,
>>> Jingkei
>>> 
>>> -----Original Message-----
>>> From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it]
>>> Sent: 10 July 2008 14:17
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: Custom InputFormat/OutputFormat
>>> 
>>> Thank you so much.
>>> The problem is that I need to operate on the text as is, without
>>> modification, and I don't want the file position to be output.
>>> Is there really no way in Hadoop to map and output a block of text
>>> containing newline characters?
>>> Thank you again,
>>> Francesco
>>> 
>>> Jingkei Ly wrote:
>>>> I think you need to strip out the newline characters in the value you
>>>> return, as the TextOutputFormat will treat each newline character as
>>>> the start of a new record.
>>>> 
>>>> -----Original Message-----
>>>> From: Francesco Tamberi [mailto:tamber@cli.di.unipi.it]
>>>> Sent: 09 July 2008 11:27
>>>> To: core-user@hadoop.apache.org
>>>> Subject: Custom InputFormat/OutputFormat
>>>> 
>>>> Hi all,
>>>> I want to use hadoop for some streaming text processing on text
>>>> documents like:
>>>> 
>>>> <doc id=... ... ... >
>>>> text text
>>>> text
>>>> ...
>>>> </doc>
>>>> 
>>>> 
>>>> Just xml-like notation but not real xml files.
>>>> 
>>>> I have to work on the text included between <doc> tags, so I implemented
>>>> an InputFormat (extending FileInputFormat) with a RecordReader that
>>>> returns the file position as Key and the needed text as Value.
>>>> This is the next() method, and I'm pretty sure it works as expected:
>>>> 
>>>> /** Read a text block. */
>>>> public synchronized boolean next(LongWritable key, Text value)
>>>>         throws IOException {
>>>>     if (pos >= end)
>>>>         return false;
>>>> 
>>>>     key.set(pos); // key is the file position
>>>>     buffer.reset();
>>>>     // copy the needed text (between startTag and endTag) into buffer
>>>>     long bytesRead = readBlock(startTag, endTag);
>>>>     if (bytesRead == 0)
>>>>         return false;
>>>> 
>>>>     pos += bytesRead;
>>>>     value.set(buffer.getData(), 0, buffer.getLength());
>>>>     return true;
>>>> }
>>>> 
>>>> But when I test it, using "cat" as the mapper function and
>>>> TextOutputFormat as the OutputFormat, I get one key/value pair per line:
>>>> for every text block, the first tuple has the file position as key and
>>>> the first line of text as value, and the remaining tuples have a line of
>>>> text as key and no value... i.e.:
>>>> 
>>>> file_pos / first_line
>>>> second_line /
>>>> third_line /
>>>> ...
>>>> 
>>>> Where am I wrong?
>>>> 
>>>> Thank you in advance,
>>>> Francesco
>>>> 
>>>> 
>>>> 
>>>> This message should be regarded as confidential. If you have received
>>>> this email in error please notify the sender and destroy it immediately.
>>>> Statements of intent shall only become binding when confirmed in hard
>>>> copy by an authorised signatory. The contents of this email may relate
>>>> to dealings with other companies within the Detica Group plc group of
>>>> companies.
>>>> 
>>>> Detica Limited is registered in England under No: 1337451.
>>>> 
>>>> Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
>>>> England.
> 

