hadoop-common-user mailing list archives

From Prasan Ary <voicesnthed...@yahoo.com>
Subject RE: streaming + binary input/output data?
Date Mon, 14 Apr 2008 22:51:27 GMT
John,
   
  My meaning didn't come through. 
   
  If you encode binary data and treat it like any piece of text going through Hadoop's default
input format, at some point your binary data might contain a byte that looks like 00001010.
In hex that is 0A, and in ASCII might it not be interpreted as \n?
   
  Wouldn't you need to ensure that nowhere in your binary data is there
a byte that might be interpreted as a \n?
   
  You may need to define your own input format for this to work.
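   
  To make the concern concrete, here is a minimal Python sketch (not taken from anyone's actual job; the value 8.625 is picked only because its 32-bit little-endian encoding happens to contain the byte 0A). It shows that raw packed floats can contain a newline byte, while the base64 form of the same bytes, as produced by Python's base64.b64encode without line wrapping, does not:

```python
import base64
import struct

# 8.625 as a little-endian 32-bit float is the byte sequence 00 00 0A 41.
raw = struct.pack("<f", 8.625)

# b64encode emits only A-Z, a-z, 0-9, '+', '/' and '=' and adds no line breaks.
encoded = base64.b64encode(raw)

print(b"\n" in raw)      # the raw record contains a newline byte
print(b"\n" in encoded)  # the base64 text does not
```

  So the raw bytes would be split mid-record by a line-oriented input format, whereas a base64 line would only end where the writer put the '\n'. Other encoders may wrap their output at a fixed width, so that is worth checking for whichever library is used.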
   
  

John Menzer <standard00@gmx.net> wrote:
  
Sure! Some equivalent should be possible. 
And as Runping already posted, there have been some ideas about
implementing binary data processing in Hadoop streaming:
https://issues.apache.org/jira/browse/HADOOP-1722
However, this hasn't happened yet.

That's why I am looking for a minimum-effort workaround.

Reading binary data (in my case the data are floats being processed by a
C-coded mapper) just seems to be much faster than parsing the floats from
text.

I am going to implement a base64 version to find out whether it's still
faster than text-parsing.

John
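
For reference, the base64 round trip described in this thread could be sketched as follows. This is an illustration in Python rather than the C mapper discussed here, and it rests on several assumptions: each record is a run of little-endian 32-bit floats, one record per line, and the names encode_record, decode_record and run_mapper as well as the "double every value" map step are hypothetical placeholders.

```python
import base64
import io
import struct


def encode_record(values):
    """Pack floats as little-endian 32-bit values and return one base64 text line."""
    raw = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_record(line):
    """Inverse of encode_record: one base64 text line -> tuple of floats."""
    raw = base64.b64decode(line.strip())
    return struct.unpack("<%df" % (len(raw) // 4), raw)


def run_mapper(lines, out):
    """Decode each input line, apply a placeholder map step, write re-encoded lines."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        values = decode_record(line)
        mapped = [2.0 * v for v in values]  # hypothetical map step
        out.write(encode_record(mapped) + "\n")


# In-memory demo with one record; a streaming mapper would instead call
# run_mapper(sys.stdin, sys.stdout).
demo_in = io.StringIO(encode_record([1.5, 8.625]) + "\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
print(decode_record(demo_out.getvalue()))  # (3.0, 17.25)
```

Whether this beats plain text parsing is exactly the open question above: base64 grows the data by about a third and adds a decode step, so it would have to be measured.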



Pra wrote:
> 
> John,
> 
> That's an interesting approach, but isn't it possible that an equivalent
> \n might get encoded in the binary data?
> 
> John Menzer wrote:
> 
> so you mean you changed the hadoop streaming source code?
> actually i am not really willing to change the source code if it's not
> necessary.
> 
> so i thought about simply encoding the input binary data to txt (e.g. with
> base64) and then adding a '\n' after each line to make it splittable for
> streaming.
> after reading from stdin my C program would just have to decode it,
> map/reduce it and then encode it back to base64 to write to stdout.
> 
> what do you think about that? worth a try?
> 
> 
> 
> Joydeep Sen Sarma wrote:
>> 
>> actually - this is possible - but changes to streaming are required.
>> 
>> at one point - we had gotten rid of the '\n' and '\t' separators between
>> the keys and the values in the streaming code and streamed byte arrays
>> directly to scripts (and then decoded them in the script). it worked
>> perfectly fine. (in fact we were streaming thrift generated byte streams -
>> encoded in java land and decoded in python land :-))
>> 
>> the binary data on hdfs is best stored as sequencefiles (if u store binary
>> data in (what looks to hadoop as) a text file - then bad things will
>> happen). if stored as sequencefiles - hadoop doesn't care about newlines
>> and tabs - those are purely artifacts of streaming.
>> 
>> also - the streaming code (for unknown reasons) doesn't allow a
>> SequenceFileInputFormat. there were minor tweaks we had to make to the
>> streaming driver to allow this stuff ..
>> 
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@veoh.com]
>> Sent: Mon 4/7/2008 7:43 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: streaming + binary input/output data?
>> 
>> 
>> I don't think that binary input works with streaming because of the
>> assumption of one record per line.
>> 
>> If you want to script map-reduce programs, would you be open to a Groovy
>> implementation that avoids these problems?
>> 
>> 
>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>> 
>>> 
>>> hi,
>>> 
>>> i would like to use binary input and output data in combination with
>>> hadoop streaming.
>>> 
>>> the reason why i want to use binary data is that parsing text to float
>>> seems to consume a lot of time compared to directly reading the binary
>>> floats.
>>> 
>>> i am using a C-coded mapper (getting streaming data from stdin and
>>> writing to stdout) and no reducer.
>>> 
>>> so my question is: how do i implement binary input output in this
>>> context?
>>> as far as i understand i need to put a '\n' char at the end of each
>>> binary-'line', so hadoop knows how to split/distribute the input data
>>> among the nodes and how to collect it for output(??)
>>> 
>>> is this approach reasonable?
>>> 
>>> thanks,
>>> john
>> 
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16691343.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



        