hadoop-common-user mailing list archives

From Prasan Ary <voicesnthed...@yahoo.com>
Subject RE: streaming + binary input/output data?
Date Mon, 14 Apr 2008 21:32:17 GMT
  That's an interesting approach, but isn't it possible that an equivalent \n might get encoded
in the binary data?
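
That worry is well founded: the byte 0x0A shows up inside ordinary IEEE 754 float data, so any newline-delimited stream of raw binary records can be split mid-record. A minimal Java check (the class name is illustrative only):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // 0.01f encodes to 0x3C23D70A; its low byte is 0x0A, i.e. '\n'.
    // A newline-based record splitter would cut this record in half.
    public class NewlineInBinary {
        public static void main(String[] args) {
            byte[] bytes = ByteBuffer.allocate(4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .putFloat(0.01f)
                    .array();
            for (byte b : bytes) {
                if (b == '\n') {
                    System.out.println("0.01f contains a raw '\\n' byte (0x0A)");
                }
            }
        }
    }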

John Menzer <standard00@gmx.net> wrote:
So you mean you changed the Hadoop streaming source code?
Actually, I'm not really willing to change the source code if it's not
necessary.

So I thought about simply encoding the input binary data to text (e.g. with
base64) and then adding a '\n' after each line to make it splittable for
Hadoop. After reading from stdin, my C program would just have to decode it,
map/reduce it and then encode it back to base64 to write to stdout.

What do you think about that? Worth a try?
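
For what it's worth, the base64 route does sidestep the stray-byte problem: the base64 alphabet contains neither '\n' nor '\t' (streaming's default key/value separator), so one encoded record per line is safe. A rough Java sketch of the idea (the real mapper would be John's C program; the class name and the scale-by-two transformation are made up for illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.ByteBuffer;
    import java.util.Base64;

    // One base64-encoded record of raw floats per input line: decode,
    // transform the floats in place, re-encode, write one line to stdout.
    public class Base64StreamMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                byte[] record = Base64.getDecoder().decode(line.trim());
                ByteBuffer buf = ByteBuffer.wrap(record);      // byte order must match the producer's
                for (int i = 0; i + 4 <= record.length; i += 4) {
                    buf.putFloat(i, buf.getFloat(i) * 2.0f);   // placeholder "map" step
                }
                System.out.println(Base64.getEncoder().encodeToString(record));
            }
        }
    }

The cost is the base64 encode/decode on every record plus roughly 33% size overhead, which works against the original goal of avoiding text parsing.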

Joydeep Sen Sarma wrote:
> Actually, this is possible, but changes to streaming are required.
> At one point we got rid of the '\n' and '\t' separators between the
> keys and the values in the streaming code and streamed byte arrays
> directly to the scripts (which then decoded them). It worked perfectly
> fine. (In fact we were streaming Thrift-generated byte streams,
> encoded in Java land and decoded in Python land :-))
> Binary data on HDFS is best stored as SequenceFiles (if you store binary
> data in what looks to Hadoop like a text file, bad things will happen).
> Stored this way, Hadoop doesn't care about newlines and tabs; those are
> purely artifacts of streaming.
> Also, the streaming code (for unknown reasons) doesn't allow a
> SequenceFileInputFormat; there were minor tweaks we had to make to the
> streaming driver to allow this.
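
To make the SequenceFile point concrete, here is a rough sketch of loading fixed-size binary records into a SequenceFile, where record boundaries come from the container format rather than from '\n' bytes. The class name, key choice and 16-byte record layout are invented, and the createWriter signature shown is the older one, which varies between Hadoop versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    // Write raw float records as BytesWritable values keyed by record
    // number; the SequenceFile framing, not '\n', delimits records.
    public class FloatsToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(args[0]);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, BytesWritable.class);
            try {
                byte[] record = new byte[16];   // placeholder contents
                for (long i = 0; i < 100; i++) {
                    writer.append(new LongWritable(i), new BytesWritable(record));
                }
            } finally {
                writer.close();
            }
        }
    }

Streaming then still has to be told to read SequenceFiles and hand the bytes to the script in some agreed encoding, which is exactly the driver tweak described above.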
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Mon 4/7/2008 7:43 AM
> To: core-user@hadoop.apache.org
> Subject: Re: streaming + binary input/output data?
> I don't think that binary input works with streaming because of the
> assumption of one record per line.
> If you want to script map-reduce programs, would you be open to a Groovy
> implementation that avoids these problems?
> On 4/7/08 6:42 AM, "John Menzer" wrote:
>> Hi,
>> I would like to use binary input and output data in combination with
>> Hadoop streaming.
>> The reason I want to use binary data is that parsing text to float
>> seems to consume a lot of time compared to directly reading the
>> binary floats.
>> I am using a C-coded mapper (getting streaming data from stdin and
>> writing to stdout) and no reducer.
>> So my question is: how do I implement binary input/output in this
>> context?
>> As far as I understand, I need to put a '\n' char at the end of each
>> binary 'line' so Hadoop knows how to split/distribute the input data
>> among the nodes and how to collect it for output(??)
>> Is this approach reasonable?
>> Thanks,
>> John

View this message in context: http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
