hadoop-common-user mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject FW: streaming + binary input/output data?
Date Mon, 14 Apr 2008 22:59:08 GMT
 

Observing a few emails on this list, I think the following email
exchange between me and John may be of interest to a broader audience.

 

Runping

 

 

________________________________

From: Runping Qi 
Sent: Sunday, April 13, 2008 8:58 AM
To: 'JJ'
Subject: RE: streaming + binary input/output data?

 

 

 

That is basically what I envisioned originally.

 

One issue is the data format of streaming mapper output and the format
of streaming reducer output.

Those data are parsed by the streaming framework into key/value pairs.
The framework assumes that the key and the value are separated by a tab
character, and that key/value pairs are separated by a newline ("\n").

That means the keys and values cannot contain those two characters. If the
mapper and the reducer can encode those characters, then it will be fine.

Encoding the values with base64 will do it. Things related to keys are a
bit trickier, since the framework will apply a compare function to them
in order to do the sorting (and partitioning).

However, in most cases, it will be acceptable to avoid binary data for
keys.
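A minimal sketch of that value-side encoding, in Python rather than C (the
function name and sample record here are illustrative, not from any actual
mapper):

```python
import base64

def emit_pair(key, value):
    # Streaming separates the key from the value with a tab and pairs
    # with a newline, so base64-encode the binary value to keep both
    # characters out of the payload.
    return key + "\t" + base64.b64encode(value).decode("ascii") + "\n"

# A value containing both forbidden characters survives intact:
line = emit_pair("record-0", b"\t\x00\n\xff")
```

Since base64 output is plain ASCII with no tabs or newlines, the framework
can split the line safely no matter what bytes the original value held.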

 

Another issue is to read binary input data and write binary data to dfs.

This issue can be addressed by implementing custom InputFormat and
OutputFormat classes (only the users know how to parse a specific binary
data format).

For each input key/value pair, the streaming framework basically writes
the following to the stdin of the streaming mapper:

    key.toString() + "\t" + value.toString() + "\n"

 

As long as you implement the toString methods to ensure proper base64
encoding for the value (and the key if necessary), then you will be
fine.
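Assuming the value's toString() applies base64 as described, the read side
in the mapper is then a single split-and-decode per stdin line (a Python
sketch; the function name is illustrative):

```python
import base64

def parse_line(line):
    # The framework writes key.toString() + "\t" + value.toString() + "\n",
    # so split on the first tab and undo the base64 encoding of the value.
    key, _, encoded = line.rstrip("\n").partition("\t")
    return key, base64.b64decode(encoded)
```

Splitting on the first tab only matters if the key itself is allowed to be
tab-free but the convention leaves everything after the first tab as value.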

 

So, in summary, all these issues can be addressed by the user's code.

Initially, I was wondering whether the framework can be extended somehow
so that the user may only need to set some configuration variables to
handle binary data.

However, it seems that it is still unclear what such an extension should
look like for a broad class of applications.

Maybe the best approach is for each user to do something like what I
outlined above to address his/her specific problem.

 

Hope this helps.

 

Runping

 

 

 

________________________________

From: mailtojohnboy@gmail.com [mailto:mailtojohnboy@gmail.com] On Behalf
Of JJ
Sent: Sunday, April 13, 2008 8:18 AM
To: Runping Qi
Subject: Re: streaming + binary input/output data?

 

thx for the info,
what do you think about the idea of encoding the binary data with base64
to text before streaming it with hadoop?

John

2008/4/13, Runping Qi <runping@yahoo-inc.com>:


No implementation/solution yet.
If there are more real use cases/user interests, then somebody may have
enough interest to provide a patch.

Runping


> -----Original Message-----
> From: standard00@gmx.net [mailto:standard00@gmx.net]
> Sent: Sunday, April 13, 2008 7:30 AM
> To: Runping Qi
> Subject: RE: streaming + binary input/output data?
>
> i just read the jira. these are interesting suggestions, but how do they
> translate into a solution for my problem/question? has all or at least
> some of this been implemented or not?
>
> thx
> John
>
> Runping Qi wrote:
> >
> >
> > Actually, there is an old jira about the same issue:
> > https://issues.apache.org/jira/browse/HADOOP-1722
> >
> > Runping
> >
> >
> >> -----Original Message-----
> >> From: John Menzer [mailto:standard00@gmx.net]
> >> Sent: Saturday, April 12, 2008 2:45 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: RE: streaming + binary input/output data?
> >>
> >>
> >> so you mean you changed the hadoop streaming source code?
> >> actually i am not really willing to change the source code if it's not
> >> necessary.
> >>
> >> so i thought about simply encoding the input binary data to text (e.g.
> >> with base64) and then adding a '\n' after each line to make it
> >> splittable for streaming.
> >> after reading from stdin my C program would just have to decode it,
> >> map/reduce it and then encode it back to base64 to write to stdout.
> >>
> >> what do you think about that? worth a try?
> >>
> >>
> >>
> >> Joydeep Sen Sarma wrote:
> >> >
> >> > actually - this is possible - but changes to streaming are required.
> >> >
> >> > at one point - we had gotten rid of the '\n' and '\t' separators
> >> > between the keys and the values in the streaming code and streamed
> >> > byte arrays directly to scripts (and then decoded them in the
> >> > script). it worked perfectly fine. (in fact we were streaming
> >> > thrift-generated byte streams - encoded in java land and decoded in
> >> > python land :-))
> >> >
> >> > the binary data on hdfs is best stored as sequencefiles (if u store
> >> > binary data in (what looks to hadoop as) a text file - then bad
> >> > things will happen). if stored this way - hadoop doesn't care about
> >> > newlines and tabs - those are purely artifacts of streaming.
> >> >
> >> > also - the streaming code (for unknown reasons) doesn't allow a
> >> > SequenceFileInputFormat. there were minor tweaks we had to make to
> >> > the streaming driver to allow this stuff ..
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: Ted Dunning [mailto:tdunning@veoh.com]
> >> > Sent: Mon 4/7/2008 7:43 AM
> >> > To: core-user@hadoop.apache.org
> >> > Subject: Re: streaming + binary input/output data?
> >> >
> >> >
> >> > I don't think that binary input works with streaming because of the
> >> > assumption of one record per line.
> >> >
> >> > If you want to script map-reduce programs, would you be open to a
> >> > Groovy implementation that avoids these problems?
> >> >
> >> >
> >> > On 4/7/08 6:42 AM, "John Menzer" <standard00@gmx.net> wrote:
> >> >
> >> >>
> >> >> hi,
> >> >>
> >> >> i would like to use binary input and output data in combination
> >> >> with hadoop streaming.
> >> >>
> >> >> the reason why i want to use binary data is that parsing text to
> >> >> float seems to consume a lot of time compared to directly reading
> >> >> the binary floats.
> >> >>
> >> >> i am using a C-coded mapper (getting streaming data from stdin and
> >> >> writing to stdout) and no reducer.
> >> >>
> >> >> so my question is: how do i implement binary input/output in this
> >> >> context?
> >> >> as far as i understand i need to put a '\n' char at the end of each
> >> >> binary 'line', so hadoop knows how to split/distribute the input
> >> >> data among the nodes and how to collect it for output(??)
> >> >>
> >> >> is this approach reasonable?
> >> >>
> >> >> thanks,
> >> >> john
> >> >
> >> >
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
> >
> Quoted from:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16658687.html

 

