hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Hadoop streaming or pipes ..
Date Thu, 05 Apr 2012 19:49:22 GMT
Streaming and pipes do very similar things.  Both fork/exec a separate process that runs
whatever you want it to run.  The JVM that is running Hadoop then communicates with this
process, sending it the input data and reading the processing results back.  The difference
between the two is the channel: streaming uses stdin/stdout for this communication, so
preexisting tools like grep, sed and awk can be used as is.  Pipes uses a custom protocol
with a C++ library to communicate.  The C++ library is tagged with SWIG-compatible data so
that it can be wrapped to provide APIs in other languages like Python or Perl.
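
To make the stdin/stdout contract concrete, here is a minimal sketch of a word-count style
mapper that could be handed to streaming.  It assumes the default text handling, where each
output line is split into key and value at the first tab.  The names map_lines and run are
made up for this illustration and are not part of any Hadoop API.

```python
import sys


def map_lines(lines):
    """Yield one "word<TAB>1" record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read input records from stdin and write mapped records to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")
```

A real mapper script would end with a call to run().  Nothing in it is Hadoop-specific: it
only reads lines and writes lines, which is why ordinary command-line tools fit the same slot.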

I am not sure what the performance difference is between the two, but in my own work I have
seen a significant performance penalty from using either of them, because there is a somewhat
large overhead of sending all of the data out to a separate process just to read it back in
again.
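
The round trip described above can be sketched outside Hadoop.  The example below is only an
analogy, not a benchmark of streaming or pipes: it applies the same trivial transformation
once in-process and once by serializing every record to a child process and parsing the
output back.  The names CHILD, in_process and via_child are hypothetical, and records are
assumed to contain no embedded newlines.

```python
import subprocess
import sys

# Child program: read lines from stdin, write them upper-cased to stdout.
CHILD = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())\n"


def in_process(records):
    """Transform the records directly, with no serialization."""
    return [r.upper() for r in records]


def via_child(records):
    """Send every record to a separate process and read the results back."""
    payload = "".join(r + "\n" for r in records)
    done = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout.splitlines()
```

Both functions return the same list, but via_child also pays for process startup, encoding,
and two copies of the data.  Timing the two with time.perf_counter on a large input gives a
rough feel for that kind of overhead.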

--Bobby Evans


On 4/5/12 1:54 PM, "Mark question" <markq2011@gmail.com> wrote:

Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark

