hadoop-common-user mailing list archives

From "Jeff Hammerbacher" <jeff.hammerbac...@gmail.com>
Subject Re: Hadoop streaming performance problem
Date Mon, 31 Mar 2008 22:38:57 GMT
We have 40 or so engineers at Facebook who use only streaming, so no
matter how jury-rigged the solution might be, it's been immensely
valuable for developer productivity.  As we found with Thrift, letting
developers write code in their language of choice has many benefits,
including development speed, reuse of common libraries, and using the
right language for the right task.  It would be interesting to bake
something like streaming into the Hadoop core rather than forcing
tab-delimited data through pipes.  I know, I know: it's open source,
go ahead and add it.  Just sayin...
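
For anyone following along, a typical streaming invocation pipes
line-oriented, tab-delimited records through an external program, roughly
like this (the jar path and script name are placeholders):

  hadoop jar contrib/streaming/hadoop-streaming.jar \
    -input /user/jeff/input \
    -output /user/jeff/output \
    -mapper my_mapper \
    -file my_mapper

The mapper reads records as lines on stdin and writes key<TAB>value lines
on stdout.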

On Mon, Mar 31, 2008 at 3:31 PM, Travis Brady <travis@mochimedia.com> wrote:
> This brings up two interesting issues:
>
>  1. Hadoop streaming is a potentially very powerful tool, especially for
>  those of us who don't work in Java for whatever reason
>  2. If Hadoop streaming is "at best a jury-rigged solution" then that should
>  be made known somewhere on the wiki.  If it's really not supposed to be
>  used, why is it provided at all?
>
>  thanks,
>  Travis
>
>
>
>  On Mon, Mar 31, 2008 at 3:23 PM, lin <novacore@gmail.com> wrote:
>
>  > Well, we would like to use Hadoop streaming because our current system is
>  > in C++ and it is easier to migrate to Hadoop streaming. Also, we have very
>  > strict performance requirements, and Java seems to be too slow: I rewrote
>  > the first program in Java and it runs 4 to 5 times slower than the C++ one.
>  >
>  > On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <tdunning@veoh.com> wrote:
>  >
>  > >
>  > >
>  > > Hadoop can't split a gzipped file, so you will only get as many maps as
>  > > you have files.
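>  > >
>  > > If you want more maps from a single file, you can split and compress it
>  > > yourself before uploading, e.g. (sizes and paths are illustrative):
>  > >
>  > >   split -l 1000000 input.txt chunk-
>  > >   gzip chunk-*
>  > >   bin/hadoop dfs -put chunk-aa.gz input/chunk-aa.gz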
>  > >
>  > > Why the obsession with Hadoop streaming?  It is at best a jury-rigged
>  > > solution.
>  > >
>  > >
>  > > On 3/31/08 3:12 PM, "lin" <novacore@gmail.com> wrote:
>  > >
>  > > > Does Hadoop automatically decompress the gzipped file? I only have a
>  > > > single input file. Does it have to be split and then gzipped?
>  > > >
>  > > > I gzipped the input file and Hadoop only created one map task. Java is
>  > > > still using more than 90% of the CPU.
>  > > >
>  > > > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <andreas@kostyrka.org>
>  > > > wrote:
>  > > >
>  > > >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>  > > >> provide the input files gzipped. Not a great difference (e.g. 50%
>  > > >> slower when not gzipped, plus it took more than twice as long to
>  > > >> upload the data to HDFS-on-S3 in the first place), but still probably
>  > > >> relevant.
>  > > >>
>  > > >> Andreas
>  > > >>
>  > > >> On Monday, 31.03.2008, at 13:30 -0700, lin wrote:
>  > > >>> I'm running custom map programs written in C++. What the programs do
>  > > >>> is very simple. For example, in program 2, for each input line
>  > > >>>         ID node1 node2 ... nodeN
>  > > >>> the program outputs
>  > > >>>         node1 ID
>  > > >>>         node2 ID
>  > > >>>         ...
>  > > >>>         nodeN ID
>  > > >>>
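>  > > >>> For illustration, a minimal sketch of such a mapper in C++, assuming
>  > > >>> whitespace-separated fields (this is not our actual program):
>  > > >>>
>  > > >>> #include <iostream>
>  > > >>> #include <sstream>
>  > > >>> #include <string>
>  > > >>>
>  > > >>> int main() {
>  > > >>>     std::string line;
>  > > >>>     // Streaming delivers each input record as a line on stdin.
>  > > >>>     while (std::getline(std::cin, line)) {
>  > > >>>         std::istringstream fields(line);
>  > > >>>         std::string id, node;
>  > > >>>         if (!(fields >> id)) continue;  // skip empty lines
>  > > >>>         // Emit one "node<TAB>ID" record per node; streaming treats
>  > > >>>         // the text before the first tab as the key.
>  > > >>>         while (fields >> node)
>  > > >>>             std::cout << node << '\t' << id << '\n';
>  > > >>>     }
>  > > >>>     return 0;
>  > > >>> }
>  > > >>>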
>  > > >>> Each node has 4GB to 8GB of memory. The Java memory setting is
>  > > >>> -Xmx300m.
>  > > >>>
>  > > >>> I agree that it depends on the scripts. I tried replicating the
>  > > >>> computation for each input line 10 times and saw significantly better
>  > > >>> speedup. But it is still pretty bad that Hadoop streaming has such a
>  > > >>> big overhead for simple programs.
>  > > >>>
>  > > >>> I also tried writing program 1 with the Hadoop Java API. I got almost
>  > > >>> a 1000% speedup on the cluster.
>  > > >>>
>  > > >>> Lin
>  > > >>>
>  > > >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <munkey906@gmail.com>
>  > > >>> wrote:
>  > > >>>
>  > > >>>> Are you running a custom map script or a standard Linux command like
>  > > >>>> WC? If custom, what does your script do?
>  > > >>>>
>  > > >>>> How much RAM do you have? What are your Java memory settings?
>  > > >>>>
>  > > >>>> I used the following setup:
>  > > >>>>
>  > > >>>> 2 dual-core CPUs, 16 GB RAM, 1000 MB Java heap size on an empty box
>  > > >>>> with a 4-task max.
>  > > >>>>
>  > > >>>> I got the following results:
>  > > >>>>
>  > > >>>> WC: 30-40% speedup
>  > > >>>> Sort: 40% speedup
>  > > >>>> Grep: 5X slowdown (turns out this was due to what you described
>  > > >>>> above... grep is just very highly optimized for the command line)
>  > > >>>> Custom Perl script (essentially a for loop which matches each row
>  > > >>>> of a dataset to a set of 100 categories): 60% speedup
>  > > >>>>
>  > > >>>> So I do think that it depends on your script... and some other
>  > > >>>> settings of yours.
>  > > >>>>
>  > > >>>> Theo
>  > > >>>>
>  > > >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <novacore@gmail.com> wrote:
>  > > >>>>
>  > > >>>>> Hi,
>  > > >>>>>
>  > > >>>>> I am looking into using Hadoop streaming to parallelize some simple
>  > > >>>>> programs. So far the performance has been pretty disappointing.
>  > > >>>>>
>  > > >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
>  > > >>>>> capacity of each node is 2. The Hadoop version is 0.15.
>  > > >>>>>
>  > > >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
>  > > >>>>> in standalone (on a single CPU core). Program 2 runs for 5 minutes
>  > > >>>>> on the Hadoop cluster and 4.5 minutes in standalone. Both programs
>  > > >>>>> run as map-only jobs.
>  > > >>>>>
>  > > >>>>> I understand that there is some overhead in starting up tasks, and
>  > > >>>>> in reading from and writing to the distributed file system. But
>  > > >>>>> these do not seem to explain all the overhead. Most map tasks are
>  > > >>>>> data-local. I modified program 1 to output nothing and saw the same
>  > > >>>>> magnitude of overhead.
>  > > >>>>>
>  > > >>>>> The output of top shows that the majority of the CPU time is
>  > > >>>>> consumed by Hadoop Java processes (e.g.
>  > > >>>>> org.apache.hadoop.mapred.TaskTracker$Child). So I added a profiling
>  > > >>>>> option (-agentlib:hprof=cpu=samples) to mapred.child.java.opts.
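>  > > >>>>> For reference, the corresponding hadoop-site.xml entry looks
>  > > >>>>> roughly like this (with the heap size we use):
>  > > >>>>>
>  > > >>>>> <property>
>  > > >>>>>   <name>mapred.child.java.opts</name>
>  > > >>>>>   <value>-Xmx300m -agentlib:hprof=cpu=samples</value>
>  > > >>>>> </property>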
>  > > >>>>>
>  > > >>>>> The profile results show that most of the CPU time is spent in the
>  > > >>>>> following methods:
>  > > >>>>>
>  > > >>>>>   rank   self    accum   count  trace  method
>  > > >>>>>   1      23.76%  23.76%  1246   300472 java.lang.UNIXProcess.waitForProcessExit
>  > > >>>>>   2      23.74%  47.50%  1245   300474 java.io.FileInputStream.readBytes
>  > > >>>>>   3      23.67%  71.17%  1241   300479 java.io.FileInputStream.readBytes
>  > > >>>>>   4      16.15%  87.32%   847   300478 java.io.FileOutputStream.writeBytes
>  > > >>>>>
>  > > >>>>> And their stack traces show that these methods are for interacting
>  > > >>>>> with the map program.
>  > > >>>>>
>  > > >>>>>
>  > > >>>>> TRACE 300472:
>  > > >>>>>         java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
>  > > >>>>>         java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>  > > >>>>>         java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>  > > >>>>>
>  > > >>>>> TRACE 300474:
>  > > >>>>>         java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>  > > >>>>>         java.io.FileInputStream.read(FileInputStream.java:199)
>  > > >>>>>         java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>  > > >>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>  > > >>>>>         java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>  > > >>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>  > > >>>>>         java.io.FilterInputStream.read(FilterInputStream.java:66)
>  > > >>>>>         org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>  > > >>>>>         org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>  > > >>>>>         org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)
>  > > >>>>>
>  > > >>>>> TRACE 300479:
>  > > >>>>>         java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
>  > > >>>>>         java.io.FileInputStream.read(FileInputStream.java:199)
>  > > >>>>>         java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>  > > >>>>>         java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>  > > >>>>>         java.io.FilterInputStream.read(FilterInputStream.java:66)
>  > > >>>>>         org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>  > > >>>>>         org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
>  > > >>>>>         org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)
>  > > >>>>>
>  > > >>>>> TRACE 300478:
>  > > >>>>>         java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
>  > > >>>>>         java.io.FileOutputStream.write(FileOutputStream.java:260)
>  > > >>>>>         java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>  > > >>>>>         java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>  > > >>>>>         java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>  > > >>>>>         java.io.DataOutputStream.flush(DataOutputStream.java:106)
>  > > >>>>>         org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>  > > >>>>>         org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>  > > >>>>>         org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>  > > >>>>>         org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
>  > > >>>>>
>  > > >>>>>
>  > > >>>>> I don't understand why Hadoop streaming needs so much CPU time to
>  > > >>>>> read from and write to the map program. Note that it takes 23.67%
>  > > >>>>> of the time to read from the standard error of the map program,
>  > > >>>>> while the program does not output any errors at all!
>  > > >>>>>
>  > > >>>>> Does anyone know any way to get rid of this seemingly unnecessary
>  > > >>>>> overhead in Hadoop streaming?
>  > > >>>>>
>  > > >>>>> Thanks,
>  > > >>>>>
>  > > >>>>> Lin
>  > > >>>>>
>  > > >>>>
>  > > >>>>
>  > > >>>>
>  > > >>>> --
>  > > >>>> Theodore Van Rooy
>  > > >>>> http://greentheo.scroggles.com
>  > > >>>>
>  > > >>
>  > >
>  > >
>  >
>
>
>
>  --
>  Travis Brady
>  www.mochiads.com
>
