hadoop-common-user mailing list archives

From "Theodore Van Rooy" <munkey...@gmail.com>
Subject Re: Fastest way to do grep via hadoop streaming
Date Wed, 19 Mar 2008 15:26:44 GMT
Thanks for the response, very informative!  I'll spend some time looking
through the streaming code and try to get an even better understanding of
the streaming process.

On Tue, Mar 18, 2008 at 4:44 PM, Joydeep Sen Sarma <jssarma@facebook.com>
wrote:

> i hope this is not an error in setup - but a slowdown of many multiples is
> not surprising (though not nice).
>
> just think about the number of times hadoop will copy/scan data around (as
> opposed to 'grep' - which is probably ultra optimized by this time) ..
>
> - starting from getting bytes out of a file - they will first be buffered
> in a java buffered stream (copy #1)
> - then the buffered stream will be scanned for lines worth of data and
> then copied into a Text (#2)
> - the Text will then be written out to a buffered output stream (#3) to
> the streaming script.
> - perhaps, someone will tell me why the buffered output stream is flushed
> every iteration by Streaming - but it is:
>        clientOut_.flush();
>  in any case - that's likely a system call every single line of input data
> that copies into kernel space (#4)
>
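The cost of the per-line flush (copy #4 in the list above) can be illustrated with a rough Python sketch. This is not Hadoop's code: Python's `io.BufferedWriter` stands in for the Java buffered output stream, and `CountingSink` and `count_raw_writes` are hypothetical names that count how often data reaches the underlying sink, as a proxy for write system calls.

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical raw sink: counts writes that reach it (a proxy for write(2) calls)."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        # Every call here models one trip into the kernel.
        n = len(b)
        self.calls += 1
        self.nbytes += n
        return n


def count_raw_writes(lines, flush_every_line, buffer_size=64 * 1024):
    """Write all lines through a buffered writer and return (raw_calls, raw_bytes)."""
    sink = CountingSink()
    out = io.BufferedWriter(sink, buffer_size=buffer_size)
    for line in lines:
        out.write(line)
        if flush_every_line:
            out.flush()  # mirrors the per-line clientOut_.flush() quoted above
    out.flush()  # push out whatever is still buffered
    return sink.calls, sink.nbytes
```

Calling `count_raw_writes(lines, True)` and `count_raw_writes(lines, False)` on the same list of short byte strings should show roughly one raw write per line in the first case and only a handful in the second, while the byte totals stay equal.
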
> once the data comes out of grep - we get another bunch - but who cares -
> it's 2% of the data.
>
> i don't know the dfs stack well enough to count copies there - but we can
> probably bet that there's quite a few there as well. (for one - we will be
> scanning the data at least once to do the crc check)
>
> with 4 threads pounding the cpu and so much copying going around (and this
> is not counting that java itself is reputedly memory intensive) - we are
> probably memory bound by this time (which shows up as cpu bound).
>
> sigh.
>
>
>
>
> -----Original Message-----
> From: Theodore Van Rooy [mailto:munkey906@gmail.com]
> Sent: Tue 3/18/2008 3:09 PM
> To: core-user@hadoop.apache.org
> Subject: Fastest way to do grep via hadoop streaming
>
> I've been benchmarking hadoop streaming against just regular old command
> line grep.
>
> I set the job to run 4 tasks at a time per box, with one box (with 4
> processors).  The file is a 54 GB file with <100 bytes per line (DFS block
> size 128 MB).  I grep an item that shows up in about 2% of the lines in
> the data set.
>
> And then I set
> -mapper "/bin/grep myregexp"
> -numReduceTasks 0
>
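For comparison with the `/bin/grep` mapper above, here is a minimal sketch of what a Python streaming mapper might look like. The function name `grep_stream` is hypothetical, and nothing here suggests it would beat `/bin/grep`: it still pays the same streaming pipe overhead, plus the cost of the Python interpreter.

```python
import re
import sys


def grep_stream(pattern, src, dst):
    """Copy lines from src to dst when they match pattern; return the match count.

    Works on bytes so no time is spent decoding lines that are thrown away.
    """
    regex = re.compile(pattern)
    matched = 0
    for line in src:
        if regex.search(line):
            dst.write(line)
            matched += 1
    return matched


if __name__ == "__main__" and len(sys.argv) > 1:
    # As a streaming mapper: the pattern is the first argument, input arrives
    # on stdin, and matching lines go to stdout.
    grep_stream(sys.argv[1].encode(), sys.stdin.buffer, sys.stdout.buffer)
```

Saved under a name of your choosing and shipped with the job, a script like this would take the place of `/bin/grep myregexp` in the `-mapper` option.
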
> MapReduce gives me a time to complete on average of about 45 minutes.
>
> Command Line Unix gives me a time to complete of about 7 minutes.
>
> Then I did the same with a much smaller file (1 GB) and still got a large
> gap (MR = 3 min, Linux = 7 seconds).
>
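The timings above are easier to compare as throughput. The sketch below is plain arithmetic on the reported numbers, assuming the file size means 54 gigabytes and 1 GB = 1024 MB; the `RUNS` table and the function names are hypothetical.

```python
import math

MB_PER_GB = 1024  # assumption: binary gigabytes

# (file size in GB, MapReduce seconds, command-line grep seconds), as reported
RUNS = {
    "large": (54, 45 * 60, 7 * 60),
    "small": (1, 3 * 60, 7),
}


def throughput_mb_per_s(size_gb, seconds):
    """Average scan rate in MB/s."""
    return size_gb * MB_PER_GB / seconds


def split_count(size_gb, block_mb=128):
    """Number of block-sized input splits, i.e. roughly the number of map tasks."""
    return math.ceil(size_gb * MB_PER_GB / block_mb)


def summarize(runs=RUNS):
    """Return {name: (MR MB/s, grep MB/s, slowdown factor)} for each run."""
    summary = {}
    for name, (size_gb, mr_s, grep_s) in runs.items():
        summary[name] = (
            throughput_mb_per_s(size_gb, mr_s),
            throughput_mb_per_s(size_gb, grep_s),
            mr_s / grep_s,
        )
    return summary
```

On these numbers the large run works out to about 20 MB/s under MapReduce against about 130 MB/s for plain grep (a slowdown of roughly 6x). The small run is roughly 26x slower, which suggests that fixed per-job costs, rather than per-byte costs, dominate there.
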
> Does anyone know of a better/faster way to do grep via streaming?
>
> Is there a better, more optimized version written in Java or Python?
>
> Last, why would the method I am using take so long?  I've determined that
> some of the time is write time (output) from the mappers... but could it
> really be that much overhead due to read time?
>
> Thanks for your help!
> --
> Theodore Van Rooy
> http://greentheo.scroggles.com
>
>


-- 
Theodore Van Rooy
http://greentheo.scroggles.com
