hadoop-common-user mailing list archives

From "Jason Rennie" <jren...@gmail.com>
Subject Re: performance
Date Wed, 12 Mar 2008 15:21:48 GMT
Hmm... sounds promising :)  How do you distribute the data?  Do you use
HDFS, or pass the data directly to the individual nodes?  We really only need
to do the map operation, like you do.  We need to distribute a matrix * vector
operation, so we want rows of the matrix distributed across different
nodes.  Map could perform all the dot products, which is the heavy lifting
in what we're trying to do.  Might want to do a reduce after that, not
sure...
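
For what it's worth, a map-only streaming mapper for that could look roughly
like the sketch below (Python, and purely illustrative: it assumes the vector
is small enough to ship to every node, e.g. with -file, as a hypothetical
shared_vector.txt with one component per line, and that each input record is
"row_id<TAB>space-separated row values"):

#!/usr/bin/env python
# Sketch of a map-only dot-product mapper for Hadoop streaming.
# Assumes shared_vector.txt holds one vector component per line and
# each input line looks like "row_id<TAB>v1 v2 v3 ...".
import sys

vector = [float(x) for x in open("shared_vector.txt")]

for line in sys.stdin:
    row_id, values = line.rstrip("\n").split("\t", 1)
    row = [float(x) for x in values.split()]
    dot = sum(a * b for a, b in zip(row, vector))
    # With zero reduce tasks this line goes straight into the output files.
    sys.stdout.write("%s\t%r\n" % (row_id, dot))

Running that with -numReduceTasks 0 and no reducer, like your wc example,
would leave one file of dot products per map task.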

Jason

On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <munkey906@gmail.com>
wrote:

> There is overhead in grabbing local data and moving it in and out of the
> system, especially if you are running a MapReduce job (like wc) which ends
> up mapping, sorting, copying, reducing, and writing again.
>
> One way I've found to get around the overhead is to use Hadoop streaming
> and perform map-only tasks.  While they recommend doing it properly with
>
> hstream -mapper /bin/cat -reducer /bin/wc
>
> I tried:
>
> hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> -numReduceTasks 0
>
> (hstream is just an alias to do Hadoop streaming)
>
> And saw an immediate speedup on a 1 Gig and 10 Gig file.
>
> In the end you may have several output files, each with the word count for
> its portion of the input, but adding those counts together is pretty quick
> and easy.
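>
> For example, a quick way to total them up (just a sketch, assuming each
> part-* file under myoutput contains plain wc-style output, i.e. line, word,
> and byte counts):
>
> #!/usr/bin/env python
> # Sum the wc-style counts across the map-only output files.
> import glob, sys
>
> totals = [0, 0, 0]
> for path in glob.glob("myoutput/part-*"):
>     for line in open(path):
>         fields = line.split()
>         if len(fields) >= 3:
>             totals = [t + int(f) for t, f in zip(totals, fields[:3])]
>
> sys.stdout.write("lines=%d words=%d bytes=%d\n" % tuple(totals))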
>
> My recommendation is to explore how you can get away with either identity
> reduces, identity maps, or no reduces at all.
>
> Theo
>

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
