hadoop-general mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: hadoop scales but is not performant?
Date Tue, 15 Sep 2009 16:18:39 GMT
Some thoughts, inlined.


On Sep 14, 2009, at 6:14 PM, Mat Kelcey wrote:

> hi all,
> recently i've started playing with hadoop and my first learning
> experiment has surprised me
> i'm implementing a text problem; trying to extract infrequent
> phrases using a mixture of probabilistic models.
> it's a bit of a toy problem so the algorithm details aren't super
> important, though the implementation might be....
> this problem ended up being represented by a dozen or so map reduce
> jobs of various types including some aggregate steps and some manual
> joins.
> (see the bottom of this page for details
> http://matpalm.com/sip/take3_markov_chains.html)
> i've implemented each step using ruby / streaming, the code is at
> http://github.com/matpalm/sip if anyone cares.
> i ran some tests using 10 ec2 medium cpu instances across a small
> 100mb of gzipped text.
> for validation of the results i also reimplemented the entire
> algorithm as a single threaded ruby app
> my surprise comes from finding that the ruby implementation
> outperforms the 10 ec2 instances on this data size...
> i ran a few samples of different sizes with the graph at the bottom of
> http://matpalm.com/sip/part4_but_does_it_scale.html
> so why is this? here are my explanations in order of how confident i  
> am...
> a) 100mb is peanuts and hadoop was made for 1000x this size so the
> test is invalid.
Definitely the case.  At this size the job startup costs will swamp any
performance benefits of parallelism.
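
To make the startup-cost point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes every job pays a fixed startup cost and that the useful work divides evenly across workers. The 12 jobs and 10 workers come from the description above; the 30-second startup cost and the 300 seconds of single-threaded work are made-up illustrative numbers, not measurements.

```python
def estimated_runtime(work_seconds, workers, jobs, startup_seconds):
    """Toy model: each job pays a fixed startup cost, and the useful
    work is split evenly across the workers."""
    return jobs * startup_seconds + work_seconds / workers


def break_even_work(workers, jobs, startup_seconds):
    """Single-threaded work (in seconds) above which the cluster wins.

    Solves work = jobs * startup_seconds + work / workers for work.
    Requires workers > 1.
    """
    return jobs * startup_seconds * workers / (workers - 1)


# Hypothetical inputs: 300 s of single-threaded work, 30 s startup per job.
single = estimated_runtime(300, workers=1, jobs=1, startup_seconds=0)
cluster = estimated_runtime(300, workers=10, jobs=12, startup_seconds=30)
threshold = break_even_work(workers=10, jobs=12, startup_seconds=30)

print(f"single-threaded: {single:.0f} s")
print(f"10-node cluster: {cluster:.0f} s")
print(f"cluster wins above roughly {threshold:.0f} s of work")
```

With these assumed numbers the cluster spends most of its time starting jobs, so the single process finishes first. The gap only closes once the amount of real work is large relative to jobs times startup cost.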

> b) there is a better representation of this problem that uses fewer
> map/reduce passes.
> c) streaming is too slow and rewriting in java (and making use of
> techniques like chaining mappers) would speed things up
Maybe.  Streaming itself is a little slower than using Java.  I don't
know how large the penalty is for Ruby compared with Java.

> d) doing these steps, particularly the joins, in pig would be faster
Writing your joins in Pig will definitely be faster to code.  They
won't be faster to execute unless you are able to use one of Pig's
specialized join algorithms (fragment-replicate, merge, skew).  At
100 MB it's hard to see that any of those will make a big difference.
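
For anyone unfamiliar with the fragment-replicate idea, the sketch below shows the general technique in Python: the small relation is loaded into memory wherever the large relation is being read, so the join happens on the map side with no shuffle or reduce step. This is an illustration of the concept only, not Pig's actual implementation, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def build_lookup(small_rows):
    """Load the small relation into an in-memory table keyed by join key.
    In a real job every mapper would hold its own copy of this table."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return lookup


def map_side_join(large_rows, lookup):
    """Stream the large relation and emit one joined row per match.
    Rows whose key is absent from the small relation are dropped
    (inner join)."""
    for key, value in large_rows:
        for small_value in lookup.get(key, ()):
            yield key, value, small_value


# Hypothetical sample data: (token, frequency) and (token, document).
small = [("the", 0.05), ("cat", 0.001)]
large = [("the", "doc1"), ("dog", "doc1"), ("cat", "doc2")]

joined = list(map_side_join(large, build_lookup(small)))
for row in joined:
    print(row)
```

The approach only pays off when one side fits comfortably in memory, which is why it is an optimization for specific cases and not a general replacement for a reduce-side join.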

> my next steps are to rewrite some of the steps in pig to sample the  
> difference
> does anyone have any high level comments on this?
> cheers,
> mat
