mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 19:09:48 GMT
Well, I think that using small instances is a disaster in general.  The
performance that you get from them can vary easily by an order of magnitude.
My own preference for real work is either m2xl or cc14xl.  The latter
machines give you nearly bare metal performance and no noisy neighbors.  The
m2xl is typically very much underpriced on the spot market.

Sean is right about your job being misconfigured.  The Hadoop overhead is
considerable and you have only given it two threads to overcome that
overhead.
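
To make Sean's point (quoted below) concrete, here is a rough sketch of how the old `org.apache.hadoop.mapred.FileInputFormat` picks a split size for a single splittable file, and therefore how many mappers it launches. This is a simplified model and not the actual Hadoop code: the function names are made up for illustration, the 64 MB block size is an assumed default, the 100 MB input is a hypothetical figure (the thread does not give the real input size), and the real implementation also lets the last split run about 10% over.

```python
# Simplified model of the old-API FileInputFormat split heuristic.
# Assumptions: one splittable input file, 64 MB HDFS blocks,
# mapred.min.split.size left at 1 byte. Names here are hypothetical.

MB = 1024 * 1024


def compute_split_size(goal_size, min_size, block_size):
    # Split size is the goal size capped at one block, but never
    # smaller than the configured minimum.
    return max(min_size, min(goal_size, block_size))


def estimate_map_tasks(total_bytes, requested_maps,
                       block_size=64 * MB, min_split=1):
    # mapred.map.tasks is only a hint: it sets the "goal" split size.
    goal_size = total_bytes // max(requested_maps, 1)
    split_size = compute_split_size(goal_size, min_split, block_size)
    # Ceiling division: a partial trailing split still needs a mapper.
    return -(-total_bytes // split_size)


if __name__ == "__main__":
    input_bytes = 100 * MB  # hypothetical input size
    print(estimate_map_tasks(input_bytes, 2))   # hint of 2 maps
    print(estimate_map_tasks(input_bytes, 20))  # hint of 20 maps
```

Under these assumptions a hint of 2 yields 2 mappers and a hint of 20 yields 20, which is why passing -Dmapred.map.tasks=20 should help when the input is smaller than a few blocks. For a much larger input the block size takes over and the hint stops mattering (a 1 GB file gives 16 mappers even with a hint of 2).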

On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <srowen@gmail.com> wrote:

> That's your biggest issue, certainly. Only 2 mappers are running, even
> though you have 20 machines available. Hadoop determines the number of
> mappers based on input size, and your input isn't so big that it thinks you
> need 20 workers. It's launching 33 reducers, so your cluster is put to use
> there. But it's no wonder you're not seeing anything like 20x speedup
> in the mapper.
>
> You can of course force it to use more mappers, and that's probably a good
> idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more overhead
> of spinning up mappers to process less data, and Hadoop's guess indicates
> that it thinks it's not efficient to use 20 workers. If you know that those
> other 18 are otherwise idle, my guess is you'd benefit from just making it
> use 20.
>
> If this were a general large cluster where many people are taking advantage
> of the workers, then I'd trust Hadoop's guesses until you are sure you
> want to do otherwise.
>
> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <clu@atypon.com> wrote:
>
> > Thanks for all the suggestions!
> >
> > All the inputs are the same. It takes 85 hours for 4 iterations on 20
> > Amazon small machines. On my local single node, it got to iteration 19
> > in the same 85 hours.
> >
> > Here is a section of the Amazon log output.
> > It covers the start of iteration 1, and between iteration 4 and iteration
> > 5.
> >
> > The number of map tasks is set to 2. Should it be larger or related to
> > number of CPU cores?
> >
> >
>
