mahout-user mailing list archives

From Sean Owen <>
Subject Re: LDA on single node is much faster than 20 nodes
Date Wed, 07 Sep 2011 08:39:03 GMT
I see. On EMR, I think the setting you need
is At least that's what I see digging
through my old EMR code.

Dhruv, yes, a lot of these settings are just suggestions to the framework. I
am not entirely clear on the heuristics used, but I do know that Jake is
right: it's driven primarily off the input size, and how much input Hadoop
thinks should go to each worker. You can override these things, but beware:
you're probably incurring more overhead than is sensible. It might still make
sense if you're running on a dedicated cluster where those resources are
otherwise completely idle, but it's not in general a good idea on a shared
cluster.

Chris, are you sure only one mapper was running in your last example? I don't
see an indication one way or the other in the log output.

I don't know LDA well. It sounds like you are saying that LDA mappers take a
long time on a little input, which would suggest that's the bottleneck. I
don't know one way or the other there... but if that's true, you are right
that we can bake in settings to force an unusually small input split size.
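To make the split arithmetic concrete, here is a rough sketch of why a 46 MB
input yields so few map tasks. It assumes the classic FileInputFormat
behavior (one split per HDFS block by default) and the old 64 MB default
block size; the numbers are illustrative, not measurements from Chris's
cluster.

```shell
# Sketch: default split count for a 46 MB input with 64 MB HDFS blocks.
INPUT_MB=46
BLOCK_MB=64

# ceil(INPUT_MB / BLOCK_MB) splits -> that many map tasks by default
SPLITS=$(( (INPUT_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "default: $SPLITS map task(s)"

# To get ~20 mappers instead, the split size has to be capped near
# INPUT_MB / 20, i.e. around 2 MB per split.
TARGET_MAPPERS=20
SPLIT_MB=$(( INPUT_MB / TARGET_MAPPERS ))
echo "cap splits at about ${SPLIT_MB} MB each for ~${TARGET_MAPPERS} mappers"
```

On Hadoop of that era the cap would typically be applied through a
mapred-site property such as `mapred.max.split.size` (in bytes); treat that
property name as my recollection, not gospel.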

And Jake's last point echoes Sebastian's and Ted's: on EMR, fewer big machines
are better. One of their biggest instances is probably more economical than
20 small ones. And, as a bonus, all of the data and processing will stay on
one machine. (The master is still a separate instance, of course. I use 1
small machine for the master, and make it a reserved instance rather than a
spot instance, so it's really unlikely to die.) You are vulnerable to that
one machine dying, but for all practical purposes it's going to be a win for
you.

Definitely use the spot instance market! Ted's right that the pricing is crazy.

On Tue, Sep 6, 2011 at 11:57 PM, Chris Lu <> wrote:

> Thanks. Very helpful to me!
> I tried to change the setting of "". However, the number of
> map tasks is still just one, on one of the 20 machines.
> ./elastic-mapreduce --create --alive \
>   --num-instances 20 --name "LDA" \
>   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>   --bootstrap-name "Configuring number of map tasks per job" \
>   --args "-m,"
> Does anyone know how to configure the number of mappers?
> Again, the input size is only 46M.
> Chris
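For reference, a sketch of how the configure-hadoop bootstrap action was
typically used to push a mapred-site property. The property name and value
here are my guesses at what a working invocation might look like (46 MB
spread over 20 workers suggests a split cap of roughly 2-3 MB), not a
reconstruction of whatever was stripped from Chris's command.

```shell
# Hypothetical example: in the old EMR configure-hadoop bootstrap action,
# "-m,KEY=VALUE" sets a property in mapred-site.xml. The property and the
# 2,500,000-byte value are illustrative assumptions, not a confirmed fix.
./elastic-mapreduce --create --alive \
  --num-instances 20 --name "LDA" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --bootstrap-name "Configuring number of map tasks per job" \
  --args "-m,mapred.max.split.size=2500000"
```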
> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>> Well, I think that using small instances is a disaster in general. The
>> performance that you get from them can vary easily by an order of
>> magnitude. My own preference for real work is either m2xl or cc14xl. The
>> latter machines give you nearly bare metal performance and no noisy
>> neighbors. The m2xl is typically very much underpriced on the spot market.
>> Sean is right about your job being misconfigured. The Hadoop overhead is
>> considerable and you have only given it two threads to overcome that
>> overhead.
>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen<>  wrote:
>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>> though you have 20 machines available. Hadoop determines the number of
>>> mappers based on input size, and your input isn't so big that it thinks
>>> you need 20 workers. It's launching 33 reducers, so your cluster is put
>>> to use there. But it's no wonder you're not seeing anything like 20x
>>> speedup in the mapper.
>>> You can of course force it to use more mappers, and that's probably a
>>> good idea here, perhaps. More mappers means more overhead of spinning up
>>> mappers to process less data, and Hadoop's guess indicates that it thinks
>>> it's not efficient to use 20 workers. If you know that those other 18 are
>>> otherwise idle, my guess is you'd benefit from just making it use 20.
>>> If this were a general large cluster where many people are taking
>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>> sure you want to do otherwise.
>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu<>  wrote:
>>>> Thanks for all the suggestions!
>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>> Amazon small machines. On my local single node, it got to iteration 19
>>>> in the same 85 hours.
>>>> Here is a section of the Amazon log output. It covers the start of
>>>> iteration 1, and the span between iteration 4 and iteration 5.
>>>> The number of map tasks is set to 2. Should it be larger, or related to
>>>> the number of CPU cores?
