mahout-user mailing list archives

From Chris Lu <>
Subject Re: LDA on single node is much faster than 20 nodes
Date Wed, 07 Sep 2011 18:09:57 GMT
Thanks! I suspect "" would not work,
because I could not find it in the current Hadoop code.

Changing either "mapred.max.split.size" or "dfs.block.size", and also
changing "", worked for me.
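As a sketch of the arithmetic behind shrinking the split size: Hadoop creates roughly one mapper per input split, so to get about 20 map tasks from the 46 MB input mentioned later in this thread, the max split size has to drop below input size / 20. The jar name and driver in the comment are hypothetical; only the property name "mapred.max.split.size" comes from this thread.

```shell
# Sketch: pick a max split size that yields ~20 map tasks for a 46 MB input.
# Hadoop creates roughly one mapper per split, so split size ~ input / mappers.
INPUT_BYTES=$((46 * 1024 * 1024))
TARGET_MAPPERS=20
SPLIT_BYTES=$((INPUT_BYTES / TARGET_MAPPERS))
echo "mapred.max.split.size=$SPLIT_BYTES"

# Hypothetical invocation passing the property to a Hadoop job:
# hadoop jar mahout-job.jar <driver> -Dmapred.max.split.size=$SPLIT_BYTES ...
```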

On EMR, I found I can monitor progress by viewing the Hadoop
administration page, running on port 9100. (You need to set up port
forwarding to actually view it.) It's much better than the logs.
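For reference, a sketch of the kind of SSH port forwarding this refers to. The key path and master DNS name below are placeholders, not values from this thread:

```shell
# Sketch: forward local port 9100 to the EMR master node so the Hadoop
# administration page can be opened at http://localhost:9100.
MASTER_DNS="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"   # placeholder
TUNNEL_CMD="ssh -i ~/mykey.pem -N -L 9100:localhost:9100 hadoop@${MASTER_DNS}"
echo "$TUNNEL_CMD"
# Running the command keeps an SSH session open; Ctrl-C closes the tunnel.
```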

It's great that I can get the job done in parallel. Otherwise, it seems
to lose the point of running Mahout on Hadoop MR.

Thanks, everyone, for the ideas!


On 09/07/2011 01:39 AM, Sean Owen wrote:
> I see. On EMR, I think the setting you need is "". At least that's what I
> see digging through my old EMR code.
> Dhruv, yes a lot of these settings are just suggestions to the framework. I
> am not entirely clear on the heuristics used, but I do know that Jake is
> right, that it's driven primarily off the input size, and how much input it
> thinks should go with a worker. You can override these things, but do
> beware, you're probably incurring more overhead than is sensible. It might
> still make sense if you're running on a dedicated cluster where those
> resources are otherwise completely idle, but, not in general a good idea in
> a shared cluster.
> Chris are you sure one mapper was running in your last example? I don't see
> an indication of that from the log output one way or the other.
> I don't know LDA well. It sounds like you are saying that LDA mappers take a
> long time on a little input, which would suggest that's a bottleneck. I
> don't know one way or the other there... but if that's true you are right
> that we can bake in settings to force an unusually small input split size.
> And Jake's last point echoes Sebastian and Ted's: on EMR, fewer big machines
> are better. One of their biggest instances is probably more economical than
> 20 small ones. And, as a bonus, all of the data and processing will stay on
> one machine. (Of course, the master is still a separate instance. I use 1
> small machine for the master, and make it a reserved instance, not a spot
> instance, so it's really unlikely to die.) Of course you're vulnerable to
> that one machine dying, but, for all practical purposes it's going to be a
> win for you.
> Definitely use the spot instance market! Ted's right that pricing is crazy
> good.
> On Tue, Sep 6, 2011 at 11:57 PM, Chris Lu<>  wrote:
>> Thanks. Very helpful to me!
>> I tried to change the setting of "". However, the number of map tasks is
>> still just one, on one of the 20 machines.
>> ./elastic-mapreduce --create --alive \
>>    --num-instances 20 --name "LDA" \
>>    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>>    --bootstrap-name "Configuring number of map tasks per job" \
>>    --args "-m,"
>> Does anyone know how to configure the number of mappers?
>> Again, the input size is only 46M.
>> Chris
>> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>>> Well, I think that using small instances is a disaster in general.  The
>>> performance that you get from them can vary easily by an order of
>>> magnitude.
>>> My own preference for real work is either m2xl or cc14xl. The latter
>>> machines give you nearly bare-metal performance and no noisy neighbors.
>>> The m2xl is typically very much underpriced on the spot market.
>>> Sean is right about your job being misconfigured.  The Hadoop overhead is
>>> considerable and you have only given it two threads to overcome that
>>> overhead.
>>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen<>   wrote:
>>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>>> though you have 20 machines available. Hadoop determines the number of
>>>> mappers based on input size, and your input isn't so big that it thinks
>>>> you need 20 workers. It's launching 33 reducers, so your cluster is put
>>>> to use there. But it's no wonder you're not seeing anything like 20x
>>>> speedup in the mapper.
>>>> You can of course force it to use more mappers, and that's probably a
>>>> good idea here, perhaps. More mappers means more overhead of spinning
>>>> up mappers to process less data, and Hadoop's guess indicates that it
>>>> thinks it's not efficient to use 20 workers. If you know that those
>>>> other 18 are otherwise idle, my guess is you'd benefit from just making
>>>> it use 20.
>>>> If this were a general large cluster where many people are taking
>>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>>> sure you want to do otherwise.
>>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu<>   wrote:
>>>>   Thanks for all the suggestions!
>>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>>> Amazon small machines. On my local single node, it got to iteration 19
>>>>> in the same 85 hours.
>>>>> Here is a section of the Amazon log output. It covers the start of
>>>>> iteration 1, and the span between iteration 4 and iteration 5.
>>>>> The number of map tasks is set to 2. Should it be larger, or related
>>>>> to the number of CPU cores?
