mahout-user mailing list archives

From Chris Lu <...@atypon.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 22:57:17 GMT
Thanks. Very helpful to me!

I tried changing the "mapred.map.tasks" setting.  However, the number of
map tasks is still just one, running on only one of the 20 machines.

./elastic-mapreduce --create --alive \
    --num-instances 20 --name "LDA" \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --bootstrap-name "Configuring number of map tasks per job" \
    --args "-m,mapred.map.tasks=40"

Does anyone know how to configure the number of mappers?
Again, the input is only 46 MB.
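
One thing I still plan to try (a sketch, untested, and it assumes the Mahout
job reads splittable input, e.g. uncompressed SequenceFiles, through the
new-API FileInputFormat): since mapred.map.tasks is only a hint, cap the
split size instead, so 46 MB of input is carved into roughly 20 splits:

./elastic-mapreduce --create --alive \
    --num-instances 20 --name "LDA" \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --bootstrap-name "Cap split size to force ~20 mappers" \
    --args "-m,mapred.max.split.size=2400000"

(46 MB / 2.4 MB per split gives about 19-20 map tasks, one per node; the
split arithmetic is sketched in more detail after the quoted thread below.)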

Chris

On 09/06/2011 12:09 PM, Ted Dunning wrote:
> Well, I think that using small instances is a disaster in general.  The
> performance you get from them can easily vary by an order of magnitude.
> My own preference for real work is either m2xl or cc14xl.  The latter
> machines give you nearly bare-metal performance and no noisy neighbors.  The
> m2xl is typically very much underpriced on the spot market.
>
> Sean is right about your job being misconfigured.  The Hadoop overhead is
> considerable and you have only given it two threads to overcome that
> overhead.
>
> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <srowen@gmail.com> wrote:
>
>> That's your biggest issue, certainly. Only 2 mappers are running, even
>> though you have 20 machines available. Hadoop determines the number of
>> mappers based on input size, and your input isn't so big that it thinks you
>> need 20 workers. It's launching 33 reducers, so your cluster is put to use
>> there. But it's no wonder you're not seeing anything like a 20x speedup in
>> the mapper.
>>
>> You can of course force it to use more mappers, and that's probably a good
>> idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more overhead
>> of spinning up mappers to process less data, and Hadoop's guess indicates
>> that it thinks it's not efficient to use 20 workers. If you know that those
>> other 18 are otherwise idle, my guess is you'd benefit from just making it
>> use 20.
>>
>> If this were a general large cluster where many people are taking advantage
>> of the workers, then I'd trust Hadoop's guesses until you are sure you want
>> to do otherwise.
>>
>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <clu@atypon.com> wrote:
>>
>>> Thanks for all the suggestions!
>>>
>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>> Amazon small machines. On my local single node, it got to iteration 19 in
>>> the same 85 hours.
>>>
>>> Here is a section of the Amazon log output. It covers the start of
>>> iteration 1 and the span between iterations 4 and 5.
>>>
>>> The number of map tasks is set to 2. Should it be larger, or tied to the
>>> number of CPU cores?
>>>
>>>
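
For reference, here is a back-of-the-envelope sketch of the split arithmetic
behind Hadoop's guess (the block size and file count below are assumptions,
not values read off this cluster):

    input size       = 46 MB
    dfs.block.size   = 64 MB   (a common default)
    splits per file  = ceil(46 MB / 64 MB) = 1
    2 input files    -> 2 map tasks, whatever mapred.map.tasks says

    with mapred.max.split.size = 2.4 MB:
    splits = ceil(46 MB / 2.4 MB) = 20 -> ~20 map tasks, one per node

The CPU-cores question is a separate knob: mapred.tasktracker.map.tasks.maximum
sets how many map slots each TaskTracker runs concurrently, and it is usually
set near the number of cores on the node.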

