mahout-user mailing list archives

From Chris Lu <...@atypon.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 23:44:53 GMT
I see, thanks!

Seems this should be built into the Mahout LDA algorithms, since the input file
is usually not very large but really needs parallel map processing.

Chris

On 09/06/2011 04:28 PM, Jake Mannix wrote:
> You can't just set the block size, you need to modify the InputFormat to
> change
> the number of splits.  For example, you can do:
>
>      FileInputFormat.setMaxInputSplitSize(job, maxSizeInBytes);
>
> and you'll force it to make more splits in your data set, and hence more
> mappers.
>
>    -jake
>
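
A minimal sketch, assuming the new org.apache.hadoop.mapreduce API, of where that call
would sit in a driver; the job name, input path, and the 46 MB / 20 arithmetic below
are illustrative values taken from this thread, not anything Mahout sets itself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lda-iteration");              // hypothetical job name
        FileInputFormat.addInputPath(job, new Path(args[0]));  // corpus path

        // Roughly 46 MB of input / 20 desired mappers, i.e. ~2.4 MB per split,
        // forces the InputFormat to produce about 20 splits and hence 20 mappers.
        long maxSizeInBytes = (46L << 20) / 20;
        FileInputFormat.setMaxInputSplitSize(job, maxSizeInBytes);

        // ... set mapper/reducer/output classes as usual, then submit the job.
      }
    }
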
> On Tue, Sep 6, 2011 at 4:12 PM, Dhruv Kumar<dkumar@ecs.umass.edu>  wrote:
>
>> On Tue, Sep 6, 2011 at 6:57 PM, Chris Lu<clu@atypon.com>  wrote:
>>
>>> Thanks. Very helpful to me!
>>>
>>> I tried changing the "mapred.map.tasks" setting. However, the number of
>>> map tasks is still just one, running on one of the 20 machines.
>>>
>>> ./elastic-mapreduce --create --alive \
>>>    --num-instances 20 --name "LDA" \
>>>    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>>>    --bootstrap-name "Configuring number of map tasks per job" \
>>>    --args "-m,mapred.map.tasks=40"
>>>
>>> Does anyone know how to configure the number of mappers?
>>> Again, the input size is only 46M.
>>>
>>> Chris
>>>
>>>
>>> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>>>
>>>> Well, I think that using small instances is a disaster in general. The
>>>> performance that you get from them can vary easily by an order of
>>>> magnitude. My own preference for real work is either m2xl or cc14xl. The
>>>> latter machines give you nearly bare metal performance and no noisy
>>>> neighbors. The m2xl is typically very much underpriced on the spot market.
>>>>
>>>> Sean is right about your job being misconfigured. The Hadoop overhead is
>>>> considerable and you have only given it two threads to overcome that
>>>> overhead.
>>>>
>>>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen<srowen@gmail.com>   wrote:
>>>>
>>>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>>>> though you have 20 machines available. Hadoop determines the number of
>>>>> mappers based on input size, and your input isn't so big that it thinks
>>>>> you need 20 workers. It's launching 33 reducers, so your cluster is put
>>>>> to use there. But it's no wonder you're not seeing anything like 20x
>>>>> speedup in the mapper.
>>>>>
>>>>> You can of course force it to use more mappers, and that's probably a
>>>>> good idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more
>>>>> overhead of spinning up mappers to process less data, and Hadoop's guess
>>>>> indicates that it thinks it's not efficient to use 20 workers. If you
>>>>> know that those other 18 are otherwise idle, my guess is you'd benefit
>>>>> from just making it use 20.
>>>>>
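
For what it's worth, -D properties like mapred.map.tasks reach the job through
GenericOptionsParser when the driver is run via ToolRunner, which is, as far as I
know, how the Mahout drivers are launched. A tiny, hypothetical Tool that just
echoes the value:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical Tool: run as "hadoop jar ... MapTaskHint -Dmapred.map.tasks=20"
    // and the -D value shows up in the Configuration before run() is called.
    public class MapTaskHint extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // For map tasks this is only a hint; the InputFormat still decides the
        // actual number of splits (the reduce count, by contrast, is taken literally).
        System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapTaskHint(), args));
      }
    }
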
>> Sean,
>>
>> I too have always been confused about how Hadoop decides the number of
>> mappers, so you could help my understanding here...
>>
>> Is -Dmapred.map.tasks just a hint to the framework for the number of mappers
>> (just like using the combiner is a hint), or does it actually set the number
>> of workers to that number (provided our input is large enough)?
>>
>> The reason I ask is that on
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces, it is mentioned that
>> the framework uses the HDFS block size to decide on the number of mapper
>> workers to be invoked. Should we be setting that parameter instead?
>>
>>
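
As far as I can tell, the new org.apache.hadoop.mapreduce FileInputFormat ignores
mapred.map.tasks entirely and sizes splits from the block size and the min/max split
size, which would explain both the wiki's advice and why the bootstrap setting above
had no visible effect. A rough sketch of that computation, using the 46 MB input from
this thread and assuming a default 64 MB dfs.block.size:

    // Paraphrase of the split-size rule in the new-API FileInputFormat:
    // splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    public class SplitMath {
      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
        long input     = 46L << 20;       // the 46 MB corpus from this thread
        long blockSize = 64L << 20;       // assuming the common 64 MB block size
        long minSize   = 1L;              // default minimum split size
        long maxSize   = Long.MAX_VALUE;  // default maximum split size

        // Defaults: split size = 64 MB, so 46 MB of input is one split, one mapper.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));

        // With setMaxInputSplitSize(job, input / 20): ~2.4 MB splits, ~20 mappers.
        System.out.println(computeSplitSize(blockSize, minSize, input / 20));
      }
    }
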
>>>>> If this were a general large cluster where many people are taking
>>>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>>>> sure you want to do otherwise.
>>>>>
>>>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu<clu@atypon.com>   wrote:
>>>>>
>>>>>> Thanks for all the suggestions!
>>>>>>
>>>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>>>> Amazon small machines. On my local single node, it got to iteration 19
>>>>>> in the same 85 hours.
>>>>>>
>>>>>> Here is a section of the Amazon log output. It covers the start of
>>>>>> iteration 1, and the period between iteration 4 and iteration 5.
>>>>>>
>>>>>> The number of map tasks is set to 2. Should it be larger, or related to
>>>>>> the number of CPU cores?
>>>>>>

