mahout-user mailing list archives

From Chris Lu <...@atypon.com>
Subject Re: LDA on single node is much faster than 20 nodes
Date Tue, 06 Sep 2011 18:02:08 GMT
Thanks for all the suggestions!

All the inputs are the same. Four iterations took 85 hours on 20 Amazon
Small instances, while my local single node reached iteration 19 in the
same 85 hours.
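
As a quick back-of-the-envelope check on those wall-clock numbers (plain arithmetic on the iteration counts above, nothing Mahout-specific):

```python
# Iteration throughput implied by the wall-clock numbers above.
hours = 85.0
cluster_iters = 4    # 20 Amazon Small instances, 85 hours
local_iters = 19     # single local node, same 85 hours

cluster_rate = cluster_iters / hours   # iterations per hour
local_rate = local_iters / hours

# Per iteration, the single node is 19/4 = 4.75x faster than the cluster.
slowdown = local_rate / cluster_rate
print(f"cluster: {cluster_rate:.3f} it/h, local: {local_rate:.3f} it/h, "
      f"cluster slowdown: {slowdown:.2f}x")
```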

Here is a section of the Amazon log output.
It covers the start of iteration 1 and the transition from iteration 4
to iteration 5.

The number of map tasks is set to 2. Should it be larger, or tied to the 
number of CPU cores?
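
For what it's worth, with FileInputFormat the map-task count is driven by the number of input splits, and mapred.map.tasks is only a hint. A minimal sketch of the split rule, assuming the common max(minSize, min(maxSize, blockSize)) formula and a 64 MB block size (the exact rule and property names vary by Hadoop version and API):

```python
# Rough sketch of how Hadoop's FileInputFormat derives the number of
# map tasks from input splits (mapred.map.tasks is only a hint; the
# exact rule varies by Hadoop version/API).
#   splitSize ~= max(minSplitSize, min(maxSplitSize, blockSize))

def num_splits(file_bytes, block_size, min_size=1, max_size=2**63 - 1):
    split = max(min_size, min(max_size, block_size))
    return max(1, -(-file_bytes // split))  # ceiling division

MB = 1024 * 1024

# A single 47 MB file under a 64 MB block size gives one split, hence
# one map task, no matter how many nodes the cluster has:
print(num_splits(47 * MB, block_size=64 * MB))                   # 1

# Forcing smaller splits (e.g. a ~8 MB max split size) would spread
# the map work across ~6 tasks:
print(num_splits(47 * MB, block_size=64 * MB, max_size=8 * MB))  # 6
```

If only one split is produced, only one machine does map work regardless of cluster size; splitting the input into more files, or lowering the maximum split size, is the usual way to get map-side parallelism.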


- what kinds of machines in the single case and the cluster case?

Amazon Small type.

- did you actually complete a stage with the EMR cluster?

Yes.

- did you have any task failures?

Everything seems ok.

- were your machines swapping?

Running locally with a 1G max heap size, there is no memory swapping. Amazon's Small type has
1.7G of RAM.

- what was CPU usage?

I cannot tell.

- what was network usage?

I cannot tell. The map phase should not use much network, yet it is really slow.

- how much data was registered as having been read?  Was that reasonable?

The input size is 47M, stored on S3.


2011-09-02 21:40:42,905 INFO org.apache.mahout.clustering.lda.LDADriver 
(main): LDA Iteration 1
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): 
Default number of map tasks: 2
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): 
Default number of reduce tasks: 33
2011-09-02 21:40:46,081 INFO 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total 
input paths to process : 1
2011-09-02 21:40:50,199 INFO org.apache.hadoop.mapred.JobClient (main): 
Running job: job_201109022131_0001
2011-09-02 21:40:51,208 INFO org.apache.hadoop.mapred.JobClient (main):  
map 0% reduce 0%
2011-09-02 22:00:25,591 INFO org.apache.hadoop.mapred.JobClient (main):  
map 1% reduce 0%
2011-09-02 22:17:37,315 INFO org.apache.hadoop.mapred.JobClient (main):  
map 2% reduce 0%
2011-09-02 22:30:36,622 INFO org.apache.hadoop.mapred.JobClient (main):  
map 3% reduce 0%
2011-09-02 22:42:44,128 INFO org.apache.hadoop.mapred.JobClient (main):  
map 4% reduce 0%
2011-09-02 22:57:20,817 INFO org.apache.hadoop.mapred.JobClient (main):  
map 5% reduce 0%
2011-09-02 23:11:20,329 INFO org.apache.hadoop.mapred.JobClient (main):  
map 6% reduce 0%
2011-09-02 23:24:54,832 INFO org.apache.hadoop.mapred.JobClient (main):  
map 7% reduce 0%
2011-09-02 23:37:50,205 INFO org.apache.hadoop.mapred.JobClient (main):  
map 8% reduce 0%
...............
...............
...............
...............
2011-09-06 16:50:35,471 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 91%
2011-09-06 16:50:36,499 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 92%
2011-09-06 16:50:37,503 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 94%
2011-09-06 16:50:39,511 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 96%
2011-09-06 16:50:53,576 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 97%
2011-09-06 16:51:15,665 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 99%
2011-09-06 16:51:58,848 INFO org.apache.hadoop.mapred.JobClient (main):  
map 100% reduce 100%
2011-09-06 16:52:42,039 INFO org.apache.hadoop.mapred.JobClient (main): 
Job complete: job_201109022131_0004
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main): 
Counters: 17
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient 
(main):   Job Counters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Launched reduce tasks=33
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Rack-local map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Launched map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):   FileSystemCounters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     S3N_BYTES_READ=732125802
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     FILE_BYTES_READ=27958301147
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     S3N_BYTES_WRITTEN=684537903
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     FILE_BYTES_WRITTEN=26164696939
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):   Map-Reduce Framework
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Reduce input groups=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Combine output records=851254845
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Map input records=18285
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Reduce shuffle bytes=334153034
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Reduce output records=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Spilled Records=2093587887
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Map output bytes=19571033360
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Combine input records=2046202329
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Map output records=1223189585
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient 
(main):     Reduce input records=28242101
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver 
(main): Iteration 4 finished. Log Likelihood: -2.738323998517175E8
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver 
(main): (Old LL: -2.810158757091537E8)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver 
(main): (Rel Change: 0.02556252681208305)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver 
(main): LDA Iteration 5
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): 
Default number of map tasks: 2
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): 
Default number of reduce tasks: 33
2011-09-06 16:54:04,751 INFO 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total 
input paths to process : 1
2011-09-06 16:54:05,829 INFO org.apache.hadoop.mapred.JobClient (main): 
Running job: job_201109022131_0005
2011-09-06 16:54:06,835 INFO org.apache.hadoop.mapred.JobClient (main):  
map 0% reduce 0%
2011-09-06 17:08:40,517 INFO org.apache.hadoop.mapred.JobClient (main):  
map 1% reduce 0%
2011-09-06 17:19:20,825 INFO org.apache.hadoop.mapred.JobClient (main):  
map 2% reduce 0%
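
Two things stand out in the counters above, and a quick arithmetic pass confirms them (the values are copied from the log; the interpretation is my reading, not something Hadoop reports directly): the job spills roughly 1.7 records to disk per map output record, which usually means the map-side sort buffer (io.sort.mb) is too small for the workload, and the reported "Rel Change" is reproducible from the two log-likelihood values.

```python
# Arithmetic checks on the counter values printed in the log above.

map_output_records = 1_223_189_585
spilled_records = 2_093_587_887

# Spilled Records counts every record written to local disk during
# sort/merge on both the map and reduce sides; a ratio well above 1.0
# against map output suggests multi-pass spilling.
spill_ratio = spilled_records / map_output_records
print(f"spill ratio vs map output: {spill_ratio:.2f}")   # ~1.71

# The "Rel Change" line follows from the two log-likelihood values:
old_ll = -2.810158757091537e8   # Old LL
new_ll = -2.738323998517175e8   # Iteration 4 LL
rel_change = abs((new_ll - old_ll) / old_ll)
print(f"rel change: {rel_change:.5f}")   # ~0.02556
```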


On 09/06/2011 08:39 AM, Ted Dunning wrote:
> I think that Sean and Danny are right.  You need to give a bit more
> information.
>
> - what kinds of machines in the single case and the cluster case?
>
> - did you actually complete a stage with the EMR cluster?
>
> - did you have any task failures?
>
> - were your machines swapping?
>
> - what was CPU usage?
>
> - what was network usage?
>
> - how much data was registered as having been read?  Was that reasonable?
>
> On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen<srowen@gmail.com>  wrote:
>
>> Running on a real cluster increases the amount of work done, and
>> significantly, as compared to one node: now, data actually has to be
>> transferred on/off the machine!
>>
>> Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not
>> sure
>> what instance type you are using but I got better mileage when I used
>> larger
>> instances (and more of my own workers per instance, of course; it does that
>> for you too).
>>
>> You may have trouble correctly extrapolating from the time it takes to hit
>> 1%, as there are setup costs while the instances spin up. Try letting it
>> run a bit more to see how fast it really goes.
>>
>> Are you saying you extrapolate that it would take 1 EMR machine 1000
>> minutes to finish? That sounds quite reasonable compared to 300 minutes
>> locally. If you mean the whole 20 machines are taking 1000 minutes to
>> finish, that sounds quite bad.
>>
>>
>> On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu<clu@atypon.com>  wrote:
>>
>>> Hi,
>>>
>>> I am running LDA on 18k documents; each document has 5k terms, 300k terms
>>> in total. The number of topics is set to 100.
>>>
>>> Running LDA on Hadoop single node configuration takes about 5 hours per
>>> stage. And 20 stages would take 100 hours.
>>>
>>> However, given 20 machines, running on Amazon EMR is actually much, much
>>> slower. It takes 1000 minutes per stage. (It takes about 10 minutes for 1%
>>> mapping progress.) Reducing is much faster, counted in seconds, almost
>>> negligible.
>>>
>>> Does anyone have similar experience, or is my setup wrong?
>>>
>>> Chris
>>>
>>>

