mahout-user mailing list archives

From fx MA XIAOJUN <xiaojun...@fujixerox.co.jp>
Subject RE: reduce is too slow in StreamingKmeans
Date Tue, 18 Mar 2014 01:49:38 GMT
Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?
I want to try the -rskm option in streaming kmeans.
In Mahout 0.8, setting -rskm to true causes errors.
I heard that the bug was fixed in 0.9, so I upgraded from 0.8 to 0.9.


The Hadoop I installed is cdh5-MRv1, which corresponds to Hadoop 0.20, not Hadoop 2.x (YARN).
cdh5-MRv1 ships a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28), compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 distribution.
It turned out that "mahout kmeans" runs very well on MapReduce.
However, "mahout streamingkmeans" runs properly in sequential mode but fails in MapReduce mode.

If the problem were an incompatibility between Hadoop and Mahout, I would not expect "mahout kmeans" to run properly either.

Is mahout 0.9 compatible with Hadoop 0.20?





-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xiaojun.ma@fujixerox.co.jp> wrote:

Thank you for your quick reply.

As to -km, I thought the log was log10 rather than ln. I was wrong...
This time I set -km 140000 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too both with 0.8 and 0.9. 

So, I uninstalled mahout 0.8, and installed mahout 0.9 in order to use -rskm option.

Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------


It seems you are trying to execute on Hadoop 2, while Mahout 0.9 has been built with the Hadoop 1.x
profile, hence the error you are seeing.
If you would like to test on Hadoop 2, work off the present trunk and build the code with the Hadoop
2 profile, like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.
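
[Editor's sketch, not part of the original message.] The diagnosis above can be expressed as a small check over a stack trace. The function name `diagnose` and the wording of its result are hypothetical. It relies on the fact that org.apache.hadoop.mapreduce.JobContext is a class in the Hadoop 1.x API and an interface in the Hadoop 2.x API, and it only recognizes the specific error message quoted in this thread.

```python
import re

# Matches the JVM message, tolerating a line wrap after the comma.
_PATTERN = re.compile(
    r"IncompatibleClassChangeError: Found (interface|class) "
    r"(org\.apache\.hadoop\.[\w.$]+),\s+but (class|interface) was expected"
)


def diagnose(trace):
    """Return a short hint for a Hadoop API mismatch, or None if not found."""
    match = _PATTERN.search(trace)
    if match is None:
        return None
    found, name, expected = match.groups()
    if found == "interface" and expected == "class":
        # The jar was compiled where `name` is a class (Hadoop 1.x API),
        # but the runtime classpath provides it as an interface (Hadoop 2.x API).
        return name + ": jar built against Hadoop 1.x API, running on Hadoop 2.x API"
    # Reverse case, assumed symmetric for illustration only.
    return name + ": jar built against Hadoop 2.x API, running on Hadoop 1.x API"


if __name__ == "__main__":
    sample = (
        'Exception in thread "main" java.lang.IncompatibleClassChangeError: '
        "Found interface org.apache.hadoop.mapreduce.JobContext,\n"
        "but class was expected"
    )
    print(diagnose(sample))
```

A hint like this only points at the likely cause; the remedy suggested in the thread is still to rebuild Mahout with the matching Hadoop profile.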





-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the slow performance
that you have been experiencing. 

How did you come up with -km 63000?

Given that you would like 10000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n)
= 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value
of -km in your case (km = k * ln(n)).
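
[Editor's sketch, not part of the original message.] The arithmetic above can be reproduced with a few lines of Python. The function names are hypothetical and this is not Mahout code. It only evaluates km = k * ln(n), plus the log10 variant, which shows where a value near 63000 would come from.

```python
import math


def estimate_km(k, n):
    """Return k * ln(n), rounded to the nearest integer."""
    if k <= 0 or n <= 1:
        raise ValueError("k must be positive and n must be greater than 1")
    return round(k * math.log(n))


def estimate_km_log10(k, n):
    """Same estimate using log10 instead of ln (the mistaken variant)."""
    if k <= 0 or n <= 1:
        raise ValueError("k must be positive and n must be greater than 1")
    return round(k * math.log10(n))


if __name__ == "__main__":
    print(estimate_km(10000, 2_000_000))        # 145087
    print(estimate_km_log10(10000, 2_000_000))  # 63010
```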

I'm not sure whether that will fix your reduce being stuck at 76% forever, but it's definitely worth
a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's
an issue with the -rskm option in Mahout 0.9 and trunk today when executing in MR mode, but
it definitely works in the non-MR (-xm sequential) mode in 0.9.











On Monday, February 17, 2014 9:05 PM, Sylvia Ma <Xiaojun.ma@fujixerox.co.jp> wrote:

I am using Mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that the reduce phase of mahout
streamingkmeans is extremely slow.

For example:
With a dataset of 2000000 objects, 128 variables, I would like to get 10000 clusters.

The command I executed is the following:
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours.
However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned Hadoop as follows:
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 

I tried to assign enough memory but the reduce is still very very very slow.


Why does the reduce take so much time?
And what can I do to speed up the job?

I wonder if it would help to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot try it...




Yours Sincerely,
Sylvia Ma


