mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Kmeans -k cluster flag not being recognised
Date Tue, 18 Nov 2014 00:24:12 GMT
1) Use Mahout 0.9. There may be slight differences from the version used in MIA, but it also
has many bug fixes.
2) k is set to 20; check the log: "--numClusters=[20]"
3) Going from memory (which could be failing me), you either give it initial clusters or you don't.
Giving it a path tells it to use the clusters there (this used to be done with Canopy, which is now deprecated).
Try leaving the reuters-initial-clusters path unset.
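
As a quick way to do the check in point 2, the parsed value of -k can be pulled out of the driver's "Command line arguments" output. This is only a sketch: the two log lines are copied from the output quoted below, and the variable names (log, k) are illustrative. In practice the text would come from the captured output of the mahout run.

```shell
# Excerpt of the driver output quoted in this thread (assumption: in a
# real run you would capture the output of `mahout kmeans` instead).
log=$(cat <<'EOF'
--maxIter=[20], --method=[mapreduce], --numClusters=[20],
--output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0],
EOF
)

# -k is logged as --numClusters; print only the number inside the brackets.
k=$(printf '%s\n' "$log" | sed -n 's/.*--numClusters=\[\([0-9][0-9]*\)\].*/\1/p')
echo "numClusters=$k"
```

If k comes back empty, the flag was not parsed; a number here means the flag itself was recognised and the failure lies elsewhere.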

On Nov 11, 2014, at 8:53 PM, Sean Farrell <drsafarrell@gmail.com> wrote:

Hi all,

I'm working through the Kmeans clustering example in 'Mahout in Action' and
I've run into an issue regarding randomly generating the initial cluster
centroids. According to MIA (and the examples on the Mahout web page) if
you set the -k flag then the algorithm will use a random seed generator to
produce initial cluster centroids for however many clusters you select
(i.e. the number after -k). However, I'm getting an illegal state exception
error saying that no clusters are found in my directory path and that I
should check my -c argument (which sets the path for the initial cluster
centroids sequence file). Reading through the output prior to the error it
seems as though the -k flag is not being recognised.

A search through the mailing list archive finds that this is not a new
problem, but I can't find a solution posted anywhere (other than one case
where upgrading from v0.7 to v0.8 fixed it). Does anyone know if this has
been solved?

Here are the commands I am using:

> mahout kmeans \
    -i /user/hdfs/Vectors/reuters-vectors/tfidf-vectors/ \
    -c /user/hdfs/Vectors/reuters-initial-clusters/ \
    -o /user/hdfs/Vectors/reuters-kmeans-clusters/ \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -cd 1.0 -k 20 -x 20 -cl


And here is the output:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/mahout/mahout-examples-0.9-cdh5.2.0-job.jar
14/11/12 15:23:56 WARN driver.MahoutDriver: No kmeans.props found on
classpath, will use command-line arguments only
14/11/12 15:23:57 INFO common.AbstractJob: Command line arguments:
{--clustering=null,
--clusters=[/user/hdfs/Vectors/reuters-initial-clusters/],
--convergenceDelta=[1.0],
--distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure],
--endPhase=[2147483647],
--input=[/user/hdfs/Vectors/reuters-vectors/tfidf-vectors/],
--maxIter=[20], --method=[mapreduce], --numClusters=[20],
--output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0],
--tempDir=[temp]}
14/11/12 15:23:57 INFO common.HadoopUtil: Deleting
/user/hdfs/Vectors/reuters-initial-clusters
14/11/12 15:23:58 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new compressor
[.deflate]
14/11/12 15:23:58 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to
/user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed
14/11/12 15:23:58 INFO kmeans.KMeansDriver: Input:
/user/hdfs/Vectors/reuters-vectors/tfidf-vectors Clusters In:
/user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed Out:
/user/hdfs/Vectors/reuters-kmeans-clusters
14/11/12 15:23:58 INFO kmeans.KMeansDriver: convergence: 1.0 max
Iterations: 20
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input
clusters found in
/user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed. Check your -c
argument.
       at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:206)
       at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:140)
       at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
       at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
       at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

