mahout-user mailing list archives

From sharath jagannath <sharathjagann...@gmail.com>
Subject Re: Clustering with KMeans
Date Wed, 09 Feb 2011 03:36:48 GMT
Yeah, that was not the only cluster that was formed; there were around 200
clusters. I played around with t1 and t2 and now have 30 clusters, which I
am using to cluster the new data points with CanopyDriver.

I get the following exception when CanopyDriver.clusterData tries to
find the closest canopy:

org.apache.mahout.math.CardinalityException: Required cardinality 23 but got 1234
    at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
    at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
    at org.apache.mahout.clustering.canopy.CanopyClusterer.findClosestCanopy(CanopyClusterer.java:139)
    at org.apache.mahout.clustering.canopy.CanopyClusterer.emitPointToClosestCanopy(CanopyClusterer.java:129)
    at org.apache.mahout.clustering.canopy.ClusterMapper.map(ClusterMapper.java:46)
    at org.apache.mahout.clustering.canopy.ClusterMapper.map(ClusterMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)


The code that tries to find the closest canopy:


CanopyDriver.clusterData(conf, new Path("test-vectors", "tfidf-vectors"),
    new Path(canopyCentroidsOutputPath, "clusters-0"),
    canopyCentroidsOutputPath, measure, t1, t2, false);


* test-vectors/tfidf-vectors - path to the new test data, created using the
previously mentioned customized data converter and Seq2Sparse.

* canopyCentroidsOutputPath, "clusters-0" - path to the canopy centroids
that were formed during the training phase.

* measure - SquaredEuclideanDistanceMeasure, the same one used in the
training phase.

* t1 = 2000, t2 = 1900

* sequential - false or true; in either case it throws the
CardinalityException in the RandomAccessSparseVector.dot method.

The first line of the dot method is the cardinality comparison, and that is
what throws the exception. I wanted to use canopy clustering as a quick
"online" clustering of the new data points (though it is less accurate than
KMeans). Am I not supposed to use canopy that way?
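
To make the failing check concrete, here is a minimal sketch in plain Java of
the kind of cardinality comparison described above. It does not use Mahout:
CardinalityCheckSketch and SparseVec are hypothetical names invented for this
illustration, and IllegalArgumentException stands in for Mahout's
CardinalityException. It only shows that a dot product between a vector
declared with 23 dimensions and one declared with 1234 is rejected before any
arithmetic happens.

```java
import java.util.HashMap;
import java.util.Map;

public class CardinalityCheckSketch {

    /** Hypothetical stand-in for a sparse vector: a declared size plus its non-zero entries. */
    static final class SparseVec {
        final int cardinality;
        final Map<Integer, Double> values = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        SparseVec set(int index, double value) {
            if (index < 0 || index >= cardinality) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            values.put(index, value);
            return this;
        }

        /** Dot product; the first step is the size comparison, as in the message above. */
        double dot(SparseVec other) {
            if (cardinality != other.cardinality) {
                // Assumption: IllegalArgumentException replaces Mahout's CardinalityException.
                throw new IllegalArgumentException(
                        "Required cardinality " + cardinality + " but got " + other.cardinality);
            }
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : values.entrySet()) {
                Double o = other.values.get(e.getKey());
                if (o != null) {
                    sum += e.getValue() * o;
                }
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        SparseVec a = new SparseVec(23).set(0, 1.0);
        SparseVec b = new SparseVec(1234).set(0, 1.0);
        try {
            a.dot(b);
        } catch (IllegalArgumentException e) {
            // prints "Required cardinality 23 but got 1234"
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, two vectors can only be compared when they were declared with
the same number of dimensions, regardless of how many entries are non-zero.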


Thanks everybody, especially Kate. Your responses to the previous emails are
much appreciated.


Thanks and Regards,

Sharath
