mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering with KMeans
Date Wed, 09 Feb 2011 05:46:25 GMT
Sharath,

This sounds like your new vectors do not all have the same length
(cardinality) as the ones that were originally used to do the clustering.
The exception reports 23 versus 1234, which is consistent with that.
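
To illustrate the kind of check that fails here, a minimal sketch in plain
Java follows. SparseVec and CardinalityCheckSketch are hypothetical stand-ins
written for this example, not Mahout's RandomAccessSparseVector, and the
sketch throws IllegalArgumentException rather than Mahout's
CardinalityException. It only assumes, as the quoted message says, that dot()
compares cardinalities before doing any arithmetic. Which of the two vectors
in the real trace has cardinality 23 and which has 1234 cannot be told from
the message alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sparse vector; NOT Mahout's implementation.
final class SparseVec {
    final int cardinality;                                  // declared dimensionality
    final Map<Integer, Double> values = new HashMap<>();    // index -> non-zero value

    SparseVec(int cardinality) {
        this.cardinality = cardinality;
    }

    SparseVec set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException(
                "index " + index + " outside cardinality " + cardinality);
        }
        values.put(index, value);
        return this;
    }

    double dot(SparseVec other) {
        // The cardinality comparison comes first, before any arithmetic.
        if (cardinality != other.cardinality) {
            throw new IllegalArgumentException(
                "Required cardinality " + cardinality + " but got " + other.cardinality);
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            Double o = other.values.get(e.getKey());
            if (o != null) {
                sum += e.getValue() * o;
            }
        }
        return sum;
    }
}

final class CardinalityCheckSketch {
    public static void main(String[] args) {
        // Two vectors built against the same 23-term dictionary: dot() works.
        SparseVec a = new SparseVec(23).set(3, 1.0);
        SparseVec b = new SparseVec(23).set(3, 2.0);
        System.out.println(a.dot(b));

        // A vector built against a different, 1234-term dictionary: dot() refuses.
        SparseVec c = new SparseVec(1234).set(3, 2.0);
        try {
            a.dot(c);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is what is happening, the new vectors would need to be built against
the same dictionary (and therefore the same cardinality) as the training
vectors before they can be compared with the stored centroids.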

On Tue, Feb 8, 2011 at 7:36 PM, sharath jagannath <
sharathjagannath@gmail.com> wrote:

> Yeah, it was not the only cluster that was formed; there were around 200
> clusters. I played around with t1 and t2 and now I have 30 clusters, which I
> am using to cluster the new data points with CanopyDriver.
>
> I get the following exception when CanopyDriver.clusterData tries to
> find the closest Canopy.
>
> org.apache.mahout.math.CardinalityException: Required cardinality 23 but got 1234
>   at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
>   at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
>   at org.apache.mahout.clustering.canopy.CanopyClusterer.findClosestCanopy(CanopyClusterer.java:139)
>   at org.apache.mahout.clustering.canopy.CanopyClusterer.emitPointToClosestCanopy(CanopyClusterer.java:129)
>   at org.apache.mahout.clustering.canopy.ClusterMapper.map(ClusterMapper.java:46)
>   at org.apache.mahout.clustering.canopy.ClusterMapper.map(ClusterMapper.java:1)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
>
> code which is trying to find the closest canopy:
>
>
> CanopyDriver.clusterData(conf, new Path("test-vectors", "tfidf-vectors"),
>     new Path(canopyCentroidsOutputPath, "clusters-0"),
>     canopyCentroidsOutputPath, measure, t1, t2, false);
>
>
> * test-vectors/tfidf-vectors - path to the new test data, created using the
> previously mentioned customized data converter and Seq2Sparse.
>
> * canopyCentroidsOutputPath, "clusters-0" - Path to the canopy centroids
> that were formed during the training phase.
>
> * measure - SquaredEuclideanDistanceMeasure, the same one used in the
> training phase.
>
> * t1 - 2000, t2 - 1900
>
> * Sequential false/true - in either case it throws the CardinalityException
> in the RandomAccessSparseVector.dot method.
>
> The first line of the dot method is the cardinality comparison, which throws
> the exception. I wanted to use canopy clustering as a quick "online"
> clustering of the new data points (though not as accurate as KMeans).
> Am I not supposed to use canopy that way?
>
>
> Thanks everybody, especially Kate. Your responses to the previous emails are
> much appreciated.
>
>
> Thanks and Regards,
>
> Sharath
>
