There is a known and documented difference between the sequential and distributed versions
of Canopy (see the first #7 under Running Canopy Clustering at
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering): the distributed
version runs two canopy passes, whereas the sequential version does
only one. But I don't understand why the DistanceMeasures are returning different values for
Sparse and Dense vectors. I also don't understand how you are getting DenseVectors for your
centroids in the first place; which version are you running? With trunk (and I think also
0.4 since this has not been changed recently) I get RandomAccessSparseVectors from the Mapper,
even running Synthetic Control which starts with DenseVectors.
Jeff
-----Original Message-----
From: gabeweb [mailto:gabriel_webster@htc.com]
Sent: Thursday, February 10, 2011 1:43 AM
To: mahout-user@lucene.apache.org
Subject: Problem in distributed canopy clustering
Hi, I think there is a significant problem in the distributed canopy
clusterer. I've been comparing the in-memory version to the distributed
version (clustering users in the GroupLens database), and they behave
completely differently. Firstly, different T1/T2 parameters are required to
get the same number of clusters, even when the data and similarity metric
are exactly the same. Secondly, even when I have tuned the parameters to
get the same number of clusters, the distribution of cluster sizes is very
different: in particular, using e.g. Tanimoto distance, if there are N
clusters, the distributed version likes to create N-1 singleton clusters,
and put all the remaining vectors into the remaining cluster.
I have traced this to the fact that given a single similarity metric,
distances between sparse vectors tend to have a different range than
distances between dense vectors. The distributed version first clusters the
(sparse) original vectors in each mapper, and then it takes the (dense) centroid vectors
output by each mapper and applies the same canopy clustering using the same
T1/T2 parameters. I confirmed this by using a single mapper and simply
turning off the clustering of the reducing step (i.e., have the reducer just
output the same centroids that are input to it); in this case, the
clustering is fine (somewhat obviously, perhaps, because this makes the
distributed algorithm behave exactly like the in-memory version).
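
To make the two-stage behavior described above concrete, here is a minimal Python sketch. It is not Mahout's code: the function names, the dict-based sparse vectors, and the simplified canopy rule are all assumptions made for illustration. The point it shows is that the second pass applies the same T1/T2 to centroids rather than to original vectors.

```python
from collections import defaultdict

def tanimoto_distance(a, b):
    """Tanimoto distance for vectors stored as {index: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return 1.0 - dot / denom if denom > 0 else 0.0

def centroid(members):
    """Component-wise mean; the result has a non-zero entry wherever any member does."""
    total = defaultdict(float)
    for m in members:
        for k, v in m.items():
            total[k] += v
    return {k: v / len(members) for k, v in total.items()}

def canopy_pass(points, t1, t2, distance=tanimoto_distance):
    """One simplified canopy pass (an assumption, not Mahout's exact logic)."""
    canopies = []  # each canopy: {"center": first point, "members": [points]}
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(c["center"], p)
            if d < t1:
                c["members"].append(p)   # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True    # close enough that it starts no new canopy
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return [centroid(c["members"]) for c in canopies]

def distributed_canopy(partitions, t1, t2):
    """Pass 1 per partition ("mapper"), then pass 2 over the centroids ("reducer")."""
    mapper_centroids = [c for part in partitions for c in canopy_pass(part, t1, t2)]
    return canopy_pass(mapper_centroids, t1, t2)  # same T1/T2, different kind of input
```

Replacing the last line of `distributed_canopy` with `return mapper_centroids` corresponds to the "reducer turned off" experiment.
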
Specifically, with Tanimoto distance and the reducer effectively turned off,
the average distance between original vectors is 0.984, and with T1 = T2 =
0.983 with 10% of the GroupLens data, I get 24 clusters. Then if I turn on
the reducer, I only get one cluster, because the average distance between
the dense centroids output by the mapper drops to 0.235, and so every
centroid is now within T1 of every other centroid. If I want a similar
number of clusters in the unmodified distributed version, I have to decrease
T1/T2 to 0.939, which gives 23 clusters, but much less evenly distributed
(the largest cluster now contains 6779 vectors, which is 97% of the input
vectors, as opposed to 2684 in the in-memory/turned-off-reducer version),
due to some property of the mapper having generated many more clusters (257)
as a tradeoff for the T1/T2 now being appropriate for the different
similarity values of the reducer stage.
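
The drop in average distance can be reproduced on a toy example. The sketch below uses made-up data (six hypothetical users, each with two non-zero entries out of six items), not the GroupLens data. It only illustrates the mechanism: original vectors share few non-zero positions, while their centroids are non-zero almost everywhere and so overlap heavily.

```python
from collections import defaultdict
from itertools import combinations

def tanimoto_distance(a, b):
    """Tanimoto distance for vectors stored as {index: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return 1.0 - dot / denom if denom > 0 else 0.0

def centroid(members):
    """Component-wise mean of a list of sparse vectors."""
    total = defaultdict(float)
    for m in members:
        for k, v in m.items():
            total[k] += v
    return {k: v / len(members) for k, v in total.items()}

# Toy data (hypothetical): two groups of three users; every item appears once per group.
group_a = [{0: 1.0, 1: 1.0}, {2: 1.0, 3: 1.0}, {4: 1.0, 5: 1.0}]
group_b = [{0: 1.0, 2: 1.0}, {1: 1.0, 4: 1.0}, {3: 1.0, 5: 1.0}]

# Average distance between the original sparse vectors.
pairs = list(combinations(group_a + group_b, 2))
mean_raw = sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Distance between the two centroids, which are non-zero in all six positions.
centroid_dist = tanimoto_distance(centroid(group_a), centroid(group_b))

print("mean distance between original vectors:", mean_raw)
print("distance between the two centroids:   ", centroid_dist)
```

In this deliberately extreme case the two centroids coincide, so a threshold tuned to the original vectors would merge them.
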
Is this a known shortcoming of distributed canopy? Or am I missing
something? It seems to me that for this to work, different T1/T2 parameters
would be needed for the mapper and reducer steps. That would be easy to
program, but it would make tuning the parameters a lot harder, unless
there were some clever way to automatically adjust the parameters based on
how sparse the vectors being clustered were.
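
One possible shape for such an automatic adjustment, purely as an illustrative assumption and not anything Mahout provides: measure the average pairwise distance at each stage and scale the second-stage threshold by the ratio. The helper names below are hypothetical, and whether linear scaling is appropriate for a given distance measure is untested.

```python
from itertools import combinations

def mean_pairwise_distance(vectors, distance):
    """Average of distance(a, b) over all unordered pairs (needs at least two vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def rescale_threshold(t, mean_stage1, mean_stage2):
    """Hypothetical heuristic: shrink or grow t in proportion to the change in mean distance."""
    return t * mean_stage2 / mean_stage1

# Using the figures quoted above: T1 = T2 = 0.983, mean distance 0.984 between
# original vectors and 0.235 between mapper centroids.
reducer_t = rescale_threshold(0.983, 0.984, 0.235)
print("suggested reducer threshold:", reducer_t)
```

Computing all pairwise distances is quadratic, so in practice the means would have to be estimated from a sample.
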
Thanks.

