Thanks for the careful analysis and well-written comment.
On Thu, Feb 10, 2011 at 1:42 AM, gabeweb <gabriel_webster@htc.com> wrote:
>
> Hi, I think there is a significant problem in the distributed canopy
> clusterer. I've been comparing the in-memory version to the distributed
> version (clustering users in the GroupLens database), and they behave
> completely differently. Firstly, different T1/T2 parameters are required
> to get the same number of clusters -- even when the data and similarity
> metric are exactly the same. Secondly, even when I have tuned the
> parameters to get the same number of clusters, the distribution of
> cluster sizes is very different -- in particular, using e.g. Tanimoto
> distance, if there are N clusters, the distributed version likes to
> create N-1 singleton clusters and put all the remaining vectors into the
> one remaining cluster.
>
> I have traced this to the fact that, given a single similarity metric,
> distances between sparse vectors tend to have a different range than
> distances between dense vectors. The distributed version first clusters
> the (sparse) original vectors in each mapper, and then takes the (dense)
> centroid vectors output by each mapper and applies the same canopy
> clustering with the same T1/T2 parameters. I confirmed this by using a
> single mapper and simply turning off the clustering in the reducer step
> (i.e., having the reducer just output the same centroids that are input
> to it); in this case, the clustering is fine -- somewhat obviously,
> perhaps, because this makes the distributed algorithm behave exactly
> like the in-memory version. Specifically, with Tanimoto distance and the
> reducer effectively turned off, the average distance between original
> vectors is 0.984, and with T1 = T2 = 0.983 on 10% of the GroupLens data,
> I get 24 clusters. Then if I turn on the reducer, I get only one
> cluster, because the average distance between the dense centroids output
> by the mapper drops to 0.235, so every centroid is now within T1 of
> every other centroid. If I want a similar number of clusters from the
> unmodified distributed version, I have to decrease T1/T2 to 0.939, which
> gives 23 clusters, but much less evenly distributed (the largest cluster
> now contains 6779 vectors, 97% of the input, as opposed to 2684 in the
> in-memory/turned-off-reducer version), because the mapper now generates
> many more clusters (257) as a trade-off for the T1/T2 being appropriate
> for the different similarity values of the reducer stage.
>
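Your diagnosis matches what I'd expect from the Tanimoto metric. As a quick illustration -- a self-contained Python sketch with made-up dimensions and sparsity, not Mahout code -- distances among sparse 0/1 vectors sit near 1.0, while the distance between their dense centroids is much smaller:

```python
import random

def tanimoto_distance(a, b):
    """Tanimoto distance: 1 - dot(a,b) / (|a|^2 + |b|^2 - dot(a,b))."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a)
    nb = sum(x * x for x in b)
    return 1.0 - dot / (na + nb - dot)

random.seed(0)
dim = 1000

# Sparse 0/1 "rating" vectors: each user rates ~2% of the items.
users = [[1.0 if random.random() < 0.02 else 0.0 for _ in range(dim)]
         for _ in range(200)]

# Dense centroids: component-wise means of two halves of the users.
def centroid(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]

c1 = centroid(users[:100])
c2 = centroid(users[100:])

# Average pairwise distance over the first 20 users (190 pairs).
avg_sparse = sum(tanimoto_distance(users[i], users[j])
                 for i in range(20) for j in range(i + 1, 20)) / 190

print("avg distance, sparse vectors:", round(avg_sparse, 3))              # typically close to 1.0
print("distance, dense centroids:  ", round(tanimoto_distance(c1, c2), 3))  # noticeably smaller
```

Sparse vectors rarely share nonzero components, so the dot product is near zero and the distance is near 1; the centroids share support almost everywhere, so their distance collapses -- the same effect as your 0.984 vs. 0.235 measurement.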
> Is this a known shortcoming of distributed canopy? Or am I missing
> something? It seems to me that for this to work, different T1/T2
> parameters would be needed for the mapper and reducer steps. That would
> be easy to program, but it would make tuning the parameters a lot
> harder -- unless there were some clever way to automatically adjust the
> parameters based on how sparse the vectors being clustered were.
>
> Thanks.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problem-in-distributed-canopy-clustering-tp2464896p2464896.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
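Per-stage thresholds would indeed be straightforward to prototype. Here's a rough sketch -- plain Python, not Mahout's actual implementation; the threshold values and data are made up -- of a two-stage canopy pass with tight T1/T2 for the sparse mapper input and looser T1/T2 for the dense centroids at the reducer stage:

```python
import random

def tanimoto(a, b):
    """Tanimoto distance, guarded against two all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(x * x for x in b) - dot
    return 1.0 if denom == 0 else 1.0 - dot / denom

def canopy(points, t1, t2, dist):
    """Single-pass canopy clustering (assumes t2 <= t1)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        keep = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)  # loosely bound: may join other canopies too
            if d >= t2:
                keep.append(p)     # not tightly bound: stays available as a future center
        remaining = keep
        canopies.append(members)
    return canopies

def centroid(members):
    return [sum(col) / len(members) for col in zip(*members)]

random.seed(1)
points = [[1.0 if random.random() < 0.02 else 0.0 for _ in range(500)]
          for _ in range(100)]

# Stage 1 ("mapper"): tight thresholds suited to sparse input vectors.
stage1 = canopy(points, t1=0.99, t2=0.985, dist=tanimoto)
centroids = [centroid(m) for m in stage1]

# Stage 2 ("reducer"): looser thresholds suited to dense centroids.
stage2 = canopy(centroids, t1=0.5, t2=0.4, dist=tanimoto)

print(len(points), "points ->", len(stage1), "mapper canopies ->",
      len(stage2), "final canopies")
```

The tuning problem you mention is real: the right reducer-stage thresholds depend on how dense the mapper centroids come out, which in turn depends on the mapper-stage thresholds. One hedge would be to sample pairwise distances among the centroids before the second pass and set the stage-2 T1/T2 from that empirical distribution rather than by hand.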
