mahout-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Problem in distributed canopy clustering
Date: Thu, 10 Feb 2011 17:49:43 GMT
Thanks for a careful analysis and well-written comment.

On Thu, Feb 10, 2011 at 1:42 AM, gabeweb <gabriel_webster@htc.com> wrote:

>
> Hi, I think there is a significant problem in the distributed canopy
> clusterer.  I've been comparing the in-memory version to the distributed
> version (clustering users in the GroupLens database), and they behave
> completely differently.  Firstly, different T1/T2 parameters are required
> to get the same number of clusters -- even when the data and similarity
> metric are exactly the same.  Secondly, even when I have tuned the
> parameters to get the same number of clusters, the distribution of cluster
> sizes is very different -- in particular, using e.g. Tanimoto distance, if
> there are N clusters, the distributed version likes to create N-1 singleton
> clusters and put all the remaining vectors into the remaining cluster.
>
> I have traced this to the fact that, given a single similarity metric,
> distances between sparse vectors tend to have a different range than
> distances between dense vectors.  The distributed clusterer first clusters
> the (sparse) original vectors in each mapper, and then it takes the (dense)
> centroid vectors output by each mapper and applies the same canopy
> clustering with the same T1/T2 parameters.  I confirmed this by using a
> single mapper and simply turning off the clustering in the reduce step
> (i.e., having the reducer just output the same centroids that are input to
> it); in this case the clustering is fine -- somewhat obviously, perhaps,
> because this makes the distributed algorithm behave exactly like the
> in-memory version.  Specifically, with Tanimoto distance and the reducer
> effectively turned off, the average distance between original vectors is
> 0.984, and with T1 = T2 = 0.983 on 10% of the GroupLens data I get 24
> clusters.  If I then turn on the reducer, I get only one cluster, because
> the average distance between the dense centroids output by the mapper
> drops to 0.235, so every centroid is now within T1 of every other
> centroid.  If I want a similar number of clusters from the unmodified
> distributed version, I have to decrease T1/T2 to 0.939, which gives 23
> clusters, but they are much less evenly distributed (the largest cluster
> now contains 6779 vectors, or 97% of the input vectors, as opposed to 2684
> in the in-memory/turned-off-reducer version), because the mapper now
> generates many more clusters (257) as a trade-off for the T1/T2 being
> appropriate for the different similarity values of the reducer stage.
>
> Is this a known shortcoming of distributed canopy?  Or am I missing
> something?  It seems to me that for this to work, different T1/T2
> parameters would be needed for the mapper and reducer steps.  That would be
> easy to program, but it would make tuning the parameters a lot harder --
> unless there were some clever way to automatically adjust the parameters
> based on how sparse the vectors being clustered were.
>
> Thanks.
>
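
The sparse-versus-dense effect described above is easy to reproduce outside
Mahout. Below is a minimal, self-contained Java sketch (plain arrays and
synthetic data rather than Mahout's Vector classes or the GroupLens set, so
the exact numbers will differ): Tanimoto distances between sparse 0/1 rating
vectors come out close to 1.0, while the same measure applied to the dense
centroids of those vectors is far smaller, which is why a T1/T2 tuned for the
mapper stage is much too loose for the reducer stage.

import java.util.Random;

/**
 * Minimal sketch (not Mahout code) of the effect described in this thread:
 * Tanimoto distances between sparse 0/1 vectors sit near 1.0, while the
 * same measure between dense centroids of those vectors is much smaller.
 */
public class TanimotoSparsityDemo {

  // Tanimoto distance = 1 - dot / (|a|^2 + |b|^2 - dot)
  static double tanimotoDistance(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (normA + normB - dot);
  }

  // A random binary "has rated" vector with roughly nonZeros entries set.
  static double[] randomSparseUser(Random rnd, int dims, int nonZeros) {
    double[] v = new double[dims];
    for (int k = 0; k < nonZeros; k++) {
      v[rnd.nextInt(dims)] = 1.0;
    }
    return v;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int dims = 1000;          // items
    int usersPerGroup = 200;  // users averaged into each centroid
    int ratingsPerUser = 20;  // ~2% density

    // Build two dense centroids, each the mean of many sparse user vectors.
    double[] centroidA = new double[dims];
    double[] centroidB = new double[dims];
    double[] sampleUserA = null;
    double[] sampleUserB = null;
    for (int u = 0; u < usersPerGroup; u++) {
      double[] userA = randomSparseUser(rnd, dims, ratingsPerUser);
      double[] userB = randomSparseUser(rnd, dims, ratingsPerUser);
      if (u == 0) {
        sampleUserA = userA;
        sampleUserB = userB;
      }
      for (int i = 0; i < dims; i++) {
        centroidA[i] += userA[i] / usersPerGroup;
        centroidB[i] += userB[i] / usersPerGroup;
      }
    }

    // Sparse-to-sparse distance is near 1.0; centroid-to-centroid is not.
    System.out.printf("sparse user vs. sparse user:       %.3f%n",
        tanimotoDistance(sampleUserA, sampleUserB));
    System.out.printf("dense centroid vs. dense centroid: %.3f%n",
        tanimotoDistance(centroidA, centroidB));
  }
}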

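As for adjusting the parameters automatically, one hypothetical approach
(nothing like this exists in Mahout today, and it is untested) would be to
keep the user-supplied T1/T2 for the mapper stage and derive reducer-stage
thresholds from the distances actually observed between the mapper-output
centroids, e.g. by rescaling with the ratio of the two average pairwise
distances:

import java.util.List;

/**
 * Hypothetical helper (not part of Mahout) sketching the "automatically
 * adjust the parameters" idea: estimate the distance range of the
 * mapper-output centroids, then rescale the mapper-stage T1/T2 for the
 * reducer stage.
 */
public class ReducerThresholdEstimator {

  /** Pluggable distance so the sketch stays independent of Mahout types. */
  interface Distance {
    double between(double[] a, double[] b);
  }

  /** Mean pairwise distance over a (small) sample of centroids. */
  static double meanPairwiseDistance(List<double[]> centroids, Distance d) {
    double sum = 0;
    int pairs = 0;
    for (int i = 0; i < centroids.size(); i++) {
      for (int j = i + 1; j < centroids.size(); j++) {
        sum += d.between(centroids.get(i), centroids.get(j));
        pairs++;
      }
    }
    return pairs == 0 ? 0.0 : sum / pairs;
  }

  /** Scale the mapper threshold by the ratio of the two distance ranges. */
  static double reducerThreshold(double mapperThreshold,
                                 double meanRawDistance,
                                 double meanCentroidDistance) {
    return mapperThreshold * (meanCentroidDistance / meanRawDistance);
  }

  public static void main(String[] args) {
    // Using the figures quoted above: mapper T1/T2 = 0.983, mean raw-vector
    // distance 0.984, mean centroid distance 0.235.
    System.out.printf("suggested reducer T1/T2: %.3f%n",
        reducerThreshold(0.983, 0.984, 0.235));
  }
}

Whether a simple linear rescaling is the right correction for Tanimoto
distances is an open question; the sketch is only meant to show where a
second, stage-specific pair of thresholds could be plugged in.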