mahout-user mailing list archives

From Robert Stewart
Subject Re: choosing appropriate t1,t2 for canopy clustering
Date Tue, 15 May 2012 15:36:57 GMT
Thanks Jeff.  I do see that cosine distance does return 0.0-1.0 now as expected.  Something
else was wrong in my initial run I guess.  
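
For readers following the thread, here is a minimal sketch of why the 0.0-1.0 range is expected. It assumes cosine distance is defined as 1 minus cosine similarity (believed to match what CosineDistanceMeasure computes, but that is an assumption here, not checked against the Mahout source). The function name and the zero-vector handling are illustrative only. With non-negative term weights the result stays in 0.0-1.0; it can only reach 2.0 when components are negative.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: a zero vector is treated as orthogonal.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (numerically close to) 0.
same_direction = cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared terms: distance is 1.
no_overlap = cosine_distance([1.0, 0.0], [0.0, 3.0])
# Opposite direction, only possible with negative components: distance is 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

If a run reports distances above 1.0, negative weights in the input vectors would be one thing to check.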

A different question about k-means:  I can successfully cluster using k-means, but what happens
is that some clusters contain very unrelated items, so it seems like there needs to be some distance
threshold when clustering documents with k-means (so that very dissimilar items just don't get
put into any cluster).  Is that possible with Mahout?  I don't see any type of threshold parameter
for k-means.
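
To make the question concrete, this is a rough sketch of the behaviour being asked for, written as a post-processing step over centroids that k-means has already produced. It is not a Mahout API: `assign_with_threshold`, `max_distance` and the sample data are all hypothetical names and values chosen for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); zero vectors count as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def assign_with_threshold(points, centroids, max_distance):
    """Assign each point to its nearest centroid, or None if even the
    nearest centroid is farther away than max_distance."""
    assignments = []
    for p in points:
        dists = [cosine_distance(p, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        assignments.append(best if dists[best] <= max_distance else None)
    return assignments

# Hypothetical centroids and documents (2-term vectors for readability).
centroids = [[1.0, 0.0], [0.0, 1.0]]
points = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
assignments = assign_with_threshold(points, centroids, max_distance=0.1)
```

The third point sits halfway between both centroids, so with a threshold of 0.1 it is left unassigned rather than forced into a cluster.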

On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:

> Hi Bob,
> Cosine distance will return distances in the range 0.0 to 1.0, as you suggest. While there is no
> absolutely foolproof technique for priming canopy T1 & T2 values, I recommend you begin
> by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you
> get too few clusters, decrease T1==T2 by half and try again. If you get too many, double it and try again.
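
The halve-or-double search described above can be sketched as follows. This is a simplified, single-machine illustration and not Mahout's implementation: `canopies`, `tune_threshold`, `target_k` and the sample vectors are hypothetical, and the canopy pass is the textbook version (points within T1 of a center join its canopy, points within T2 stop being candidates for new centers). The canopy count depends on input order and is not guaranteed to change monotonically with the threshold, so treat the search as a heuristic.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); zero vectors count as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def canopies(points, t1, t2):
    """Simplified canopy pass. Returns a list of (center, members)."""
    assert t1 >= t2, "T1 must be at least T2"
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = cosine_distance(center, p)
            if d < t1:
                members.append(p)          # loosely bound to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed a later canopy
        remaining = still_remaining
        result.append((center, members))
    return result

def tune_threshold(points, target_k, start=0.1, max_rounds=30):
    """Search for T (used as T1 == T2) that yields target_k canopies.
    Halves or doubles until the target is bracketed, then bisects.
    Returns the last (threshold, canopy_count) that was evaluated."""
    lo = hi = None  # lo gave too many canopies, hi gave too few
    t = start
    evaluated = (t, 0)
    for _ in range(max_rounds):
        k = len(canopies(points, t, t))
        evaluated = (t, k)
        if k == target_k:
            break
        if k < target_k:   # too few clusters: threshold is too large
            hi = t
            t = t / 2.0 if lo is None else (lo + hi) / 2.0
        else:              # too many clusters: threshold is too small
            lo = t
            t = t * 2.0 if hi is None else (lo + hi) / 2.0
    return evaluated

# Hypothetical data: three loose groups of 2-term document vectors.
docs = [[1.0, 0.05], [1.0, 0.1], [1.0, 1.05], [1.0, 0.95], [0.05, 1.0], [0.1, 1.0]]
threshold, count = tune_threshold(docs, target_k=3, start=0.8)
```

Starting from a deliberately large value (0.8 here) shows the halving branch; in practice each evaluation would be a full canopy run, so a small sample of the input is a cheaper place to do the search.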
> If you want to be more analytical, use the RandomSeedGenerator to sample from your input
> vectors and compute a starting point using their inter-cluster distances. You can also skip
> Canopy and use k-means with -k specified to sample from your input data and produce k clusters.
> That works pretty well with text and Cosine distance.
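
One possible reading of the "more analytical" route is to look at the spread of pairwise distances in a random sample before picking a starting threshold. The sketch below uses Python's `random.sample` as a stand-in for RandomSeedGenerator (it does not call Mahout), and `sample_distance_summary`, `sample_size` and `seed` are hypothetical names.

```python
import itertools
import math
import random
import statistics

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); zero vectors count as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def sample_distance_summary(vectors, sample_size, seed=0):
    """Sample vectors and summarise all pairwise cosine distances.
    Needs at least two vectors in the sample."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = sorted(
        cosine_distance(a, b) for a, b in itertools.combinations(sample, 2)
    )
    return {
        "min": dists[0],
        "median": statistics.median(dists),
        "max": dists[-1],
    }

# Hypothetical data: three loose groups of 2-term document vectors.
docs = [[1.0, 0.05], [1.0, 0.1], [1.0, 1.05], [1.0, 0.95], [0.05, 1.0], [0.1, 1.0]]
summary = sample_distance_summary(docs, sample_size=6)
```

A threshold somewhere between the smallest distances (likely same-cluster pairs) and the median would be a plausible first guess, to be refined by trial runs.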
> Once you arrive at a "reasonable" number of clusters, you can raise T1 to include
> more points in the centroid calculations, but that will not change the number of clusters.
> On 5/15/12 10:45 AM, Robert Stewart wrote:
>> I am trying to run canopy clustering on vectors extracted from a Lucene index.  I want
>> to use CosineDistanceMeasure.  How do I know what values are appropriate for the t1 and t2
>> distance thresholds?  I would assume that the Cosine distance measure would return "distances"
>> in a range from 0.0 to 1.0, but that seems not to be the case, so how do I know what the potential
>> distance ranges are in order to pick t1 and t2 (other than by a lot of trial and error)?
>> Thanks
>> Bob
