mahout-user mailing list archives

From Jeff Eastman
Subject Re: Clustering a large crawl
Date Wed, 30 May 2012 20:03:11 GMT
Have you tried much smaller values for t1=t2? Recall that the t-values 
specify the distance within which a new point is assigned to an existing 
canopy. In the limit as t -> 0, you should get n clusters, where n is 
the number of documents in your corpus.
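
To make the limit concrete, here is a minimal sketch of the idea in plain Python. It is not Mahout's implementation: it assumes t1 = t2 = t, so a point either falls within t of an existing canopy center or starts a new canopy. The names `cosine_distance`, `canopy_count`, and the toy `docs` vectors are hypothetical and exist only for illustration.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canopy_count(points, t):
    """Single pass with t1 == t2 == t (a simplification, not Mahout's code).

    A point within distance t of an existing center joins that canopy;
    otherwise the point becomes the center of a new canopy.
    """
    centers = []
    for p in points:
        if not any(cosine_distance(p, c) < t for c in centers):
            centers.append(p)
    return len(centers)

# Toy "documents" as term vectors (made-up data, not from the crawl).
docs = [
    (1.0, 0.0, 0.0),
    (0.9, 0.1, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.9, 0.1),
    (0.0, 0.0, 1.0),
]

# Sweep t from nearly zero to just above 1 and report the canopy count.
for t in (1e-9, 0.05, 0.5, 1.01):
    print(t, canopy_count(docs, t))
```

With these toy vectors, a t near 0 gives one canopy per document, and the count can only stay the same or fall as t grows, since a larger t lets each center absorb more points.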

On 5/30/12 1:23 PM, Pat Ferrel wrote:
> I have about 150,000 docs on which I ran canopy with values for t1 = 
> t2 from 0.1 to 0.95, using the Cosine distance measure. The results 
> ranged from 1.5 to 3 docs per cluster. In other words, canopy 
> produced a very large number of centroids, which does not seem to 
> represent the data very well. Trying random values for k seems to 
> produce better results, but they are still spotty and hard to judge. 
> I am at the point of giving up on canopy, so I wrote a utility that 
> simply iterates k over some values and runs the evaluators each time, 
> but there are currently some problems with CDbw (Inter-Cluster 
> Density is always 0.0, for instance).
> This seems like such a fundamental problem that others must have found 
> a way to get better results. Any suggestions?
