mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: Clustering a large crawl
Date Wed, 30 May 2012 20:26:41 GMT
The CosineDistanceMeasure returns 1 - dotProduct / denominator so it is 
returning the value you note. If the documents are very similar, then 
their distance will be small and t=0.1 could be too large to distinguish 
anything but the gross differences between the documents in the corpus. 
I'd try dropping the t-value until I get at least 50-100 clusters but I 
have no idea how small that might be.
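
For illustration, here is a minimal Python sketch of the formula described above (1 - dotProduct / denominator). It is not Mahout's code: the function name, the sample vectors, and the handling of an all-zero vector are assumptions made for this example only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - (a . b) / (|a| * |b|) for two equal-length dense vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if denominator == 0.0:
        # Assumption for this sketch only: an all-zero vector is treated
        # as maximally distant. Mahout may handle this case differently.
        return 1.0
    return 1.0 - dot_product / denominator

# Hypothetical term-weight vectors, not taken from the corpus in this thread.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, so distance is ~0
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a, so distance is 1

print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

With non-negative term weights the distance falls between 0 and 1, so when most documents share vocabulary, most pairwise distances sit near the low end of that range.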

On 5/30/12 4:11 PM, Robert Stewart wrote:
> That is a good point. t1/t2 are distance measures but cosine is a
> similarity measure, so you need to think of it as 1-cosine.
> On May 30, 2012, at 4:03 PM, Jeff Eastman wrote:
>> Have you tried much smaller values for t1=t2? Recall that the t-values
>> specify the distance within which a new point is assigned to an
>> existing canopy. In the limit as t -> 0, you should get n clusters,
>> where n is the number of documents in your corpus.
>> On 5/30/12 1:23 PM, Pat Ferrel wrote:
>>> I have about 150,000 docs on which I ran canopy with values for
>>> t1 = t2 from 0.1 to 0.95 using the Cosine distance measure. I got
>>> results that range from 1.5 docs per cluster to 3. In other words
>>> canopy produced a very large number of centroids, which does not
>>> seem to represent the data very well. Trying random values for k
>>> seems to produce better results but still spotty and hard to judge.
>>> I am at the point of giving up on canopy and so wrote a utility to
>>> simply iterate k over some values and run the evaluators each time,
>>> but there are currently some problems with CDbw (Inter-Cluster
>>> Density is always 0.0 for instance).
>>> This seems like such a fundamental problem that others must have
>>> found a way to get better results. Any suggestions?
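
The relationship between the t-value and the number of canopies quoted above can be sketched in a few lines of Python. This is a simplified single-pass version with t1 = t2 = t, not Mahout's canopy implementation; the function names and the sample points are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - (a . b) / (|a| * |b|); assumes neither vector is all zeros."""
    dot_product = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot_product / denominator

def canopy_count(points, t, distance=cosine_distance):
    """Count canopies from one pass over the points with t1 = t2 = t.

    A point closer than t to an existing canopy center joins that canopy;
    otherwise it becomes the center of a new canopy.
    """
    centers = []
    for point in points:
        if not any(distance(point, center) < t for center in centers):
            centers.append(point)
    return len(centers)

# Hypothetical 2-d vectors: two tight pairs plus one point between them.
points = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]]

for t in (0.5, 0.1, 0.001):
    print(t, canopy_count(points, t))
```

On this toy data the canopy count rises as t shrinks and reaches the number of points once t is smaller than every pairwise distance, which is the limiting behaviour described in the quoted message.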
