And I misconstrued your earlier remarks on cluster size vs number of
clusters. As t -> 1 you will get fewer and fewer canopies, as you have
observed. It actually doesn't seem like the cosine distance measure is
working very well for you.
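
To make the relationship between t and the canopy count concrete, here is a
minimal standalone sketch. It is not Mahout's implementation: the class and
method names (CanopySketch, cosineDistance, countCanopies) are made up for
illustration, it assumes t1 = t2 = t, it keeps each canopy center fixed at
the first point that created it, and it treats zero-length vectors as
distance 1 purely for convenience.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Mahout classes discussed in this thread.
class CanopySketch {

    // Cosine distance = 1 - cos(angle): 0 means same direction, 1 means orthogonal.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, lengthSquaredA = 0.0, lengthSquaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            lengthSquaredA += a[i] * a[i];
            lengthSquaredB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lengthSquaredA) * Math.sqrt(lengthSquaredB);
        // Guard against floating-point rounding pushing the cosine above 1.
        if (denominator < dot) {
            denominator = dot;
        }
        // Assumption of this sketch: a zero-length vector is "far" from everything.
        if (denominator == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / denominator;
    }

    // Single pass with one threshold: a point starts a new canopy only if it is
    // not within t of any existing canopy center.
    static int countCanopies(List<double[]> points, double t) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean covered = false;
            for (double[] c : centers) {
                if (cosineDistance(p, c) < t) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers.size();
    }

    // Four toy 2-d "documents" used by the demo below.
    static List<double[]> samplePoints() {
        return Arrays.asList(
            new double[] {1.0, 0.0},
            new double[] {1.0, 0.1},
            new double[] {1.0, 1.0},
            new double[] {0.0, 1.0});
    }

    public static void main(String[] args) {
        for (double t : new double[] {0.001, 0.1, 0.5, 1.1}) {
            System.out.println("t = " + t + " => " + countCanopies(samplePoints(), t) + " canopies");
        }
    }
}
```

In a pass like this the count can only shrink as t grows, and every point is
compared against every existing center, so a run that ends with tens of
thousands of canopies performs a very large number of distance computations.
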
Have you mentioned the size of your dictionary earlier? Perhaps
increasing the number of stop words that are rejected will decrease the
vector size and make clustering work better. This seems like the curse
of dimensionality at work.
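
As a rough illustration of the stop-word point: the vector dimension equals
the number of distinct terms that survive filtering. The sketch below uses a
made-up class (StopWordSketch), naive whitespace tokenization, and a toy
stop list; none of this reflects how your vectors were actually built.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; real pipelines use a proper analyzer.
class StopWordSketch {

    // Collects the distinct lower-cased terms that are not in the stop list.
    // The size of the returned set is the dimension of the term vectors.
    static Set<String> dictionary(List<String> docs, Set<String> stopWords) {
        Set<String> terms = new TreeSet<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    terms.add(token);
                }
            }
        }
        return terms;
    }

    // Two toy documents used by the demo below.
    static List<String> sampleDocs() {
        return Arrays.asList(
            "the market fell on the news",
            "the team won the game on sunday");
    }

    public static void main(String[] args) {
        Set<String> none = new HashSet<>();
        Set<String> stops = new HashSet<>(Arrays.asList("the", "on"));
        System.out.println("no stop words:   " + dictionary(sampleDocs(), none).size() + " terms");
        System.out.println("with stop words: " + dictionary(sampleDocs(), stops).size() + " terms");
    }
}
```
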
On 5/31/12 11:18 AM, Pat Ferrel wrote:
> Oops, misspoke: 0 good, 1 bad for clustering, at least.
> For similarity, 1 good, 0 bad.
>
> One is a similarity value and the other a distance measure.
>
> But the primary question is how to get better canopies. I would expect
> that as the distance t gets small the number of canopies gets large
> which is what I see in the data below. Jeff suggests I try much
> smaller t to get fewer canopies and I will, though I don't understand
> the logic. The docs are not all that similar, being from a general
> news crawl.
>
> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
> docs I get:
> t1 = t2 = 0.3 => 123094 canopies
> t1 = t2 = 0.6 => 97035 canopies
> t1 = t2 = 0.9 => 60160 canopies
>
> Obviously none of these values for t is very useful and it looks like
> I need to make t even larger, which would seem to indicate very
> loose/non-dense canopies, no? For very large ts are the canopies useful?
>
> I'm trying both but the other odd thing is that it takes longer to run
> canopy on this data than to run kmeans, a lot longer.
>
> On 5/31/12 12:44 AM, Sean Owen wrote:
>> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel<pat@occamsmachete.com>
>> wrote:
>>
>>> I see
>>> double denominator = Math.sqrt(lengthSquaredp1) *
>>>     Math.sqrt(lengthSquaredp2);
>>> // correct for floating-point rounding errors
>>> if (denominator < dotProduct) {
>>>   denominator = dotProduct;
>>> }
>>> return 1.0 - dotProduct / denominator;
>>>
>>> So this is going to return 1 - cosine, right? So for clustering the
>>> distance 1 = very close, 0 = very far.
>>>
>>>
>> When two vectors are close, the angle between them is small, so the
>> cosine is large, near 1. The distance is 1 - cosine, so 0 = close,
>> 1 = far, as expected.
>>
>
>
