Are you saying that
1. you threw out all but the top 1000 terms per document by weight? or
2. your dictionary has only 1000 terms in it and you threw all others
away?
The latter is a simple dimensionality reduction trick to try, but 1000 seems
low to me for the entire dictionary.
A question for you about similarity. I wonder if using all terms is
better for the similarity measure? What is noise in clustering may be
important when looking at cooccurrences. What do you think?
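
To make the two readings above concrete, here is a minimal sketch of each.
It is only an illustration: the class name, method names, and the
Map<String, Double> representation of a tfidf vector are my own
assumptions, not anything from Mahout.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Mahout: a document is modeled as a
// map from term to tfidf weight.
public class TopTermsSketch {

    // Reading 1: keep only the n heaviest terms of one document.
    // Different documents may keep different terms, so the overall
    // dictionary can stay large.
    static Map<String, Double> topNPerDoc(Map<String, Double> doc, int n) {
        return doc.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Reading 2: keep only terms found in one fixed, corpus-wide
    // dictionary, so every vector has at most dictionary.size() dimensions.
    static Map<String, Double> restrictToDictionary(Map<String, Double> doc,
                                                    Set<String> dictionary) {
        return doc.entrySet().stream()
            .filter(e -> dictionary.contains(e.getKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // Made-up weights, purely for illustration.
        Map<String, Double> doc = Map.of(
            "election", 4.2, "senate", 3.1, "tuesday", 0.9, "said", 0.4);
        System.out.println(topNPerDoc(doc, 2).keySet());
        System.out.println(
            restrictToDictionary(doc, Set.of("senate", "tuesday", "budget")).keySet());
    }
}
```

With reading 1 the per-document vectors get sparser but the dictionary
can stay large; with reading 2 the dimensionality itself is capped.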
On 5/31/12 4:20 PM, Suneel Marthi wrote:
> Pat,
>
> We have been trying to do something very similar to what you are trying to
> accomplish and we ended up with better clusters by considering only
> the top 1000 terms (by tfidf weight) per doc and using Tanimoto
> distance.
>
> Definitely give dimensionality reduction a try and let us know how it
> works out.
>
> 
> *From:* Pat Ferrel <pat@occamsmachete.com>
> *To:* user@mahout.apache.org
> *Sent:* Thursday, May 31, 2012 6:42 PM
> *Subject:* Re: Clustering a large crawl
>
> Yeah, that's the conclusion I was coming to but thought I'd ask the
> experts. My dictionary is pretty big. The last time I looked it was
> over 100,000 terms even with ngrams, lucene stop words, no numbers,
> and stemming. I've tried Tanimoto too with similar results.
>
> Dimensional reduction seems like the next thing to try.
>
> Pat
>
>
> Further data from 150,000 docs. Using Canopy clustering I get these values
> t1 = t2 = 0.3 => 123094 canopies
> t1 = t2 = 0.6 => 97035 canopies
> t1 = t2 = 0.9 => 60160 canopies
> t1 = t2 = 0.91 => 59491 canopies
> t1 = t2 = 0.93 => 58526 canopies
> t1 = t2 = 0.95 => 57854 canopies
> t1 = t2 = 0.97 => 57244 canopies
> t1 = t2 = 0.99 => 56241 canopies
>
>
>
> On 5/31/12 2:31 PM, Jeff Eastman wrote:
>> And I misconstrued your earlier remarks on cluster size vs number of
>> clusters. As t -> 1 you will get fewer and fewer canopies as you have
>> observed. It actually doesn't seem like the cosine distance measure
>> is working very well for you.
>>
>> Have you mentioned the size of your dictionary earlier? Perhaps
>> increasing the number of stop words that are rejected will decrease
>> the vector size and make clustering work better. This seems like the
>> curse of dimensionality at work.
>>
>> On 5/31/12 11:18 AM, Pat Ferrel wrote:
>>> Oops, misspoke. 0 good, 1 bad for clustering at least
>>> For similarity 1 good 0 bad.
>>>
>>> One is a similarity value and the other a distance measure.
>>>
>>> But the primary question is how to get better canopies. I would
>>> expect that as the distance t gets small the number of canopies gets
>>> large which is what I see in the data below. Jeff suggests I try
>>> much smaller t to get fewer canopies and I will, though I don't
>>> understand the logic. The docs are not all that similar, being from
>>> a general news crawl.
>>>
>>> When using the CosineDistanceMeasure in Canopy on a corpus of
>>> 150,000 docs I get:
>>> t1 = t2 = 0.3 => 123094 canopies
>>> t1 = t2 = 0.6 => 97035 canopies
>>> t1 = t2 = 0.9 => 60160 canopies
>>>
>>> Obviously none of these values for t is very useful and it looks
>>> like I need to make t even larger, which would seem to indicate very
>>> loose/non-dense canopies, no? For very large t, are the canopies
>>> useful?
>>>
>>> I'm trying both but the other odd thing is that it takes longer to
>>> run canopy on this data than to run kmeans, a lot longer.
>>>
>>> On 5/31/12 12:44 AM, Sean Owen wrote:
>>>> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel<pat@occamsmachete.com>
>>>> <mailto:pat@occamsmachete.com> wrote:
>>>>
>>>>> I see
>>>>> double denominator = Math.sqrt(lengthSquaredp1) *
>>>>> Math.sqrt(lengthSquaredp2);
>>>>> // correct for floating-point rounding errors
>>>>> if (denominator < dotProduct) {
>>>>>   denominator = dotProduct;
>>>>> }
>>>>> return 1.0 - dotProduct / denominator;
>>>>>
>>>>> So this is going to return 1 - cosine, right? So for clustering the
>>>>> distance 1 = very close, 0 = very far.
>>>>>
>>>>>
>>>> When two vectors are close, the angle between them is small, so the
>>>> cosine
>>>> is large, near 1. 0 = close, 1 = far, as expected.
>>>>
>>>
>>>
>>
>
>
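
For reference, the cosine distance snippet quoted earlier in the thread
can be made self-contained roughly as follows. This is a sketch over
plain double[] arrays, not the Mahout class itself: the class and method
names are hypothetical, and the zero-vector handling is my own
assumption. It shows the convention Sean describes, where 0 means close
and 1 means far.

```java
// Hypothetical stand-alone variant of the quoted snippet; it assumes
// both arrays have the same length.
public class CosineDistanceSketch {

    static double distance(double[] p1, double[] p2) {
        double dotProduct = 0.0;
        double lengthSquaredp1 = 0.0;
        double lengthSquaredp2 = 0.0;
        for (int i = 0; i < p1.length; i++) {
            dotProduct += p1[i] * p2[i];
            lengthSquaredp1 += p1[i] * p1[i];
            lengthSquaredp2 += p2[i] * p2[i];
        }
        double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
        // correct for floating-point rounding errors
        if (denominator < dotProduct) {
            denominator = dotProduct;
        }
        // Assumption of this sketch: an all-zero vector is treated as
        // maximally far, to avoid dividing by zero.
        if (denominator == 0.0) {
            return 1.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    public static void main(String[] args) {
        // Same direction: distance is 0 up to rounding (close).
        System.out.println(distance(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        // Orthogonal, i.e. no shared terms: distance is 1 (far).
        System.out.println(distance(new double[] {1, 0}, new double[] {0, 1}));
    }
}
```

For non-negative tfidf vectors the result stays in [0, 1], which is why
the number of canopies keeps dropping as t approaches 1.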
