True, and the Lucene stop word list is small. I assume Lucene wants to
stay able to search on words that matter only within a search phrase,
not to TFIDF, so it keeps the list short. I got a big list of stop
words from another source, but as you say it only helps a little.
In the same vein as Hapax_legomenon...
On a previous project we used a separate IDF calculation for each
domain. The domain-specific IDF weighted frequent terms specific to the
site lower, reducing the impact of meaningless catchphrases or
site-specific jargon that did not apply globally. There does not seem to
be an easy way to do this with Mahout. There were some issues with doing
domain-specific IDF anyway, like throwing out important words when they
weren't meaningless catchphrases.
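
To make the per-domain idea concrete, here is a minimal sketch in plain Java. It is not Mahout code and not what we actually ran on that project: the class name DomainIdf, the method compute, and the unsmoothed formula ln(N / df) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout: builds one IDF table per domain.
final class DomainIdf {

  // docsByDomain maps a domain name to its documents; each document is a
  // list of tokens. Returns domain -> (term -> idf), where idf is the
  // unsmoothed ln(docs in domain / docs in domain containing the term).
  static Map<String, Map<String, Double>> compute(
      Map<String, List<List<String>>> docsByDomain) {
    Map<String, Map<String, Double>> idfByDomain = new HashMap<>();
    for (Map.Entry<String, List<List<String>>> e : docsByDomain.entrySet()) {
      List<List<String>> docs = e.getValue();

      // Document frequency, counted inside this domain only.
      Map<String, Integer> df = new HashMap<>();
      for (List<String> doc : docs) {
        Set<String> unique = new HashSet<>(doc);
        for (String term : unique) {
          df.merge(term, 1, Integer::sum);
        }
      }

      Map<String, Double> idf = new HashMap<>();
      for (Map.Entry<String, Integer> t : df.entrySet()) {
        idf.put(t.getKey(), Math.log((double) docs.size() / t.getValue()));
      }
      idfByDomain.put(e.getKey(), idf);
    }
    return idfByDomain;
  }
}
```

A term that shows up in every page of a site gets an IDF of 0 for that site, which is the catchphrase damping described above. It is also the failure mode: a genuinely important word that a site happens to use on every page gets zeroed out the same way.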
Lance, yes, I understand that ngrams give more terms; bad phrasing on
my part. I have a very high minimum LLR (mll) though, so it shouldn't
increase the terms by very much.
After looking again at the dictionary for 150,000 web pages I have
259,000 dimensions! Part of the problem is I can't get Tika to detect
language very well (working on this), so I get groups of non-English
pages that throw in quite a few new terms. Overall I think some form of
dimensional reduction is called for, no?
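
For reference, the hapax pruning Ted suggests in the quoted message below could look roughly like this. It is a sketch in plain Java, not Mahout's own code: the class name HapaxFilter, the method vocabulary, and the choice to count total occurrences rather than document frequency are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: drops rare terms from a vocabulary.
final class HapaxFilter {

  // Returns the terms whose total count across all docs is at least minCount.
  // With minCount = 2 this removes exactly the hapax legomena.
  static SortedSet<String> vocabulary(List<List<String>> docs, int minCount) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> doc : docs) {
      for (String term : doc) {
        counts.merge(term, 1, Integer::sum);
      }
    }

    SortedSet<String> kept = new TreeSet<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minCount) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }
}
```

Raising minCount above 2 would probably also drop most of the stray terms from the misdetected non-English pages, at the cost of losing rare but legitimate words.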
On 6/3/12 12:34 AM, Ted Dunning wrote:
> Also, removing stop words usually only makes it minutely smaller.
>
> What really makes a vocabulary smaller is eliminating hapax
> <http://en.wikipedia.org/wiki/Hapax_legomenon>.
>
> On Sun, Jun 3, 2012 at 9:23 AM, Lance Norskog<goksron@gmail.com> wrote:
>
>> "100,000 terms even with ngrams, "...
>>
>> Ummmm... Ngrams will make it bigger, not smaller :)
>>
>> I haven't studied the text workflows lately. Is there a place where you get
>> counts for all words? If so, you can just pick the smallest N counts and
>> make a stopword list out of them. This would be a highly valued addition to
>> the workflows.
>>
>> On Fri, Jun 1, 2012 at 9:36 AM, Ted Dunning<ted.dunning@gmail.com> wrote:
>>
>>> I am pretty sure that Suneel meant keep the top 1000 terms per document.
>>>
>>>
>>> On Fri, Jun 1, 2012 at 2:21 AM, Pat Ferrel<pat@occamsmachete.com> wrote:
>>>> Are you saying that
>>>> 1. you threw out all but the top 1000 terms per document by weight? or
>>>> 2. your dictionary has only 1000 terms in it and you threw all others
>>>> away?
>>>>
>>>> The latter is a simple dimensional reduction trick to try, but 1000 seems
>>>> low to me for the entire dictionary.
>>>>
>>>> A question for you about similarity. I wonder if using all terms is
>>>> better for the similarity measure? What is noise in clustering may be
>>>> important when looking at cooccurrences. What do you think?
>>>>
>>>>
>>>> On 5/31/12 4:20 PM, Suneel Marthi wrote:
>>>>
>>>> Pat,
>>>>
>>>> We have been trying to do something very similar to what you are trying to
>>>> accomplish, and we ended up with better clusters by considering only the top
>>>> 1000 terms (by tfidf weight) per doc and using Tanimoto distance.
>>>>
>>>> Definitely give dimensionality reduction a try and let us know how it
>>>> works out.
>>>>
>>>> 
>>>> *From:* Pat Ferrel<pat@occamsmachete.com>
>>>> *To:* user@mahout.apache.org
>>>> *Sent:* Thursday, May 31, 2012 6:42 PM
>>>> *Subject:* Re: Clustering a large crawl
>>>>
>>>> Yeah, that's the conclusion I was coming to but thought I'd ask the
>>>> experts. My dictionary is pretty big. The last time I looked it was over
>>>> 100,000 terms even with ngrams, Lucene stop words, no numbers, and
>>>> stemming. I've tried Tanimoto too with similar results.
>>>>
>>>> Dimensional reduction seems like the next thing to try.
>>>>
>>>> Pat
>>>>
>>>>
>>>> Further data from 150,000 docs. Using Canopy clustering I get these values:
>>>> t1 = t2 = 0.3 => 123094 canopies
>>>> t1 = t2 = 0.6 => 97035 canopies
>>>> t1 = t2 = 0.9 => 60160 canopies
>>>> t1 = t2 = 0.91 => 59491 canopies
>>>> t1 = t2 = 0.93 => 58526 canopies
>>>> t1 = t2 = 0.95 => 57854 canopies
>>>> t1 = t2 = 0.97 => 57244 canopies
>>>> t1 = t2 = 0.99 => 56241 canopies
>>>>
>>>>
>>>>
>>>> On 5/31/12 2:31 PM, Jeff Eastman wrote:
>>>>
>>>> And I misconstrued your earlier remarks on cluster size vs number of
>>>> clusters. As t -> 1 you will get fewer and fewer canopies, as you have
>>>> observed. It actually doesn't seem like the cosine distance measure is
>>>> working very well for you.
>>>>
>>>> Have you mentioned the size of your dictionary earlier? Perhaps
>>>> increasing the number of stop words that are rejected will decrease the
>>>> vector size and make clustering work better. This seems like the curse of
>>>> dimensionality at work.
>>>>
>>>> On 5/31/12 11:18 AM, Pat Ferrel wrote:
>>>>
>>>> Oops, misspoke. 0 good, 1 bad for clustering at least
>>>> For similarity 1 good 0 bad.
>>>>
>>>> One is a similarity value and the other a distance measure.
>>>>
>>>> But the primary question is how to get better canopies. I would expect
>>>> that as the distance t gets small the number of canopies gets large, which
>>>> is what I see in the data below. Jeff suggests I try much smaller t to get
>>>> fewer canopies, and I will, though I don't understand the logic. The docs are
>>>> not all that similar, being from a general news crawl.
>>>>
>>>> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
>>>> docs I get:
>>>> t1 = t2 = 0.3 => 123094 canopies
>>>> t1 = t2 = 0.6 => 97035 canopies
>>>> t1 = t2 = 0.9 => 60160 canopies
>>>>
>>>> Obviously none of these values for t is very useful and it looks like I
>>>> need to make t even larger, which would seem to indicate very
>>>> loose/non-dense canopies, no? For very large values of t, are the
>>>> canopies useful?
>>>>
>>>> I'm trying both but the other odd thing is that it takes longer to run
>>>> canopy on this data than to run kmeans, a lot longer.
>>>>
>>>> On 5/31/12 12:44 AM, Sean Owen wrote:
>>>>
>>>> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel<pat@occamsmachete.com> wrote:
>>>>
>>>> I see
>>>>   double denominator = Math.sqrt(lengthSquaredp1) *
>>>>       Math.sqrt(lengthSquaredp2);
>>>>   // correct for floating-point rounding errors
>>>>   if (denominator < dotProduct) {
>>>>     denominator = dotProduct;
>>>>   }
>>>>   return 1.0 - dotProduct / denominator;
>>>>
>>>> So this is going to return 1 - cosine, right? So for clustering the
>>>> distance 1 = very close, 0 = very far.
>>>>
>>>>
>>>> When two vectors are close, the angle between them is small, so the cosine
>>>> is large, near 1. 0 = close, 1 = far, as expected.
>>>>
>>
>> 
>> Lance Norskog
>> goksron@gmail.com
>>
