lucene-dev mailing list archives

From: Marvin Humphrey
Subject: Re: Returning a minimum number of clusters
Date: Wed, 03 May 2006 19:49:29 GMT

On May 2, 2006, at 1:55 PM, wrote:

> This is an issue of scaling the different dimensions.
>> It is more expensive to calculate similarity based on the entire
>> document's contents rather than just a snippet chosen by the
>> Highlighter.  However, it's presumably more accurate, and having
>> the Term Vectors pre-built at index time should help quite a bit.
> This varies, actually, depending on the document.  If
> you grab HTML from a portal, and use it all, pages from
> that portal will tend to cluster together.  If you just
> use snippets of text around document passages that
> match your query, you can actually get more accurate clustering
> relative to your query.  It really depends if the documents are
> single-topic and coherent.  If so, use them all; if not,
> use snippets.  [You can see this problem leading the
> Google news classifier astray on occasion.]

That's both helpful and deflating.  :\  I can imagine that if you  
used the complete document vector from an HTML document that included
navigation text, the navigation text would drive the clustering.
That navigation text, which cannot practically be expunged at  
spidering/indexing time if you are naive about the document  
structure, is unlikely to show up in a snippet.
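
To make the navigation-text effect concrete, here is a minimal sketch using
plain bag-of-words cosine similarity.  The NAV string, the two pages, and the
two snippets are all made up for illustration; nothing here uses Lucene's
Term Vectors or the Highlighter.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vec(text):
    """Whitespace-tokenized, lowercased term counts (no stemming, no stoplist)."""
    return Counter(text.lower().split())

# Hypothetical navigation text shared by every page on one portal.
NAV = "home news sports weather mail login help"

# Two pages on unrelated topics, and the snippets a highlighter might pick.
snippet_a = "lucene term vectors speed up clustering"
snippet_b = "recipe for slow roasted tomato soup"
page_a = NAV + " " + snippet_a
page_b = NAV + " " + snippet_b

full_sim = cosine(vec(page_a), vec(page_b))            # dominated by NAV
snippet_sim = cosine(vec(snippet_a), vec(snippet_b))   # no shared terms

print(f"full documents: {full_sim:.3f}")
print(f"snippets:       {snippet_sim:.3f}")
```

With these toy inputs the full pages look fairly similar purely because of
the shared navigation terms, while the snippets share nothing.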

> A typical way to approximate is by only taking high TF/IDF
> terms.

Another strike against using the existing Term Vectors, as you'd have
to look up every term in the term dictionary.  A stoplist could narrow
things down some, but it would have to be applied at index-time if  
the terms were stemmed.
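
A rough sketch of the "keep only high TF/IDF terms" approximation follows.
The function name, the tf * log(N/df) weighting, and the toy documents are
assumptions for illustration, not anything from Lucene.  The df counter
stands in for the per-term lookups in the term dictionary mentioned above.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, k):
    """For each tokenized document, return its k highest tf * idf terms.

    idf is log(N / df); ties are broken alphabetically so the output is
    deterministic.  This is one common weighting, chosen for simplicity.
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        result.append(ranked[:k])
    return result

# Toy corpus: "home" and "news" play the role of portal navigation text.
docs = [
    ["home", "news", "lucene", "lucene", "index"],
    ["home", "news", "soup", "soup", "tomato"],
    ["home", "news", "lucene", "query"],
]
top_terms = top_tfidf_terms(docs, 2)
print(top_terms)
```

Terms that occur in every document get an idf of zero here, so they fall to
the bottom of each ranking; that only helps when the boilerplate really is
present across the whole collection being weighted.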

> Principal component methods are also popular (e.g.
> latent semantic indexing) to reduce dimensionality (usually
> with a least-squares fit criterion).

I imagine that reducing dimensionality isn't necessary if you're  
using only snippets.  And if you were to pre-compute LSI or similar  
at index-time, wouldn't you run into the same problems if your docs  
aren't single-topic and coherent to begin with?

Marvin Humphrey
Rectangular Research
