lucene-dev mailing list archives

Subject Re: Returning a minimum number of clusters
Date Tue, 02 May 2006 20:55:06 GMT
Marvin Humphrey wrote:

>> BTW, clustering in Information Retrieval usually implies grouping by
>> vector distance using statistical methods:

In general, all you need is objects with
a pairwise similarity (dissimilarity) measure.
With (term) vectors, that's usually one of
the multitude of TF/IDF cosine measures, whereas
in other machine learning apps it's typically
Euclidean distance (often z-score normalized to
scale the dimensions).
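
A minimal sketch of the two measures just mentioned, for illustration only.
The idf formula (log(n / df)) is one variant among the many, and the helper
names are made up for this example; none of this is Lucene API.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized doc to a sparse {term: tf * idf} dict."""
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(n / df). Many others are in use.
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def zscore_columns(rows):
    """Rescale each dimension of dense rows to zero mean, unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Without the z-score step, a dimension measured in large units would
dominate the Euclidean distance; after it, every dimension contributes
on the same scale.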

For the more sophisticated clustering algorithms,
like EM (soft/model-based) clustering, you can
use similarities between clusters (instead of
deriving these from similarities between items).

> Exactly.  I'd scanned this, but I haven't yet familiarized myself with
> the different models.
> It may be possible for both keyword fields e.g. "host" and non-keyword
> fields e.g. "content" to be clustered using the same algorithm and an
> interface like Hits.cluster(String fieldname, int docsPerCluster).
> Retrieve each hit's vector for the specified field, and map the docs
> into a unified term space, then cluster.  For "host" or any other
> keyword field, the boundaries will be stark and the cost of calculation
> negligible.  For "content", a more sophisticated model will be required
> to group the docs and the cost will be greater.

This is an issue of scaling the different dimensions.
You can "boost" the dimensions any way you want, just
as in other vector-based search operations.
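
As a rough sketch of what boosting dimensions could look like: scale each
field's term weights by a per-field factor before comparing.  The function
names, the field names and the boost values are assumptions made up for
this example, not an existing Lucene interface.

```python
import math

def boosted_vector(field_vectors, boosts):
    """Flatten {field: {term: weight}} into one sparse vector.

    Each dimension is keyed by (field, term), so "host" and "content"
    terms share one space without colliding, and each field's weights
    are scaled by its boost (default 1.0).
    """
    combined = {}
    for field, vec in field_vectors.items():
        boost = boosts.get(field, 1.0)
        for term, weight in vec.items():
            combined[(field, term)] = boost * weight
    return combined

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two hypothetical docs: same host, unrelated content.
doc_a = {"host": {"example.com": 1.0},
         "content": {"lucene": 1.0, "search": 1.0}}
doc_b = {"host": {"example.com": 1.0},
         "content": {"cooking": 1.0, "recipes": 1.0}}

def similarity(host_boost):
    boosts = {"host": host_boost}
    return cosine(boosted_vector(doc_a, boosts),
                  boosted_vector(doc_b, boosts))
```

Raising the "host" boost pulls the two docs together even though their
content differs; a boost of zero removes the field from the comparison.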

> It is more expensive to calculate similarity based on the entire
> document's contents rather than just a snippet chosen by the
> Highlighter.  However, it's presumably more accurate, and having the
> Term Vectors pre-built at index time should help quite a bit.

This varies, actually, depending on the document.  If
you grab HTML from a portal, and use it all, pages from
that portal will tend to cluster together.  If you just
use snippets of text around document passages that
match your query, you can actually get more accurate
clustering relative to your query.  It really depends on
whether the documents are single-topic and coherent.  If
so, use them all; if not, use snippets.  [You can see this
problem leading the Google news classifier astray on
occasion.]

A typical way to approximate the full-document comparison
is to keep only the high TF/IDF terms.  Principal component
methods (e.g. latent semantic indexing) are also popular for
reducing dimensionality, usually with a least-squares fit
criterion.
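
The first of these, keeping only the high TF/IDF terms, might be sketched
as below.  The function name and the cutoff k are assumptions for the
example; the principal component route is not shown.

```python
import heapq

def top_k_terms(vec, k):
    """Keep only the k highest-weighted terms of a sparse {term: weight} vector."""
    return dict(heapq.nlargest(k, vec.items(), key=lambda item: item[1]))

# Hypothetical TF/IDF weights for one document.
doc_vector = {"the": 0.01, "lucene": 2.3, "cluster": 1.7, "of": 0.02}
truncated = top_k_terms(doc_vector, 2)
```

Any similarity computed on the truncated vectors touches at most k terms
per document, at the price of ignoring the low-weight tail.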

A more extreme way to approximate is with signature
files (e.g. to do web-scale "more documents like this"),
but Lucene's not going to help you there.  Check out
"Managing Gigabytes" for more on this approach.

- Bob Carpenter
