lucene-java-user mailing list archives

From Andrew Boyd <>
Subject Re: Clustering Carrot2 vs TermVector Analysis
Date Wed, 01 Jun 2005 14:27:38 GMT
Responses inline prefixed with ****

-----Original Message-----
From: Dawid Weiss <>
Sent: Jun 1, 2005 3:24 AM
Subject: Re: Clustering Carrot2 vs TermVector Analysis

Hi Andrew,

Coming up with an answer... sorry for the delay.

> By using the carrot demo: 
> I was able to easily cluster search results based on the fields used
> by Carrot (url, title, and summary). However, I was wondering if there
> was a way to do something similar using term vector analysis and the
> built-in TermVector / Similarity API.

Yes, most clustering methods are based just on that (term-vector
matrix). Carrot also uses this internally, but builds its own data
structure from the provided data instead of relying on Lucene's. It
shouldn't be a problem to write a clustering plugin to Carrot that
actually uses the term-vector data from Lucene.
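
As a rough illustration of what "based on the term-vector matrix" can mean in practice, here is a minimal sketch of cosine similarity between two sparse term vectors, a common building block for clustering documents. The class name TermVectorSimilarity and the map-based vector representation are assumptions made for this example; they are not Carrot's or Lucene's actual data structures.

```java
import java.util.Map;

class TermVectorSimilarity {

    // Cosine similarity between two sparse term vectors (term -> weight).
    // Returns 0.0 when either vector is empty or all-zero.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = Map.of("lucene", 1.0, "cluster", 2.0);
        Map<String, Double> d2 = Map.of("lucene", 2.0, "index", 1.0);
        System.out.println(cosine(d1, d2));
    }
}
```

Documents whose vectors score close to 1.0 share most of their weighted terms and are candidates for the same cluster; which weights go into the vectors is a separate choice.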

> After doing a typical Lucene search, how can I get the top 5 "key
> terms" for each of the top ten documents? I was thinking that I could
> sum these and then have a type of cluster.

The question is ill-defined, I'm afraid. The "top 5 key terms" are very
subjective and depend on the strategy of score calculation, the way
you're pruning stop words, etc.

**** What I meant was the top 5 terms of a document, where "top" is based on
the weight w_i = tf_i * idf_i (term frequency times inverse document frequency).
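
A minimal sketch of that ranking, assuming the per-document term frequencies and the index-wide document frequencies have already been extracted into plain maps (in Lucene they would presumably come from the stored term vectors and the index statistics). The class name TfIdfTopTerms, the method names, and the choice of idf = ln(N / df) are assumptions for illustration; Lucene's own Similarity uses a different idf formula.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TfIdfTopTerms {

    // Assumed idf variant: ln(numDocs / docFreq). Requires 0 < docFreq <= numDocs.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Returns up to k terms of one document, ordered by descending
    // w_i = tf_i * idf_i (ties broken alphabetically).
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int k) {
        List<Map.Entry<String, Double>> weighted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df <= 0) {
                continue; // skip terms with no usable document frequency
            }
            weighted.add(Map.entry(e.getKey(), e.getValue() * idf(numDocs, df)));
        }
        weighted.sort((x, y) -> {
            int byWeight = Double.compare(y.getValue(), x.getValue());
            return byWeight != 0 ? byWeight : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, weighted.size()); i++) {
            top.add(weighted.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = Map.of("lucene", 3, "the", 10, "cluster", 2);
        Map<String, Integer> df = Map.of("lucene", 2, "the", 10, "cluster", 1);
        System.out.println(topTerms(tf, df, 10, 2)); // prints [lucene, cluster]
    }
}
```

Note that a term appearing in every document ("the" above) gets weight zero under this idf, so frequent stop words drop out of the top terms without an explicit stop list.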

I also don't get the "each of the top ten documents" part. Do you mean each
of the top ten documents within a cluster?

**** With the Carrot demo you used the top 100 documents returned from the query. I really
meant the top n documents from the query.
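
The "sum these" idea could be sketched as follows, assuming each of the top n documents has already been reduced to a map of its key terms and their weights. The class name KeyTermSummary and the method rankSummedTerms are hypothetical, and this is only one possible reading of the idea, not a full clustering algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyTermSummary {

    // Sums each term's weight over the per-document key-term maps and returns
    // the terms ordered by descending total (ties broken alphabetically).
    static List<String> rankSummedTerms(List<Map<String, Double>> keyTermsPerDoc) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> doc : keyTermsPerDoc) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<String> ranked = new ArrayList<>(totals.keySet());
        ranked.sort((a, b) -> {
            int byTotal = Double.compare(totals.get(b), totals.get(a));
            return byTotal != 0 ? byTotal : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> docs = List.of(
                Map.of("lucene", 2.0, "index", 1.0),
                Map.of("lucene", 1.5, "cluster", 3.0),
                Map.of("cluster", 1.0));
        System.out.println(rankSummedTerms(docs)); // prints [cluster, lucene, index]
    }
}
```

The highest-ranked terms could then serve as candidate cluster labels, with each document assigned to the labels it contains.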

**** I hope that is clearer now. Thanks for the response.


P.S. Please CC me directly; I read mails to newsgroups in batches every 
few days.

To unsubscribe, e-mail:
For additional commands, e-mail:
