lucene-java-user mailing list archives

From Soeren Pekrul <>
Subject Re: How to get Term Weights (document term matrix)?
Date Sat, 04 Nov 2006 09:51:12 GMT

Chris Hostetter wrote:
> You really, *REALLY* don't want to be doing this using the "Hits" class
> like in your example ...
>    1) this will re-execute your search behind the scenes many many times
>    2) the scores returned by "Hits" are pseudo-normalized ... they will be
>       meaningless for any sort of comparison.

Thank you very much Hoss.

> if your concern is making sure that the score you get back matches the
> score you would get from executing a search even if you change the
> Similarity, you could just make sure you use the lengthNorm and tf
> functions from the Similarity class just like TermScorer does

That sounds very good. I can get the term frequency and the document
frequency from the IndexReader. The number of tokens in a field (numTokens)
for the Similarity.lengthNorm function I can get from the term vector
(TermFreqVector), or I can use IndexReader.norms(String field) instead.
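
As a rough sketch of the calculation described above, the following class reimplements the tf, idf and lengthNorm formulas of Lucene's DefaultSimilarity in plain Java, without depending on Lucene. The class name TermWeightSketch, the sample numbers, and the way the three factors are multiplied into one weight are assumptions for illustration only. The real TermScorer also folds in the query weight and boosts, and a custom Similarity may define these functions differently.

```java
public class TermWeightSketch {

    // Term frequency factor: square root of the raw frequency.
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: log(numDocs / (docFreq + 1)) + 1.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Length normalization: 1 / sqrt(numTokens).
    public static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Assumed combination for one (term, field) cell of the matrix.
    public static float weight(int freq, int docFreq, int numDocs, int numTokens) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the term occurs 4 times in a 100-token field
        // and appears in 9 of 1000 documents.
        System.out.println(weight(4, 9, 1000, 100));
    }
}
```

In a real program, freq, docFreq and numDocs would come from the IndexReader, and numTokens from the term vector.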

The usage of TermQuery in my previous example is a simplification. The 
documents of my collection have some fields like title, abstract or 
keywords. The term weights in my document term matrix should include all 
fields of a document for a word (token). So in reality I used a
BooleanQuery that combines the possible TermQueries for a word.
Of course, I can sum the field weights of a term.
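
Summing the per-field weights of a term could look roughly like the sketch below. It is plain Java with no Lucene dependency. FieldWeightSum, FieldStats and the sample statistics are hypothetical names and values, and the per-field weight is assumed to be tf * idf * lengthNorm using the DefaultSimilarity formulas. Note that such a sum will not necessarily equal the score of a BooleanQuery, which also applies a coordination factor and query normalization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldWeightSum {

    // Statistics for one term in one field of one document (hypothetical inputs).
    public static final class FieldStats {
        final int freq;       // occurrences of the term in this field
        final int docFreq;    // documents containing the term in this field
        final int numTokens;  // total tokens in this field of the document

        public FieldStats(int freq, int docFreq, int numTokens) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.numTokens = numTokens;
        }
    }

    // Assumed per-field weight: sqrt(freq) * (log(numDocs / (docFreq + 1)) + 1) / sqrt(numTokens).
    public static double fieldWeight(FieldStats s, int numDocs) {
        if (s.freq == 0 || s.numTokens == 0) {
            return 0.0; // the term does not occur in this field
        }
        double tf = Math.sqrt(s.freq);
        double idf = Math.log(numDocs / (double) (s.docFreq + 1)) + 1.0;
        double norm = 1.0 / Math.sqrt(s.numTokens);
        return tf * idf * norm;
    }

    // Matrix cell for one (document, term) pair: sum over all fields.
    public static double termWeight(Map<String, FieldStats> fields, int numDocs) {
        double sum = 0.0;
        for (FieldStats s : fields.values()) {
            sum += fieldWeight(s, numDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, FieldStats> fields = new LinkedHashMap<String, FieldStats>();
        fields.put("title", new FieldStats(1, 9, 4));
        fields.put("abstract", new FieldStats(4, 9, 100));
        fields.put("keywords", new FieldStats(0, 0, 3));
        System.out.println(termWeight(fields, 1000));
    }
}
```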

> ... or you
> could keep executing a TermQuery for each term like you are now, but using
> a HitCollector so you get the raw score)
> take a look at the methods that take in a HitCollector.

That seems to be the easiest way for my BooleanQuery. I will start with 
this and change my current implementation.
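
The collector approach could be sketched as follows. To keep the example self-contained, it defines a stand-in Collector class with the same single callback as Lucene's HitCollector, collect(int doc, float score). The names RawScoreMatrixSketch, Collector and ColumnCollector are hypothetical, and the scores in main are made up. With Lucene on the classpath, the collector would instead extend org.apache.lucene.search.HitCollector and be passed to a search method that accepts one.

```java
import java.util.HashMap;
import java.util.Map;

public class RawScoreMatrixSketch {

    // Stand-in for Lucene's HitCollector (assumption: same callback shape).
    public abstract static class Collector {
        public abstract void collect(int doc, float score);
    }

    // Gathers the raw scores for one term into a sparse matrix column:
    // document id -> raw score.
    public static class ColumnCollector extends Collector {
        public final Map<Integer, Float> column = new HashMap<Integer, Float>();

        @Override
        public void collect(int doc, float score) {
            if (score > 0.0f) {
                column.put(doc, score);
            }
        }
    }

    public static void main(String[] args) {
        ColumnCollector collector = new ColumnCollector();
        // A real searcher would invoke collect() once per matching document;
        // here it is called by hand with made-up values.
        collector.collect(0, 2.5f);
        collector.collect(7, 0.4f);
        System.out.println(collector.column);
    }
}
```

One collector per term gives one column of the document term matrix, and the scores arrive unnormalized, so they stay comparable across terms.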

Have a nice weekend.

