lucene-java-user mailing list archives

From Itamar Syn-Hershko
Subject Re: Scoring exact matches higher in a stemmed field
Date Thu, 22 Jul 2010 21:44:44 GMT
On 22/7/2010 9:20 PM, Shai Erera wrote:
> How is that different than extending QP?
Mainly because the problem I'm having isn't in the query parser, so 
working around it there doesn't feel right and certainly wouldn't solve 
the underlying issue. I want to explore the other options before doing 
anything, and I started this thread because I hit a dead end after 
finding that Similarity can no longer help.

> About the "song of songs" example -- the result you describe is already what
> will happen. A document which contains just the word 'song' will score lower
> than a document containing "song of songs".
Incorrect, and I have a sample app that shows it (which is how I came 
up with this example in the first place).

At indexing time the 2 words are saved into the index as 1:(song 
song$) 2:(song songs$), so a short document containing only the word 
"song" will score higher than a longer document containing "song of 
songs". This is a product of Lucene's default tf/idf implementation, 
which takes a field's length into account, and at this stage I want to 
avoid replacing it (with BM25 for 

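To make the length effect concrete, here is a rough sketch of the 
arithmetic. It assumes the classic DefaultSimilarity factors 
tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), and it assumes that 
every emitted token (the $-ed ones and "of" included) counts toward the 
field length. The class and method names are made up for illustration. 
idf, queryNorm and coord are left out because they are identical for 
both documents in a single-term query, and the lossy one-byte norm 
encoding is ignored.

```java
public class LengthNormSketch {
    // Term-frequency factor of the classic tf/idf similarity: sqrt(freq).
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length normalization of the classic tf/idf similarity: 1/sqrt(numTerms).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Only the part of the score that differs between the two documents.
    public static double partialScore(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Assumed token streams for a query on the stem "song":
        // Doc A "song"          -> song, song$                       (freq 1, length 2)
        // Doc B "song of songs" -> song, song$, of, of$, song, songs$ (freq 2, length 6)
        double shortDoc = partialScore(1, 2); // about 0.707
        double longDoc = partialScore(2, 6);  // about 0.577
        System.out.println("short doc: " + shortDoc);
        System.out.println("long doc:  " + longDoc);
    }
}
```

Under these assumptions the one-word document wins even though the 
longer one matches the stem twice. If the $-ed tokens are not counted 
toward the length, the factors become 1/sqrt(1) and sqrt(2)/sqrt(3), 
and the short document still comes out ahead.
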
> Also, what I'd do in such a case
> is search for the phrase (in addition to the rest), 'cause documents
> containing the word "songs" 100 times will score higher than the single
> document that will contain "song of songs" once ...
In one of my applications I am providing an "as typed" capability, which 
does exactly what you are suggesting (looking for the $-ed terms only), 
but I want my original analyzer (the one that also looks for non $-ed 
terms) to do better scoring. Without this the implementation is somewhat 

> If you just want a query "abc def" to rank higher if a document contains the
> exact words, then I'd go w/ the QP extension approach, or do other
> sophistication like searching for 'abc' '\"abc\"' etc. or something like
> that. There are many tricks you can do on your end, w/o overriding much in
> Lucene. Still, IMO extending QP is the easiest and gives you the control you
> need.
I am overriding stuff in Lucene either way. I also don't want an exact 
match of a phrase to rank higher; I want an original term (saved as-is 
with a $ marker) to score higher than a stemmed / lemmatized one 
(without the marker). Sorry if the thread's title is misleading.
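
For reference, the ranking I am after can be written down as a 
query-side sketch. This is the QP-style workaround discussed above, not 
the scoring hook I am asking about, and everything in it is 
hypothetical: the class, the toy stemmer that strips a trailing "s", 
and the boost value are placeholders, and no Lucene API is involved.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerBoostSketch {
    /** One optional query clause: an index term plus the boost applied to it. */
    public static final class Clause {
        public final String term;
        public final float boost;

        Clause(String term, float boost) {
            this.term = term;
            this.boost = boost;
        }

        @Override
        public String toString() {
            return term + "^" + boost;
        }
    }

    // Hypothetical toy stemmer: strips one trailing "s".
    // A real analyzer would use a proper stemmer or lemmatizer here.
    public static String toyStem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Expands a typed word into the as-typed ($-ed) term with a boost
    // and the stemmed term without one.
    public static List<Clause> expand(String typedWord, float exactBoost) {
        List<Clause> clauses = new ArrayList<Clause>();
        clauses.add(new Clause(typedWord + "$", exactBoost));
        clauses.add(new Clause(toyStem(typedWord), 1.0f));
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(expand("songs", 2.0f));
    }
}
```

A document that contains "songs" as typed matches both clauses, while a 
document that only contains "song" matches just the unboosted stem 
clause.
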

I'd have used payloads if they weren't so costly. So my question is: 
where do I have control over boosting (or scoring) and also have access 
to the term's text?
