lucene-java-user mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Scoring exact matches higher in a stemmed field
Date Mon, 19 Jul 2010 18:24:51 GMT
On 19/7/2010 5:50 PM, Shai Erera wrote:
> If your analyzer outputs b and b$ in the same position, then the below query
> will already be what the QP outputs today. If you want to incorporate
> boosting, I can suggest that you extend QP, override newTermQuery for
> example, and if the term is a stemmed term, then set the query's boost
> (Query.setBoost) accordingly. Would that work for you?
>    
I want to avoid overriding the QP, and do this as a pluggable extension. 
What other options do I have other than what you've suggested?

Ideally, that would be through a class or a function I can override or 
extend, so each term hit while searching will be examined. By checking 
its type and text (for suffix), that interface could double its weight 
(or boost). The similarity functions I mentioned could have provided 
this ability (see below). How can this be done without them?
> You'll need to check whether you want to boost terms inside phrases, or
> entire phrases, and then override more methods from QP. But that approach
> will get you the native product of the engine, I think.
Just to make sure we are on the same page here, here's an example 
(assuming the default tf/idf implementation in Lucene).

I want to make sure anyone searching for "song of songs" will find texts 
discussing the biblical book, and have them ranked the highest, instead 
of having short texts containing one word "song" score higher.

So what I do is have my stemming analyzer save the string "song of 
songs" like this, where each pair of parentheses represents a token 
position: (song song$) (song songs$).
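
To make the token layout concrete, here is a minimal plain-Java sketch of it. It does not use the Lucene analysis API. The class name, the toy stemmer (it only strips a trailing "s") and the stop-word list are assumptions made for illustration; in particular, dropping "of" as a stop word is inferred from the two positions shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExactFormTokens {
    // Hypothetical stop-word list; "of" is dropped so that the example
    // yields two positions, as in the message above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the", "a"));

    // Toy stemmer for illustration only: strips a single trailing "s".
    static String stem(String word) {
        return word.length() > 1 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Returns one list per token position: [stem, exact form + "$"].
    static List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // skip empty fragments and stop words
            }
            positions.add(Arrays.asList(stem(word), word + "$"));
        }
        return positions;
    }

    public static void main(String[] args) {
        // Prints [[song, song$], [song, songs$]]
        System.out.println(analyze("song of songs"));
    }
}
```

In a real analyzer the two tokens of a position would be emitted with a position increment of 0 for the second one; the nested lists here only stand in for that.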

The part I'm missing is how to score terms with suffixes higher. The 
best approach seems to be to look at the term read by IndexReader and 
boost that match somehow. The assumption is that if IndexReader has read 
the term songs$, it was looked up, and therefore this is the exact 
word that was queried for.
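
As a rough illustration of the effect I'm after, here is a plain-Java sketch that weights matches on suffixed terms higher. It is not Lucene's tf/idf scoring and uses no Lucene classes; the class name, the boost value of 2.0 and the way documents are modelled as sets of terms are all assumptions for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SuffixBoostSketch {
    // Assumed boost: "double its weight" for exact-form terms.
    static final float EXACT_BOOST = 2.0f;

    // Weight of one matched term: terms marked with '$' count more.
    static float termWeight(String term) {
        return term.endsWith("$") ? EXACT_BOOST : 1.0f;
    }

    // Sums the weights of all query terms that occur in the document.
    static float score(List<String> queryTerms, Set<String> docTerms) {
        float total = 0f;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                total += termWeight(term);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Query terms for "song of songs" after analysis.
        List<String> query = Arrays.asList("song", "song$", "song", "songs$");
        // A document containing both "song" and "songs".
        Set<String> exactDoc = new HashSet<>(Arrays.asList("song", "song$", "songs$"));
        // A document containing only "song".
        Set<String> stemOnlyDoc = new HashSet<>(Arrays.asList("song", "song$"));
        System.out.println(score(query, exactDoc));    // 6.0
        System.out.println(score(query, stemOnlyDoc)); // 4.0
    }
}
```

The document that contains the exact form "songs" ends up ahead of the one that only matches on the stem, which is the ranking behaviour described above.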

Which is the best Lucene part to hijack for this mission?
> Alternatively, you
> can set a payload on the stemmed terms and incorporate that into Similarity,
> but that's more costly.
>    
I had mentioned payloads. They would get me exactly what I want but, as 
you say, are quite costly when used for almost every term in the index. 
If I could replace the suffix with payloads I would have done so (byte 
vs. byte), but I'm using the suffix for one other thing.
> I don't follow what's been deprecated on Sim that you cannot use anymore?
> All I see are 3 deprecated static methods which are related to norms ...
>    
In 2.3.2 there were these functions:

     public float idf(Term term, Searcher searcher)

     public float idf(Collection terms, Searcher searcher)

These were deprecated somewhere between that version and 2.9.2, and 
it seems like I could have used them for what I'm trying to do.

Thanks,

Itamar.


