lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Scoring exact matches higher in a stemmed field
Date Thu, 22 Jul 2010 18:20:45 GMT
>
> Ideally, that would be through a class or a function I can override or
> extend
>

How is that different than extending QP?

About the "song of songs" example -- the result you describe is already what
will happen. A document which contains just the word 'song' will score lower
than a document containing "song of songs". Also, what I'd do in such a case
is search for the phrase (in addition to the rest), because otherwise a
document containing the word "songs" 100 times will score higher than a
document that contains "song of songs" only once ...
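
A minimal sketch of the "search for the phrase in addition to the rest" idea,
assuming the user input is plain words with no query syntax of its own. The
class name, method name, and boost value are made up for illustration; the
only Lucene-specific part is the standard query syntax for a quoted phrase
with a ^boost suffix.

```java
import java.util.Locale;

final class PhraseBoostQuery {

    /**
     * Returns the original terms followed by the same text as a boosted
     * phrase clause. Assumes userQuery holds plain words only (no quotes,
     * operators, or field prefixes). phraseBoost is a hypothetical knob.
     */
    static String withBoostedPhrase(String userQuery, float phraseBoost) {
        String trimmed = userQuery.trim();
        // A single word (or empty input) gains nothing from a phrase clause.
        if (trimmed.indexOf(' ') < 0) {
            return trimmed;
        }
        String boost = String.format(Locale.ROOT, "%.1f", phraseBoost);
        return trimmed + " \"" + trimmed + "\"^" + boost;
    }

    public static void main(String[] args) {
        System.out.println(withBoostedPhrase("song of songs", 2.0f));
    }
}
```

The resulting string can be handed to the regular QP, so documents matching
the whole phrase pick up the extra clause on top of the per-term matches.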

If you just want a query "abc def" to rank a document higher when it contains
the exact words, then I'd go with the QP extension approach, or use some other
trick, like searching for 'abc' '\"abc\"' etc. There are many tricks you can do
on your end without overriding much in Lucene. Still, IMO extending QP is the
easiest and gives you the control you need.
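
As a rough sketch of the QP extension idea: the decision such an override has
to make is small. The class below is plain Java with no Lucene dependency, so
it only models the logic. ExactTermBoost, boostFor, and the 2.0 factor are
made-up names and values, and the trailing '$' marker is the convention from
the quoted message. In a QueryParser subclass, the newTermQuery override would
presumably pass term.text() to something like boostFor and hand the result to
Query.setBoost.

```java
final class ExactTermBoost {

    // Assumption from the thread: the analyzer marks the unstemmed
    // (exact) form of a word with a trailing '$', e.g. "songs$".
    static final char EXACT_MARKER = '$';

    // Hypothetical factor; the thread talks about doubling the weight.
    static final float EXACT_BOOST = 2.0f;

    /** Returns the boost to apply to a term query for the given term text. */
    static float boostFor(String termText) {
        boolean exact = termText.length() > 0
                && termText.charAt(termText.length() - 1) == EXACT_MARKER;
        return exact ? EXACT_BOOST : 1.0f;
    }

    public static void main(String[] args) {
        String[] terms = {"song", "song$", "songs$"};
        for (String t : terms) {
            System.out.println(t + " -> " + boostFor(t));
        }
    }
}
```

Keeping the rule in one small method means phrases can reuse it later if you
decide to boost terms inside phrases as well.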

Shai

On Mon, Jul 19, 2010 at 9:24 PM, Itamar Syn-Hershko <itamar@code972.com> wrote:

> On 19/7/2010 5:50 PM, Shai Erera wrote:
>
>> If your analyzer outputs b and b$ in the same position, then the below
>> query
>> will already be what the QP outputs today. If you want to incorporate
>> boosting, I can suggest that you extend QP, override newTermQuery for
>> example, and if the term is a stemmed term, then set the query's boost
>> (Query.setBoost) accordingly. Would that work for you?
>>
>>
> I want to avoid overriding the QP, and do this as a pluggable extension.
> What other options do I have other than what you've suggested?
>
> Ideally, that would be through a class or a function I can override or
> extend, so each term hit while searching will be examined. By checking its
> type and text (for suffix), that interface could double its weight (or
> boost). The similarity functions I mentioned could have provided this
> ability (see below). How can this be done without them?
>
>> You'll need to check whether you want to boost terms inside phrases, or
>> entire phrases, and then override more methods from QP. But that approach
>> will get you the native product of the engine, I think.
>>
> Just to make sure we are on the same page here, here's an example (assuming
> the default tf/idf implementation in Lucene).
>
> I want to make sure anyone searching for "song of songs" will find texts
> discussing the biblical book, and have them ranked the highest, instead of
> having short texts containing one word "song" score higher.
>
> So what I do is have my stemming analyzer save the string "song of songs"
> like this, where each parenthesis represents a token position: (song song$)
> (song songs$).
>
> The part I'm missing is how to score terms with suffixes higher. The best
> approach seems to be to look at the term read by IndexReader and boost that
> hit somehow. The assumption is that if IndexReader has read the term songs$,
> it has been looked for, and therefore this is the exact word that was
> queried for.
>
> Which is the best Lucene part to hijack for this mission?
>
>> Alternatively, you
>> can set a payload on the stemmed terms and incorporate that into
>> Similarity,
>> but that's more costly.
>>
>>
> I had mentioned payloads: they would get me exactly what I want but, as you
> say, are quite costly when used for almost every term in the index. If I
> could replace the suffix with payloads I would have done so (byte vs.
> byte), but I'm using the suffix for one other thing.
>
>> I don't follow what's been deprecated on Sim that you cannot use anymore?
>> All I see are 3 deprecated static methods which are related to norms ...
>>
>>
> In 2.3.2 there were these functions:
>
>    public float idf(Term term, Searcher searcher)
>
>    public float idf(Collection terms, Searcher searcher)
>
> These have been deprecated somewhere between that version and 2.9.2, and it
> seems like I could have used those for what I'm trying to do.
>
> Thanks,
>
> Itamar.
