lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Scoring exact matches higher in a stemmed field
Date Mon, 19 Jul 2010 14:50:51 GMT
If your analyzer outputs b and b$ in the same position, then the query below
is already what the QP outputs today. If you want to incorporate boosting, I
suggest that you extend QP, override newTermQuery for example, and, if the
term is a stemmed term, set the query's boost (Query.setBoost) accordingly.
Would that work for you?
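
A minimal sketch of the idea above, in plain Java with no Lucene dependency,
so it only models the behavior. Token, toyStem, analyze and toBoostedQuery are
hypothetical names, the stemmer is a toy, and the 2.0 boost is borrowed from
the "b OR b$^2" example quoted below. It assumes the marked (b$) terms are the
ones to favor. In Lucene itself the same-position relationship is carried by
the token's position increment, and the boost would be set inside an
overridden newTermQuery via Query.setBoost rather than by building a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java model only: none of these types are Lucene classes.
final class Token {
    final String text;
    final int positionIncrement; // 0 = same position as the previous token

    Token(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class ExactBoostSketch {
    static final String EXACT_MARKER = "$";

    // Hypothetical stand-in for a real stemmer: strips a trailing "es" or "s".
    static String toyStem(String word) {
        if (word.endsWith("es") && word.length() > 3) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("s") && word.length() > 2) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Emits the stem first, then the marked original at the same position.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Token(toyStem(word), 1));
            out.add(new Token(word + EXACT_MARKER, 0));
        }
        return out;
    }

    // Groups same-position tokens into one clause and boosts the marked ones.
    static String toBoostedQuery(List<Token> tokens, float exactBoost) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (t.positionIncrement > 0) {
                if (sb.length() > 0) {
                    sb.append(") ");
                }
                sb.append('(');
            } else {
                sb.append(' ');
            }
            sb.append(t.text);
            if (t.text.endsWith(EXACT_MARKER)) {
                sb.append('^').append(exactBoost);
            }
        }
        if (sb.length() > 0) {
            sb.append(')');
        }
        return sb.toString();
    }
}
```

Calling ExactBoostSketch.toBoostedQuery(ExactBoostSketch.analyze("buffalo
buffaloes"), 2.0f) yields one clause per position, with only the marked term
in each clause carrying the boost.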

You'll need to check whether you want to boost terms inside phrases, or
entire phrases, and then override more methods from QP. But that approach
will get you the native product of the engine, I think. Alternatively, you
can set a payload on the stemmed terms and incorporate that into Similarity,
but that's more costly.
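
To make the payload alternative concrete, here is a toy model, again in plain
Java rather than Lucene code. Posting, index and score are hypothetical names,
and the scoring is deliberately naive (no idf, norms or coord): each posting
simply stores a weight that stands in for a payload, and matching postings
add their weight to the score. It assumes the marked tokens get the larger
weight, which has the same effect as down-weighting the stemmed ones.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only: these are not Lucene classes.
final class Posting {
    final String term;
    final float payloadWeight; // stands in for a per-position payload

    Posting(String term, float payloadWeight) {
        this.term = term;
        this.payloadWeight = payloadWeight;
    }
}

final class PayloadWeightSketch {
    // Index time: attach a weight to every token.
    // Assumption: tokens marked with "$" get exactWeight, all others get 1.0.
    static List<Posting> index(List<String> tokens, float exactWeight) {
        List<Posting> postings = new ArrayList<>();
        for (String t : tokens) {
            postings.add(new Posting(t, t.endsWith("$") ? exactWeight : 1.0f));
        }
        return postings;
    }

    // Search time: sum the stored weights of every posting matching a query term.
    static float score(List<Posting> doc, List<String> queryTerms) {
        float total = 0f;
        for (Posting p : doc) {
            if (queryTerms.contains(p.term)) {
                total += p.payloadWeight;
            }
        }
        return total;
    }
}
```

With the query terms "buffalo" and "buffaloes$", a document indexed as
(buffalo buffalo$) (buffalo buffaloes$) collects weight from both stemmed
postings and from the exact one, while a document indexed only as
(buffalo buffalo$) collects weight from a single stemmed posting.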

I don't follow: what's been deprecated on Similarity that you cannot use
anymore? All I see are 3 deprecated static methods, which are related to
norms ...

Shai

On Sat, Jul 17, 2010 at 9:04 PM, Itamar Syn-Hershko <itamar@code972.com> wrote:

> Shai, you got it right. I want to be able to send "b bb" through the QP
> with my custom analyzer, and get back "(b b$) (b bb$)" -- 2 terms with 2
> tokens in the same position for each.
>
> I want this to be a native product of the engine, as opposed to forcing
> this from the query end. I'm using different types of queries (Bool,
> DisMax), and I'm actually interested in using the QP itself. Instead of
> going through all sub-queries post-parsing and boosting terms ending with $,
> I want some sort of a plugin mechanism to do this for me per result. The
> easiest path would be subclassing Similarity, if only the relevant functions
> hadn't been deprecated...
>
> Are there any other ways to do so? For example, is this doable with
> function queries (since access to the actual term is required)?
>
> Itamar.
>
> On 16/7/2010 8:01 PM, Shai Erera wrote:
>
>> Depends on which query, no? ;)
>>
>> Sounds like you want to simulate the QP behavior
>> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html for
>> boosting. Meaning, if for the query "b" you want to simulate the query
>> "b OR b$^2" and have matches of b$ count more than b, then I'd follow
>> how QP does it - create the query programmatically or something (I'm
>> not near the code at the moment so I cannot give a more concrete
>> approach).
>>
>> If you want b and b$ to count the same, then that's already the
>> behavior - i.e., docs containing both will score higher.
>>
>> If I misunderstood your question, then please correct me.
>>
>> Shai
>>
>> On Friday, July 16, 2010, Itamar Syn-Hershko <itamar@code972.com> wrote:
>>
>>
>>> Hi all,
>>>
>>>
>>> Consider the following string: "the buffalo buffaloes" [1].
>>>
>>>
>>> When passed through a stemming analyzer, the resulting tokens would be
>>> "buffalo buffalo" (assuming a good stemmer).
>>>
>>>
>>> To enable exact searches, say I mark the original term and index it at
>>> the same term position. So "the buffalo buffaloes" ->  (buffalo buffalo$)
>>> (buffalo buffaloes$) - now exact searches are allowed on the same field
>>> without having 2 different fields [2].
>>>
>>>
>>> However, with this approach default scoring isn't working well. What is
>>> my best option for boosting an exact match of this sort, while still
>>> using the same stemming analyzer and without using payloads on the marked
>>> token?
>>>
>>>
>>> In other words - how do I make documents containing "the buffalo
>>> buffaloes" considered more relevant than docs containing the word "buffalo"
>>> only once?
>>>
>>>
>>> The trick here is to boost the marked token if found at search time.
>>> While this sounds easy to do, I can't find the best approach to implementing
>>> this - esp. since Similarity.float Idf(Index.Term term, Searcher searcher)
>>> seems to have been deprecated for some reason.
>>>
>>>
>>> Itamar.
>>>
>>>
>>> [1]
>>> http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo :)
>>>
>>> [2] Rationale:
>>> http://www.code972.com/blog/2010/07/more-flexible-hebrew-indexing-hebmorph/
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
