lucene-java-user mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Scoring exact matches higher in a stemmed field
Date Sat, 17 Jul 2010 18:04:05 GMT
Shai, you got it right. I want to be able to send "b bb" through the QP
with my custom analyzer and get back "(b b$) (b bb$)": two positions,
each holding two tokens (the stem and the marked original term).
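
To make the token layout concrete, here is a minimal sketch in plain Java. It is only a toy model and not Lucene's TokenStream API: the Token class, analyze() and toyStem() are hypothetical names made up for illustration, and the "stemmer" just strips a trailing "es". Each word yields its stem with a position increment of 1, followed by the original term marked with $ at increment 0, i.e. at the same position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy model of the stem + marked-original token layout (not Lucene classes). */
final class MarkedTokenSketch {

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Hypothetical stand-in for a real stemmer: strips a trailing "es". */
    static String toyStem(String word) {
        return word.endsWith("es") ? word.substring(0, word.length() - 2) : word;
    }

    /** Emits, per word, the stem (increment 1) and the original + "$" (increment 0). */
    static List<Token> analyze(String text, UnaryOperator<String> stemmer) {
        List<Token> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input produces no tokens
            }
            tokens.add(new Token(stemmer.apply(word), 1));
            tokens.add(new Token(word + "$", 0));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("buffalo buffaloes", MarkedTokenSketch::toyStem));
    }
}
```

In a real analyzer the same effect would presumably come from a token filter that sets the position increment of the marked token to 0.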

I want this to be a native product of the engine, as opposed to forcing
this from the query end. I'm using different types of queries (Bool,
DisMax), and I'm actually interested in using the QP itself. Instead of
going through all sub-queries post-parsing and boosting terms ending
with $, I want some sort of plugin mechanism to do this for me per
result. The easiest path would be subclassing Similarity, if only the
relevant functions hadn't been deprecated...
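
For reference, the query-side pass described above (walking all sub-queries after parsing and boosting terms ending with $) could be sketched roughly as follows. This is a toy query tree in plain Java, not Lucene's Query classes: Node, TermNode, GroupNode and boostMarkedTerms() are hypothetical names, and the boost value of 2 is taken from the "b OR b$^2" example purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy query tree with a post-parse boosting pass (not Lucene's Query API). */
final class ExactBoostSketch {

    interface Node {}

    /** A single term clause with a boost, printed as term^boost when boosted. */
    static final class TermNode implements Node {
        final String term;
        float boost = 1.0f;

        TermNode(String term) {
            this.term = term;
        }

        @Override
        public String toString() {
            return boost == 1.0f ? term : term + "^" + boost;
        }
    }

    /** A group of clauses, standing in for Bool/DisMax style composite queries. */
    static final class GroupNode implements Node {
        final List<Node> clauses;

        GroupNode(Node... clauses) {
            this.clauses = new ArrayList<>(Arrays.asList(clauses));
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < clauses.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(clauses.get(i));
            }
            return sb.append(')').toString();
        }
    }

    /** Recursively multiplies the boost of every term that carries the $ marker. */
    static void boostMarkedTerms(Node node, float exactBoost) {
        if (node instanceof TermNode) {
            TermNode t = (TermNode) node;
            if (t.term.endsWith("$")) {
                t.boost *= exactBoost;
            }
        } else if (node instanceof GroupNode) {
            for (Node child : ((GroupNode) node).clauses) {
                boostMarkedTerms(child, exactBoost);
            }
        }
    }

    public static void main(String[] args) {
        // Hand-built stand-in for what the QP would return for "b bb".
        Node query = new GroupNode(
                new GroupNode(new TermNode("b"), new TermNode("b$")),
                new GroupNode(new TermNode("b"), new TermNode("bb$")));
        boostMarkedTerms(query, 2.0f);
        System.out.println(query);
    }
}
```

This is exactly the kind of per-query-type walk I would like to avoid, since every composite query type needs its own branch.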

Are there any other ways to do so? For example, is this doable with 
function queries (since access to the actual term is required)?

Itamar.

On 16/7/2010 8:01 PM, Shai Erera wrote:
> Depends on which query, no? ;)
>
> Sounds like you want to simulate the QP behavior
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html for
> boosting. Meaning, if for the query "b" you want to simulate the query
> "b OR b$^2" and have matches of b$ count more than b, then I'd follow
> how QP does it - create the query programmatically or something (I'm
> not near the code at the moment so I cannot give a more concrete
> approach).
>
> If you want b and b$ to count the same, then that's already the
> behavior - i.e., docs containing both will score higher.
>
> If I misunderstood your question, then please correct me.
>
> Shai
>
> On Friday, July 16, 2010, Itamar Syn-Hershko<itamar@code972.com>  wrote:
>    
>> Hi all,
>>
>>
>> Consider the following string: "the buffalo buffaloes" [1].
>>
>>
>> When passed through a stemming analyzer, the resulting tokens would be "buffalo buffalo" (assuming a good stemmer).
>>
>>
>> To enable exact searches, say I mark the original term and index it at the same term position. So "the buffalo buffaloes" -> (buffalo buffalo$) (buffalo buffaloes$) - now exact searches are allowed on the same field without having 2 different fields [2].
>>
>>
>> However, with this approach default scoring isn't working well. What is my best option for boosting an exact match of this sort, while still using the same stemming analyzer and without using payloads on the marked token?
>>
>>
>> In other words - how do I make documents containing "the buffalo buffaloes" considered more relevant than docs containing the word "buffalo" only once?
>>
>>
>> The trick here is to boost the marked token if found at search time. While this sounds easy to do, I can't find the best approach to implementing this - esp. since float Similarity.Idf(Index.Term term, Searcher searcher) seems to have been deprecated for some reason.
>>
>>
>> Itamar.
>>
>>
>> [1] http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo :)
>>
>> [2] Rationale: http://www.code972.com/blog/2010/07/more-flexible-hebrew-indexing-hebmorph/
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org