lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Earwin Burrfoot <ear...@gmail.com>
Subject Re: Efficient Query Evaluation using a Two-Level Retrieval Process
Date Mon, 16 Nov 2009 17:44:08 GMT
This algo is strictly tied to sort-by-score, if I understand it correctly.
Lucene has queries and sorting decoupled (except for allowOutOfOrder
mess), so implementing it would require some really fat hacks.

On Mon, Nov 16, 2009 at 20:26, J. Delgado <joaquin.delgado@gmail.com> wrote:
> As I understood it setMinimumNumberShouldMatch(int min) Is used to
> specify a minimum number of the optional BooleanClauses which must be
> satisfied.
>
> I haven't seen the implementation of setMinimumNumberShouldMatch but
> it seems a bit different than what is intended with the WAND operator,
> which can take any real number as threshold θ
>
> As stated in the paper:
>
> WAND(X1,w1, . . . Xk,wk, θ) is true iff X 1≤i≤k and SUM(xiwi)≥ θ
>
> where xi is the indicator variable for Xi, that is xi =  1, if Xi is
> true 0, otherwise.
>
> Observe that WAND can be used to implement AND
> and OR via
> AND(X1,X2, . . .Xk) ≡ WAND(X1, 1,X2, 1, . . . Xk, 1, k),
> and
> OR(X1,X2, . ..Xk) ≡ WAND(X1, 1,X2, 1, . ..Xk, 1, 1).
>
> What I find interesting is the idea of using a first pass using the
> upper bound (maximal) contribution of a term on any document score and
> the dynamic setting of the threshold θ to skip or to fully evaluate a
> document..
>
> As stated in the paper:
>
> "Given this setup our preliminary scoring consists of evaluating
> for each document d
> WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
> where Xi is an indicator variable for the presence of query term i in
> document d and the threshold θ is varied during
> the algorithm as explained below. If WAND evaluates to true, then the
> document d undergoes a full evaluation.
> The threshold θ is set dynamically by the algorithm based on the
> minimum score m among the top n results found so
> far, where n is the number of requested documents. The larger the
> threshold, the more documents will be skipped
> and thus we will need to compute full scores for fewer documents."
>
> I think its worth a try...
>
> -- Joaquin
>
> On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki <ab@getopt.org> wrote:
>>
>> J. Delgado wrote:
>>>
>>> Here is the link to the paper.
>>> http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf
>>>
>>> A more recent application of the use and extension of the WAND operator for
>>> indexing of Boolean expressions:
>>> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
>>>
>>> -- Joaquin
>>>
>>>
>>> On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
>>> shalinmangar@gmail.com> wrote:
>>>
>>>> Hey Joaquin,
>>>>
>>>> The mailing list strips off attachments. Can you please upload it somewhere
>>>> and give us the link?
>>>>
>>>> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado <joaquin.delgado@gmail.com
>>>>>
>>>>> wrote:
>>>>> Please find attached the paper on "Efficient Query Evaluation using a
>>>>> Two-Level Retrieval Process". I believe that such approach may improve
>>>>
>>>> the
>>>>>
>>>>> way Lucene/Solr evaluates queries today.
>>
>> The functionality of WAND (weak AND) is already implemented in Lucene, if I understand
it correctly - this is the BooleanQuery.setMinShouldMatch(int). Lucene implements this probably
differently from the algorithm described in the paper, so there may be still some benefits
from comparing the algorithms in Lucene's BooleanScorer[2] with this one ...
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message