lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Efficient Query Evaluation using a Two-Level Retrieval Process
Date Mon, 16 Nov 2009 21:41:12 GMT
Op maandag 16 november 2009 19:09:52 schreef J. Delgado:
> On Mon, Nov 16, 2009 at 9:44 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
> > This algo is strictly tied to sort-by-score, if I understand it correctly.
> > Lucene has queries and sorting decoupled (except for allowOutOfOrder
> > mess), so implementing it would require some really fat hacks.
> >
> 
> According to the paper on Indexing Boolean Expression (using the WAND
> algo), sorting can be done based on scores that are determined based
> weight assignment to key-value pairs:
> 
> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
> 
> So I believe this can be generalized to sorting by any doc attributes
> given the proper weight assignment model
> 
> Of course, the devil-is-in-the-details :-(

Certainly. There is also the somewhat related issue on avoiding the use
of positions when not all required terms are present. For the moment
there are only ideas there:
https://issues.apache.org/jira/browse/LUCENE-1252

Regards,
Paul Elschot



> 
> -- Joaquin
> 
> 
> > On Mon, Nov 16, 2009 at 20:26, J. Delgado <joaquin.delgado@gmail.com> wrote:
> >> As I understood it setMinimumNumberShouldMatch(int min) Is used to
> >> specify a minimum number of the optional BooleanClauses which must be
> >> satisfied.
> >>
> >> I haven't seen the implementation of setMinimumNumberShouldMatch but
> >> it seems a bit different than what is intended with the WAND operator,
> >> which can take any real number as threshold θ
> >>
> >> As stated in the paper:
> >>
> >> WAND(X1,w1, . . . Xk,wk, θ) is true iff X 1≤i≤k and SUM(xiwi)≥ θ
> >>
> >> where xi is the indicator variable for Xi, that is xi =  1, if Xi is
> >> true 0, otherwise.
> >>
> >> Observe that WAND can be used to implement AND
> >> and OR via
> >> AND(X1,X2, . . .Xk) ≡ WAND(X1, 1,X2, 1, . . . Xk, 1, k),
> >> and
> >> OR(X1,X2, . ..Xk) ≡ WAND(X1, 1,X2, 1, . ..Xk, 1, 1).
> >>
> >> What I find interesting is the idea of using a first pass using the
> >> upper bound (maximal) contribution of a term on any document score and
> >> the dynamic setting of the threshold θ to skip or to fully evaluate a
> >> document..
> >>
> >> As stated in the paper:
> >>
> >> "Given this setup our preliminary scoring consists of evaluating
> >> for each document d
> >> WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
> >> where Xi is an indicator variable for the presence of query term i in
> >> document d and the threshold θ is varied during
> >> the algorithm as explained below. If WAND evaluates to true, then the
> >> document d undergoes a full evaluation.
> >> The threshold θ is set dynamically by the algorithm based on the
> >> minimum score m among the top n results found so
> >> far, where n is the number of requested documents. The larger the
> >> threshold, the more documents will be skipped
> >> and thus we will need to compute full scores for fewer documents."
> >>
> >> I think its worth a try...
> >>
> >> -- Joaquin
> >>
> >> On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki <ab@getopt.org> wrote:
> >>>
> >>> J. Delgado wrote:
> >>>>
> >>>> Here is the link to the paper.
> >>>> http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf
> >>>>
> >>>> A more recent application of the use and extension of the WAND operator
for
> >>>> indexing of Boolean expressions:
> >>>> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
> >>>>
> >>>> -- Joaquin
> >>>>
> >>>>
> >>>> On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
> >>>> shalinmangar@gmail.com> wrote:
> >>>>
> >>>>> Hey Joaquin,
> >>>>>
> >>>>> The mailing list strips off attachments. Can you please upload it
somewhere
> >>>>> and give us the link?
> >>>>>
> >>>>> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado <joaquin.delgado@gmail.com
> >>>>>>
> >>>>>> wrote:
> >>>>>> Please find attached the paper on "Efficient Query Evaluation
using a
> >>>>>> Two-Level Retrieval Process". I believe that such approach may
improve
> >>>>>
> >>>>> the
> >>>>>>
> >>>>>> way Lucene/Solr evaluates queries today.
> >>>
> >>> The functionality of WAND (weak AND) is already implemented in Lucene, if
I understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). Lucene implements
this probably differently from the algorithm described in the paper, so there may be still
some benefits from comparing the algorithms in Lucene's BooleanScorer[2] with this one ...
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>> Andrzej Bialecki     <><
> >>>  ___. ___ ___ ___ _ _   __________________________________
> >>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >>> http://www.sigram.com  Contact: info at sigram dot com
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
> >
> >
> > --
> > Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> > ICQ: 104465785
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message