lucene-java-user mailing list archives

From lukai <lukai1...@gmail.com>
Subject Re: Long query optimisation: using some terms for scoring only
Date Tue, 11 Dec 2012 17:45:08 GMT
I implemented WAND in Solr for our own project. It can improve
performance a lot. For your reference:
http://dl.acm.org/citation.cfm?id=956944

But it requires changing the index a little bit.
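
For readers who have not seen WAND, here is a minimal sketch of the pivot-selection idea from the paper linked above. It is not the Solr implementation mentioned in this message and uses no Lucene or Solr APIs: the class and method names (WandSketch, TermCursor, topK) are hypothetical, the postings are plain in-memory arrays, scores are assumed non-negative, and k is assumed to be at least 1. Each term carries an upper bound on its score contribution (maxScore); a document is fully scored only if the bounds of the terms that could match it add up to more than the current k-th best score. Storing such per-term bounds is presumably the kind of index change referred to above, though that is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class WandSketch {
    /** One query term: postings sorted by doc ID, per-posting scores, and their maximum. */
    static final class TermCursor {
        final int[] docs;
        final double[] scores;
        final double maxScore; // upper bound on this term's contribution to any document
        int pos = 0;

        TermCursor(int[] docs, double[] scores) {
            this.docs = docs;
            this.scores = scores;
            this.maxScore = Arrays.stream(scores).max().orElse(0.0);
        }

        int doc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }

        void advanceTo(int target) { while (doc() < target) pos++; }
    }

    static final class Hit {
        final int doc;
        final double score;
        Hit(int doc, double score) { this.doc = doc; this.score = score; }
    }

    /** Returns the doc IDs of the k highest-scoring documents, best first (k >= 1). */
    static List<Integer> topK(List<TermCursor> terms, int k) {
        PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        List<TermCursor> cursors = new ArrayList<>(terms);
        while (true) {
            cursors.sort(Comparator.comparingInt(TermCursor::doc));
            // Until the heap is full, every document is a candidate.
            double threshold = heap.size() < k ? -1.0 : heap.peek().score;

            // Pivot: first term at which the accumulated upper bounds beat the threshold.
            double bound = 0.0;
            int pivot = -1;
            for (int i = 0; i < cursors.size(); i++) {
                if (cursors.get(i).doc() == Integer.MAX_VALUE) break; // exhausted
                bound += cursors.get(i).maxScore;
                if (bound > threshold) { pivot = i; break; }
            }
            if (pivot < 0) break; // no remaining document can enter the top k

            int pivotDoc = cursors.get(pivot).doc();
            if (cursors.get(0).doc() == pivotDoc) {
                // All terms up to the pivot sit on pivotDoc: score it fully.
                double score = 0.0;
                for (TermCursor c : cursors) {
                    if (c.doc() != pivotDoc) break; // sorted, so the rest are beyond pivotDoc
                    score += c.scores[c.pos];
                    c.pos++;
                }
                if (heap.size() < k) {
                    heap.add(new Hit(pivotDoc, score));
                } else if (score > heap.peek().score) {
                    heap.poll();
                    heap.add(new Hit(pivotDoc, score));
                }
            } else {
                // Documents before pivotDoc cannot beat the threshold: skip them unscored.
                for (int i = 0; i < pivot; i++) cursors.get(i).advanceTo(pivotDoc);
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<Integer> result = new ArrayList<>();
        for (Hit h : hits) result.add(h.doc);
        return result;
    }
}
```

In a real index the advanceTo step would use skip lists rather than a linear scan, and the savings grow as the threshold rises and more documents are skipped without being scored.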

Thanks,


On Tue, Dec 11, 2012 at 6:19 AM, Matthew Willson <matthew@swiftkey.net> wrote:

> Hi all
>
> I'm currently benchmarking Lucene to get an understanding of what
> optimisations are available for long queries, and wanted to check what the
> recommended approach is.
>
> Unsurprisingly a naive approach to long queries (just keep adding SHOULD
> clauses to a big BooleanQuery) scales close to linearly in the number of
> terms, which beyond a certain point isn't good enough.
>
> The obvious solution is to prune the query in order to reduce the number
> of documents which need scoring, and this is easy to do, but has the
> downside that none of the pruned terms are used for scoring.
>
> In Xapian there's a handy query operator called OP_AND_MAYBE, where only
> terms on the left-hand-side are used to select documents, with terms on the
> right-hand-side used for scoring only. This performs much better for long
> queries if less discriminative terms are moved onto the right-hand-side.
>
> I tried to replicate this approach in Lucene using the following query (in
> QueryParser syntax):
>
> +(some mandatory terms) and some other terms for scoring only
>
> The presence of a MUST clause in the outer BooleanQuery forces the
> remaining SHOULD clauses to be purely optional and not expand the set of
> documents scored, so this has the right semantics. However the performance
> benefit isn't there -- in a test with 200 query terms in total, it quickly
> becomes slower than a plain flat BooleanQuery once the number of terms in
> the mandatory part of the query exceeds 5 or so.
>
> Interestingly it's much much faster (~40ms) when there's only one
> mandatory term, than when there are two terms in the mandatory clause
> (~2500ms), which leads me to suspect an obvious optimisation is being
> missed.
>
> Anyone have any ideas on this, pointers to other relevant query types or
> optimisations available in Lucene 4, or on which parts of the
> Query/Weight/Scorer code we'd need to change to speed up this kind of thing?
>
> Cheers
> -Matt
