lucene-java-user mailing list archives

From Matthew Willson <matt...@swiftkey.net>
Subject Re: Long query optimisation: using some terms for scoring only
Date Tue, 11 Dec 2012 18:13:33 GMT
Hi lukai

That sounds like a nice optimisation, perhaps more sophisticated than 
the "AND_MAYBE" support I was looking for but a similar idea. Is the 
code available anywhere?

Cheers
-Matt


On 11/12/12 17:45, lukai wrote:
> I implemented WAND in Solr for our own project. It can improve
> performance a lot. For your reference:
> http://dl.acm.org/citation.cfm?id=956944
>
> However, it requires a small change to the index.
>
> Thanks,
>
>
> On Tue, Dec 11, 2012 at 6:19 AM, Matthew Willson <matthew@swiftkey.net> wrote:
>
>> Hi all
>>
>> I'm currently benchmarking Lucene to get an understanding of what
>> optimisations are available for long queries, and wanted to check what the
>> recommended approach is.
>>
>> Unsurprisingly a naive approach to long queries (just keep adding SHOULD
>> clauses to a big BooleanQuery) scales close to linearly in the number of
>> terms, which beyond a certain point isn't good enough.
>>
>> The obvious solution is to prune the query in order to reduce the number
>> of documents which need scoring, and this is easy to do, but has the
>> downside that none of the pruned terms are used for scoring.
>>
>> In Xapian there's a handy query operator called OP_AND_MAYBE, where only
>> terms on the left-hand-side are used to select documents, with terms on the
>> right-hand-side used for scoring only. This performs much better for long
>> queries if less discriminative terms are moved onto the right-hand-side.
>>
>> I tried to replicate this approach in Lucene using the following query (in
>> QueryParser syntax):
>>
>> +(some mandatory terms) and some other terms for scoring only
>>
>> The presence of a MUST clause in the outer BooleanQuery forces the
>> remaining SHOULD clauses to be purely optional and not expand the set of
>> documents scored, so this has the right semantics. However the performance
>> benefit isn't there -- in a test with 200 query terms in total, it quickly
>> becomes slower than a plain flat BooleanQuery once the number of terms in
>> the mandatory part of the query exceeds 5 or so.
>>
>> Interestingly, it's much faster (~40ms) with only one mandatory term
>> than with two terms in the mandatory clause (~2500ms), which leads me
>> to suspect an obvious optimisation is being missed.
>>
>> Anyone have any ideas on this, pointers to other relevant query types or
>> optimisations available in Lucene 4, or on which parts of the
>> Query/Weight/Scorer code we'd need to change to speed up this kind of thing?
>>
>> Cheers
>> -Matt
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
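
For anyone following the thread, here is a minimal sketch of the "select with some terms, score with all of them" semantics described in the quoted message (Xapian's OP_AND_MAYBE, or a MUST group plus SHOULD clauses in Lucene). It deliberately does not use the Lucene API: the toy in-memory index, the class name AndMaybeSketch, the methods buildIndex and search, and the "one point per matching term" score are all assumptions made for illustration, not how Lucene or Xapian score documents.

```java
import java.util.*;

public class AndMaybeSketch {
    // Toy inverted index (hypothetical): term -> IDs of documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Returns docId -> score. A document is a candidate only if it matches at
    // least one selecting term; scoring-only terms raise the score of existing
    // candidates but never add new documents to the result.
    static Map<Integer, Integer> search(Map<String, Set<Integer>> index,
                                        List<String> selecting,
                                        List<String> scoringOnly) {
        Map<Integer, Integer> scores = new TreeMap<>();
        for (String term : selecting) {
            for (int doc : index.getOrDefault(term, Collections.emptySet())) {
                scores.merge(doc, 1, Integer::sum);
            }
        }
        for (String term : scoringOnly) {
            for (int doc : index.getOrDefault(term, Collections.emptySet())) {
                // Toy scoring assumption: +1 per matching term.
                scores.computeIfPresent(doc, (d, s) -> s + 1);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene query optimisation",
            "lucene scoring",
            "xapian query scoring");
        Map<String, Set<Integer>> index = buildIndex(docs);
        // Document 2 matches "query" and "scoring" but not the selecting term,
        // so it is never scored.
        System.out.println(search(index,
            Arrays.asList("lucene"),
            Arrays.asList("query", "scoring", "optimisation")));
        // prints {0=3, 1=2}
    }
}
```

The point of the sketch is only the shape of the result set: the cost of candidate selection depends on the selecting terms, while the scoring-only terms are consulted just for documents already selected. Whether a real engine gets a speed-up from this depends on how its scorers advance through the postings lists, which is exactly the question raised in the thread.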

