lucene-java-user mailing list archives

From Matthew Willson <matt...@swiftkey.net>
Subject Long query optimisation: using some terms for scoring only
Date Tue, 11 Dec 2012 14:19:49 GMT
Hi all

I'm currently benchmarking Lucene to get an understanding of what 
optimisations are available for long queries, and wanted to check what 
the recommended approach is.

Unsurprisingly, a naive approach to long queries (just keep adding SHOULD 
clauses to a big BooleanQuery) gives query times that scale close to 
linearly in the number of terms, which beyond a certain point isn't good 
enough.

The obvious solution is to prune the query in order to reduce the number 
of documents which need scoring, and this is easy to do, but has the 
downside that none of the pruned terms are used for scoring.
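
To make the pruning step concrete, here is a minimal sketch in plain Java. 
It assumes document frequency is used as the measure of how discriminative 
a term is, which is only one possible choice. The class name, method name 
and the docFreq map are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Lucene API: prunes a long query down to its
// rarest terms, on the assumption that rare terms are the most discriminative.
final class QueryPruningSketch {

    // Returns up to `keep` query terms with the lowest document frequency.
    // Terms missing from `docFreq` are treated as maximally common, so they
    // are the first to be dropped. Ties keep their original query order
    // because List.sort is stable.
    static List<String> prune(Map<String, Integer> docFreq,
                              List<String> queryTerms,
                              int keep) {
        List<String> sorted = new ArrayList<>(queryTerms);
        sorted.sort(Comparator.comparingInt(
                t -> docFreq.getOrDefault(t, Integer.MAX_VALUE)));
        return new ArrayList<>(sorted.subList(0, Math.min(keep, sorted.size())));
    }
}
```

Only the returned terms would go into the BooleanQuery; everything dropped 
here contributes nothing to the score, which is exactly the downside 
described above.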

In Xapian there's a handy query operator called OP_AND_MAYBE, where only 
terms on the left-hand-side are used to select documents, with terms on 
the right-hand-side used for scoring only. This performs much better for 
long queries if less discriminative terms are moved onto the 
right-hand-side.
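
The semantics I'm after can be sketched with a toy in-memory model. This 
is not Xapian or Lucene code: the class, its methods and the per-term 
weights are hypothetical, and scoring is assumed to be a simple sum of 
per-term weights.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of AND_MAYBE-style semantics; not Xapian or Lucene code.
final class AndMaybeSketch {

    // Hypothetical index: term -> (docId -> weight of that term in that doc).
    private final Map<String, Map<Integer, Double>> postings = new HashMap<>();

    void add(String term, int docId, double weight) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(docId, weight);
    }

    // `selecting` terms decide which documents are returned (a match on any
    // one of them is enough). `scoringOnly` terms never add documents; they
    // only raise the score of documents that were already selected.
    Map<Integer, Double> search(List<String> selecting, List<String> scoringOnly) {
        Map<Integer, Double> scores = new TreeMap<>();
        for (String term : selecting) {
            Map<Integer, Double> p =
                    postings.getOrDefault(term, Collections.<Integer, Double>emptyMap());
            for (Map.Entry<Integer, Double> e : p.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        for (String term : scoringOnly) {
            Map<Integer, Double> p =
                    postings.getOrDefault(term, Collections.<Integer, Double>emptyMap());
            // Look up only the already-selected documents in this term's postings.
            for (Map.Entry<Integer, Double> e : scores.entrySet()) {
                Double w = p.get(e.getKey());
                if (w != null) {
                    e.setValue(e.getValue() + w);
                }
            }
        }
        return scores;
    }
}
```

The point of the second loop is that the work for a scoring-only term is 
bounded by the number of selected documents rather than by the length of 
that term's postings list, which is where I'd expect the saving to come 
from when the common terms are moved to the right-hand side.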

I tried to replicate this approach in Lucene using the following query 
(in QueryParser syntax):

+(some mandatory terms) and some other terms for scoring only

The presence of a MUST clause in the outer BooleanQuery forces the 
remaining SHOULD clauses to be purely optional and not expand the set of 
documents scored, so this has the right semantics. However, the 
performance benefit isn't there: in a test with 200 query terms in 
total, it quickly becomes slower than a plain flat BooleanQuery once the 
number of terms in the mandatory part of the query exceeds 5 or so.

Interestingly, it's much faster when there's only one mandatory term 
(~40ms) than when there are two terms in the mandatory clause 
(~2500ms), which leads me to suspect an obvious optimisation is being 
missed.

Does anyone have any ideas on this, pointers to other relevant query 
types or optimisations available in Lucene 4, or suggestions on which 
parts of the Query/Weight/Scorer code we'd need to change to speed up 
this kind of thing?

Cheers
-Matt
