lucene-java-user mailing list archives

From Tim Sturge <>
Subject Re: Slow queries with lots of hits
Date Fri, 05 Dec 2008 20:00:33 GMT
I think we're going to cheat with this one. Two options:

1) Add a term, with a high enough setBoost(), to the good documents, and add
this term to queries that are insufficiently restrictive. This way only
high-scoring documents will be considered at all (sketch below).

2) Sort the documents by boost before index build time (so the highest-boost
documents get the lowest doc numbers), then stop the HitCollector once we have
10000 matching results (collector sketch below).
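
A rough sketch of 1) against the Lucene 2.4-era API (the "premium" field name
and the boost value are made up for illustration, not a drop-in):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class PremiumMarker {
    // Index time: tag a document we consider high quality with a marker term.
    static void tag(Document doc) {
        Field marker = new Field("premium", "yes", Field.Store.NO,
                                 Field.Index.NOT_ANALYZED);
        marker.setBoost(10.0f); // hypothetical value; tune to taste
        doc.add(marker);
    }

    // Query time: require the marker when the query is too broad, so only
    // the tagged (high-scoring) documents are considered at all.
    static Query restrict(Query original) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(original, BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("premium", "yes")),
               BooleanClause.Occur.MUST);
        return bq;
    }
}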

2) is likely better but 1) is easier. I think an "optimize_by_boost()" (which
would automatically make high-scoring documents have low doc numbers) would be
interesting, but it's more than I want to consider given how complex it is
relative to the simplicity of the above.
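
For 2), the collector itself is simple; the only wrinkle is that
HitCollector.collect() has no way to say "stop", so the usual trick is to bail
out with an unchecked exception. A rough sketch, assuming doc numbers are
already in descending-boost order:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

class EarlyTerminatingCollector extends HitCollector {
    static class Done extends RuntimeException {}

    private final int limit;
    private final List<Integer> hits = new ArrayList<Integer>();

    EarlyTerminatingCollector(int limit) { this.limit = limit; }

    public void collect(int doc, float score) {
        hits.add(doc); // low doc number == high boost, given the sort
        if (hits.size() >= limit) {
            throw new Done(); // abandon the rest of the (huge) hit set
        }
    }

    List<Integer> getHits() { return hits; }
}

// Usage:
// EarlyTerminatingCollector c = new EarlyTerminatingCollector(10000);
// try { searcher.search(query, c); }
// catch (EarlyTerminatingCollector.Done expected) { /* hit the limit */ }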


On 12/4/08 7:18 PM, "Otis Gospodnetic" <> wrote:

> Tim (and we should move this to java-dev if it gains traction),
> Perhaps you can come up with a mechanism to perform scoring in two passes
> instead of one:
> - 1) a first pass that is cheap and fast
> - 2) a second pass that is more expensive and slower
> Currently there is no choice - Lucene does 2). But perhaps you can come up
> with a generic way to do 1)?
> Otis
> --
> Sematext -- -- Lucene - Solr - Nutch
> ----- Original Message ----
>> From: Tim Sturge <>
>> To: "" <>
>> Sent: Thursday, December 4, 2008 3:27:30 PM
>> Subject: Slow queries with lots of hits
>> Hi all,
>> I have an interesting problem with my query traffic. Most of the queries run
>> in a fairly short amount of time (< 100ms) but a few take over 1000ms. These
>> queries are predominantly those with a huge number of hits (>1 million hits
>> in a >100 million document index). The time taken (as far as I can tell) is
>> for Lucene to sit there while it scores and sorts all these results.
>> However it turns out these queries don't really have top results. That is,
>> of the million documents, there are easily 10000 which are decent results
>> (basically those above some threshold score). Frankly, just returning some
>> consistent (so paging and reload work) but otherwise arbitrary ranking of
>> these 10000 results would be more than good enough.
>> It seems to me that a solution would be to impose some sort of pseudo-random
>> filter (e.g. consider only every n-th document, assuming the good results
>> are uniformly distributed). I'm wondering if anyone else has experience with
>> this sort of issue and what solutions they have found to work well in
>> practice.
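>> A rough sketch of the kind of filter I mean, against the Filter.bits() API
>> (the class name and the stride value are made up):
>>
>> import java.io.IOException;
>> import java.util.BitSet;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.search.Filter;
>>
>> public class SamplingFilter extends Filter {
>>     private final int stride;
>>     public SamplingFilter(int stride) { this.stride = stride; }
>>
>>     // Let only every stride-th doc id through, so the searcher scores
>>     // at most maxDoc/stride of the huge hit set.
>>     public BitSet bits(IndexReader reader) throws IOException {
>>         BitSet bits = new BitSet(reader.maxDoc());
>>         for (int doc = 0; doc < reader.maxDoc(); doc += stride) {
>>             bits.set(doc);
>>         }
>>         return bits;
>>     }
>> }
>>
>> // e.g. searcher.search(query, new SamplingFilter(100), 10000);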
>> Thanks,
>> Tim
