lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Sturge <>
Subject Re: Slow queries with lots of hits
Date Thu, 04 Dec 2008 20:52:54 GMT
That makes sense. I should be more precise in that all I need is 100 of the
10000 "reasonable" results.

The concern I would have with a TopDocCollector is that this is biased
towards the top of the index which translates for me into a bias for older
documents. I'd prefer no age bias or a newer document bias. So I'll see what
I can do with a "BottomDocCollector" :-)


On 12/4/08 12:39 PM, "Erick Erickson" <> wrote:

> The problem here is how *could* a system return even the top
> 10,000 results without scoring them all? What if the millionth
> hit resulted in the very best match in the entire corpus?
> That said, sorting may well be the issue here rather than scoring.
> You can use a TopDocCollector to get the top N matches (unsorted)
> and then do something like use the FieldSortedHitQueue to sort
> those N matches, leaving out all the rest of the matches. Note
> this assumes that when you say "sorting" you mean sorting
> by something other than relevance.....
> Hope this helps
> Erick
> On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <> wrote:
>> Hi all,
>> I have an interesting problem with my query traffic. Most of the queries
>> run
>> in a fairly short amount of time (< 100ms) but a few take over 1000ms.
>> These
>> queries are predominantly those with a huge number of hits (>1 million hits
>> in a >100 million document index). The time taken (as far as I can tell) is
>> for lucene to sit there while it scores and sorts all these results.
>> However it turns out these queries really don¹t have top results. That is,
>> of the million documents, there are easily 10000 which are decent results
>> (basically those above some threshold score). Frankly, just returning some
>> consistent (so paging and reload work) but
>> otherwise arbitrary ranking of these 10000 results would be more than good
>> enough.
>> It seems to me that a solution would be to impose some sort of
>> pseudo-random
>> filter (e.g. consider only every n-th document assuming they are uniformly
>> distributed). I¹m wondering if anyone else has experience with this sort of
>> issue and what solutions they have found to work well in practice.
>> Thanks,
>> Tim

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message