lucene-java-user mailing list archives

From Joe Shaw
Subject Re: index architectures
Date Wed, 18 Oct 2006 15:29:32 GMT

On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
> No they don't want that. They just want a small number. What happens is
> they enter some silly query, like searching for all stories with a single
> common non-stop-word in them, and with the usual sort criterion of by date
> (ie. a field) descending, and a limit of, say 25.
> So Lucene then presumably has to haul out a massive resultset, sort it, and
> return the top 25 (out of 500,000 or whatever).

I had a similar issue recently: users only want the 100 (or whatever)
most recently updated documents which match, and our documents aren't
stored in date-order.

Originally, we would walk the result set, instantiate a Document
instance for each hit, pull out its timestamp field, and keep around
the 100 most recent documents.  Obviously this is extremely slow for
large result sets.
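
For reference, the "keep around the top 100" step can be done with a
bounded min-heap, so only N entries are held at a time. This is a rough
sketch in plain Java, not the code we actually ran: the class and method
names are made up, and the timestamps passed in stand for the values
pulled out of each instantiated Document (that per-hit load is the slow
part, and it is not shown here).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: keep the n most recent timestamps while scanning every hit. */
final class TopNScan {
    /** Returns the n largest timestamps, newest first. */
    static List<Long> mostRecent(Iterable<Long> hitTimestamps, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the oldest of the current top n sits at the head.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (long ts : hitTimestamps) {
            if (heap.size() < n) {
                heap.add(ts);
            } else if (ts > heap.peek()) {
                // Newer than the oldest kept entry: replace it.
                heap.poll();
                heap.add(ts);
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Every hit is still visited once, so the cost grows with the size of the
result set no matter how small N is.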

What I initially did to address this was store a reverse timestamp and
walk the list of terms in the reverse timestamp field (they're sorted
lexicographically), and return the 100 most recent matching documents.
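
To make the idea concrete, here is a small sketch in plain Java. It is
an illustration under assumptions, not our real code: the encoding
(Long.MAX_VALUE minus the timestamp, zero-padded to 19 digits) is just
one way to build a reverse timestamp, a TreeMap stands in for the
index's sorted term list, and a Set of document ids stands in for the
search matches. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of walking reverse-timestamp terms in sorted order. */
final class ReverseTimestampWalk {
    /** Encodes a non-negative timestamp so that newer values sort first as strings. */
    static String reverseKey(long timestamp) {
        // Long.MAX_VALUE has 19 decimal digits; zero-padding makes
        // string order agree with numeric order.
        return String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - timestamp);
    }

    /** Walks terms in sorted order and collects up to n doc ids found in matchingDocs. */
    static List<Integer> mostRecent(TreeMap<String, List<Integer>> termToDocs,
                                    Set<Integer> matchingDocs, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // TreeMap iterates keys in ascending order, i.e. newest timestamp first.
        for (List<Integer> docs : termToDocs.values()) {
            for (int doc : docs) {
                if (matchingDocs.contains(doc)) {
                    result.add(doc);
                    if (result.size() == n) {
                        return result; // stop as soon as n matches are found
                    }
                }
            }
        }
        return result;
    }
}
```

The walk stops as soon as N matches are found, so its cost depends on
how many recent documents have to be skipped, not on the total number of
matches.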

In most cases this was a lot faster (for a search which returned 153,142
matches, I only had to walk 288 documents to find the 100 most recent),
but in some cases it was a lot slower (for another search which returned
339 matches, I had to walk 292,911 documents to find the 100 most
recent).

In the end I found that I could walk 5 terms in the time it took to
instantiate 2 documents, and tuned a heuristic so that in the worst
case (my second example) searches are 50% slower, but in almost all
other cases they're quite a bit faster.
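
One way such a heuristic could look (a guess for illustration only; the
details of the real one aren't given here): cap the term walk at a
budget derived from the 5-to-2 ratio, and fall back to instantiating
documents once the budget is used up. With a 50% cap, the wasted walk
never costs more than half of what the plain scan would have cost. The
class, method, and parameter names below are hypothetical.

```java
/** Hypothetical sketch of a term-walk budget based on the 5:2 cost ratio. */
final class WalkBudget {
    // From the message: about 5 terms can be walked in the time it
    // takes to instantiate 2 documents.
    static final double TERMS_PER_DOCUMENT = 5.0 / 2.0;

    /**
     * Maximum number of terms to walk before giving up, chosen so the
     * wasted walk costs at most maxOverhead (0.5 = 50%) of a full scan.
     */
    static long termBudget(long matchCount, double maxOverhead) {
        // Cost of instantiating every match, expressed in term-walk units.
        double scanCostInTerms = matchCount * TERMS_PER_DOCUMENT;
        return (long) Math.floor(scanCostInTerms * maxOverhead);
    }

    /** True once the walk has used its budget without finding enough documents. */
    static boolean shouldFallBack(long termsWalked, int found, int wanted, long budget) {
        return found < wanted && termsWalked >= budget;
    }
}
```

With the numbers above, a search with 339 matches would get a budget of
423 terms and fall back long before reaching 292,911, while a search
with 153,142 matches would get a budget of 191,427 and finish its 288
steps well inside it.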

Hope this helps,
