lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: To Sort or not to Sort
Date Fri, 17 Dec 2004 01:02:22 GMT
To squeeze the best performance out of your first option, be sure to 
index your date field as a numeric (i.e. literally Long.toString, or 
Integer.toString value).  Then when you sort, be sure to specify the 
field type.  Sorting by String's is the most expensive, but numerics 
use far less resources.

I always recommend the simplest approach first - using the Sort 
facility in this case.  If that is good enough then move on to other 
things :)


On Dec 16, 2004, at 7:07 PM, Scott Smith wrote:

> I'm hoping someone has an opinion (based on some experience) as to how 
> I
> might approach a design I'm doing with Lucene.
> In my application, users search for messages with Lucene.  Typically,
> they are more interested in seeing their hits in date-order than in
> relevance-order.  In reading my ebook copy of "Lucene in action" (wish
> I'd had that a year ago), I find that one of the features added in 1.4
> was the ability to ask for hits in an order based on a field.  It also
> looks like adding the field necessary to get things by date order is
> straight forward.
> But, for my browser-based application I think there is another
> consideration.  Users will typically page through the messages 20-50 at
> a time and often they will only look through the first few "pages" of
> messages and then be done.  So, I think there are two possible designs.
> 1.	Simply use the built-in lucene sort functionality, cache the hit
> list and then page through the list.  Adv: looks pretty straight
> forward, I write less code.  Dis: for searches that return a large
> number of hits (having a search return several hundred to a few 
> thousand
> hits is not uncommon), Lucene is sorting a lot of entries that don't
> really need to be sorted (because the user will never look at them) and
> sorting tends to be expensive.
> 2.	The other solution uses a priority heap to collect the top N (or
> next N) entries.  I still have to walk the entire hit list, but keeping
> entries in a priority heap means I can determine the N entries I need
> with a few comparisons and minimal sorting.  I don't have to sort a
> bunch of entries whose order I don't care about.  Additionally, I don't
> have to have all of the entries in memory at one time.  The big
> disadvantage with this is that I have to write more code.  However, it
> may be worth it if the performance difference is large enough.
> This may be one of those questions where the only answer is "code it
> both ways and do speed trials".  I was just wondering if anyone had
> enough experience with either method to offer an opinion.  Are there
> things the Lucene sort is doing "under the covers" that will make it's
> ability to sort much faster than what I can do with the hit lists since
> I still have to force the IndexSearch object to retrieve all of the
> Documents in the hits list?
>  Opinions?
> Scott

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message