lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: efficient refinement, order by and range queries
Date Wed, 24 Dec 2003 02:00:38 GMT

You've done quite a thorough analysis of Lucene.  I'll reply below with 
a few tidbits of Lucene trivia in hopes that will help....

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
> One of our applications is a catalog search
> application.    In this application our documents are
> catalog items.   Each item has a number of
> fields/attributes associated with it.   For example
> Supplier, Part number, Price, Description.   We use a
> search metaphor where end-users iterate issuing
> queries and getting feedback about what's available.
> So initially we may tell them that 600,000 items are
> available from  95 suppliers, and who those suppliers
> are.   They may choose to do a free text search for
> the phrase "blue pen".  The result of that query may
> be to tell them that there's 240 items available from
> 2 suppliers which match that phrase, and who those
> suppliers are.  They may pick one of the suppliers to
> see the list of "blue pens" available from that
> supplier.

To accomplish "search within search", or "search refinement", using a 
QueryFilter will do very nicely.

> In addition to wanting the set of attribute values
> found in the result documents we would also want to
> return counts of the number of documents each
> attribute value occurs in in the result document set.

Again, I think a QueryFilter can work well.  There are surely several 
ways to go about getting the number of documents in each bucket - 
perhaps additional queries should be made to give you those numbers, or 
perhaps walking the returned documents to get the unique values.  
Walking the documents could be expensive performance-wise though.  
Doing some sub-queries would be quite fast though.

> Efficient range queries.
> application) it's important to have some support for
> this.    The trick here is that the criteria may be
> very open ended.   For example all items with price
> greater than $10 might involve tens of thousands of
> prices.

One suggestion I've seen posted is during indexing to use an additional 
field as a "group".  In this case, it would be a price range group.  
Say "A" means $0 - $10, "B" for $10 - $100, "C" for $100+, for example. 
  Then you would only have a few terms in that field and a query would 
be quite fast.  The drawback is that you need to know at index-time 
what the groups are.

A custom range Filter is another option - and could be created at 
runtime and kept around and only recreated when the index is modified.  
Look at the built-in DateFilter for an example to work with.  This is a 
more pleasant option than doing a RangeQuery when the number of terms 
in the range is large.

> Order by attributes.
> We need the ability to order the document results set
> by a pre-defined set of numeric attributes and would
> like the ability to order on alphabetic attributes as
> well.

This is an area where Lucene falls short.  My best suggestion is to do 
the sorting yourself, which would require getting at all the documents 
in Hits, which for a large collection would be unreasonable.  There are 
tricks that can be played with boosting during indexing where you can 
tier the boosts of a field in order - but this is really only a hint to 
the scorer to factor the order into the equation but there are many 
other factors.

I'm afraid there is no easy solution here, that I'm aware of.

> I have resources for code development and consider it
> to be in Ariba's best interest to contribute any code
> that we write in this area with the entire community.
> Our time frame is to develop a proto-type in the next
> couple of months for proof of concept and
> benchmarking.

Excellent!  We hope that we can get Lucene under the covers of your 
products - please continue to post to us with more questions and 
hopefully eventually code improvements!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message