From: Erik Hatcher <erik@ehatchersolutions.com>
Subject: Re: efficient refinement, order by and range queries
Date: Tue, 23 Dec 2003 21:00:38 -0500
To: "Lucene Users List" <lucene-user@jakarta.apache.org>

Geoffrey,

You've done quite a thorough analysis of Lucene.  I'll reply below with
a few tidbits of Lucene trivia in the hope that they will help.

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
> One of our applications is a catalog search
> application.    In this application our documents are
> catalog items.   Each item has a number of
> fields/attributes associated with it.   For example
> Supplier, Part number, Price, Description.   We use a
> search metaphor where end-users iterate issuing
> queries and getting feedback about what's available.
> So initially we may tell them that 600,000 items are
> available from  95 suppliers, and who those suppliers
> are.   They may choose to do a free text search for
> the phrase "blue pen".  The result of that query may
> be to tell them that there's 240 items available from
> 2 suppliers which match that phrase, and who those
> suppliers are.  They may pick one of the suppliers to
> see the list of "blue pens" available from that
> supplier.

A QueryFilter will do very nicely for "search within search", or
"search refinement": wrap the earlier query in a QueryFilter and pass
that filter along with the new query when you search.
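
As a rough sketch of what a filter-based refinement amounts to: a
Filter hands back a java.util.BitSet with one bit per document, so
refining a search boils down to intersecting the bits of the earlier
query with the matches of the new one.  The class and method names
below are made up for illustration, and the code uses only the JDK, not
Lucene itself.

```java
import java.util.BitSet;

// Illustrative only: models "search within search" as a bit set intersection.
// Each BitSet holds one bit per document number, set when the document matches.
class Refinement {
    // Returns the documents that match both the earlier search and the new one.
    static BitSet refine(BitSet previousMatches, BitSet newMatches) {
        // Clone first so the (possibly cached) bits of the earlier search stay intact.
        BitSet refined = (BitSet) previousMatches.clone();
        refined.and(newMatches);
        return refined;
    }
}
```

Cloning before the intersection matters if the earlier bits are kept
around for further refinements, which is the whole point of caching
them.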

> In addition to wanting the set of attribute values
> found in the result documents we would also want to
> return counts of the number of documents each
> attribute value occurs in in the result document set.

Again, I think a QueryFilter can work well.  There are several ways to
get the number of documents in each bucket: you could issue an
additional query per attribute value, or walk the returned documents
and collect the unique values.  Walking the documents could be
expensive for a large result set, whereas the sub-queries should be
quite fast.
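
One way to picture the sub-query approach, assuming you keep one set of
bits per attribute value (per supplier, say) next to the bits of the
user's query: the count for a bucket is the number of bits the two have
in common.  This is only a sketch in plain Java; BucketCounts and its
parameters are hypothetical names, not part of Lucene.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: counts how many result documents fall into each bucket.
class BucketCounts {
    // resultBits: documents matching the user's query.
    // valueBits:  for each attribute value, the documents that carry that value.
    static Map<String, Integer> count(BitSet resultBits, Map<String, BitSet> valueBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : valueBits.entrySet()) {
            BitSet overlap = (BitSet) entry.getValue().clone();
            overlap.and(resultBits);               // documents in both sets
            int n = overlap.cardinality();
            if (n > 0) {
                counts.put(entry.getKey(), n);     // omit values absent from the results
            }
        }
        return counts;
    }
}
```

With 95 suppliers that is 95 cheap intersections per search, and the
per-supplier bits only need rebuilding when the index changes.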

> Efficient range queries.
>
> application) it's important to have some support for
> this.    The trick here is that the criteria may be
> very open ended.   For example all items with price
> greater than $10 might involve tens of thousands of
> prices.

One suggestion I've seen posted is to add an extra field at indexing
time that acts as a "group".  In this case, it would be a price range
group: say "A" for $0 up to $10, "B" for $10 up to $100, and "C" for
$100 and above.  That field would then contain only a few distinct
terms, so a query against it would be quite fast.  The drawback is that
you need to know at index-time what the groups are.
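
A minimal sketch of the grouping step, using the example boundaries
above.  PriceGroups and groupFor are hypothetical names, and prices are
assumed to be held as whole cents; the returned code is what you would
index as an extra, untokenized field on each item.

```java
// Illustrative only: maps a price to the group code stored in the extra field.
class PriceGroups {
    // priceInCents is assumed to be a non-negative amount in cents.
    static String groupFor(long priceInCents) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("negative price: " + priceInCents);
        }
        if (priceInCents < 1000) {
            return "A";   // $0 up to, but not including, $10
        }
        if (priceInCents < 10000) {
            return "B";   // $10 up to, but not including, $100
        }
        return "C";       // $100 and above
    }
}
```

A search for items priced at $10 or more then becomes a query for
group "B" or group "C", two terms instead of tens of thousands.  It
only works for thresholds that line up with the group boundaries.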

A custom range Filter is another option.  It could be created at
runtime, kept around, and recreated only when the index is modified.
Look at the built-in DateFilter for an example to work from.  This is a
more pleasant option than doing a RangeQuery when the number of terms
in the range is large, since a RangeQuery expands into one clause per
matching term.
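
To give a feel for the shape of such a filter, here is a sketch in
plain Java rather than against the real Filter class.  It assumes
prices are indexed as zero-padded strings so that term order matches
numeric order, and it stands in for the index with a sorted map from
price term to document numbers plus a version number that changes
whenever the index does.  All names here are hypothetical.

```java
import java.util.BitSet;
import java.util.SortedMap;

// Illustrative only: an open-ended "price at or above X" filter that caches its bits.
class PriceRangeFilter {
    private final String lowerTerm;
    private BitSet cachedBits;
    private long cachedVersion = -1;

    PriceRangeFilter(long minCents) {
        this.lowerTerm = pad(minCents);
    }

    // Zero-pad so that lexicographic term order equals numeric order
    // (assumes non-negative amounts of at most ten digits).
    static String pad(long cents) {
        return String.format("%010d", cents);
    }

    // postings: price term -> document numbers, sorted by term.
    // indexVersion: assumed to change whenever the index is modified.
    BitSet bits(SortedMap<String, int[]> postings, long indexVersion, int maxDoc) {
        if (cachedBits == null || cachedVersion != indexVersion) {
            BitSet bits = new BitSet(maxDoc);
            // tailMap is inclusive: every term at or above the lower bound.
            for (int[] docs : postings.tailMap(lowerTerm).values()) {
                for (int doc : docs) {
                    bits.set(doc);
                }
            }
            cachedBits = bits;
            cachedVersion = indexVersion;
        }
        return cachedBits;
    }
}
```

The walk over tens of thousands of price terms happens once per index
version; every search after that reuses the cached bits.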


> Order by attributes.
>
> We need the ability to order the document results set
> by a pre-defined set of numeric attributes and would
> like the ability to order on alphabetic attributes as
> well.

This is an area where Lucene falls short.  My best suggestion is to do
the sorting yourself, which would require getting at all the documents
in Hits; for a large collection that would be unreasonable.  There are
tricks that can be played with boosting during indexing, where you tier
the boosts of a field in the desired order.  That is really only a hint
to the scorer, though: the boost is just one of many factors in the
score, so the order is not guaranteed.
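
If you do sort yourself, something along these lines is what I have in
mind.  It is a sketch under a big assumption: that you have already
loaded the numeric attribute into an array indexed by document number,
so sorting does not have to fetch each stored document.  SortHits and
its parameters are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: orders matching document numbers by a numeric attribute.
class SortHits {
    // hitDocs:    document numbers that matched the query.
    // valueByDoc: the sort attribute (e.g. price in cents) for every document number.
    static int[] sortByField(int[] hitDocs, long[] valueByDoc) {
        return Arrays.stream(hitDocs)
                .boxed()
                // Ascending by attribute; ties keep their original (relevance) order.
                .sorted(Comparator.comparingLong((Integer doc) -> valueByDoc[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

The cost is memory for one value per document in the index, plus a
sort over every hit, which is why this gets painful for large result
sets.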

I'm afraid there is no easy solution here that I'm aware of.

> I have resources for code development and consider it
> to be in Ariba's best interest to contribute any code
> that we write in this area with the entire community.
> Our time frame is to develop a proto-type in the next
> couple of months for proof of concept and
> benchmarking.

Excellent!  We hope that we can get Lucene under the covers of your 
products - please continue to post to us with more questions and 
hopefully eventually code improvements!

	Erik

