From: Erik Hatcher
Subject: Re: efficient refinement, order by and range queries
Date: Tue, 23 Dec 2003 21:00:38 -0500
To: "Lucene Users List"

Geoffrey,

You've done quite a thorough analysis of Lucene. I'll reply below with a few tidbits of Lucene trivia in hopes that will help....

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
> One of our applications is a catalog search
> application.
> In this application our documents are
> catalog items. Each item has a number of
> fields/attributes associated with it. For example
> Supplier, Part number, Price, Description. We use a
> search metaphor where end-users iterate issuing
> queries and getting feedback about what's available.
> So initially we may tell them that 600,000 items are
> available from 95 suppliers, and who those suppliers
> are. They may choose to do a free text search for
> the phrase "blue pen". The result of that query may
> be to tell them that there's 240 items available from
> 2 suppliers which match that phrase, and who those
> suppliers are. They may pick one of the suppliers to
> see the list of "blue pens" available from that
> supplier.

To accomplish "search within search", or "search refinement", using a QueryFilter will do very nicely.

> In addition to wanting the set of attribute values
> found in the result documents we would also want to
> return counts of the number of documents each
> attribute value occurs in in the result document set.

Again, I think a QueryFilter can work well. There are surely several ways to go about getting the number of documents in each bucket - perhaps additional queries should be made to give you those numbers, or perhaps walking the returned documents to get the unique values. Walking the documents could be expensive performance-wise, though, whereas doing some sub-queries would be quite fast.

> Efficient range queries.
>
> application) it's important to have some support for
> this. The trick here is that the criteria may be
> very open ended. For example all items with price
> greater than $10 might involve tens of thousands of
> prices.

One suggestion I've seen posted is during indexing to use an additional field as a "group". In this case, it would be a price range group. Say "A" means $0 - $10, "B" for $10 - $100, "C" for $100+, for example. Then you would only have a few terms in that field and a query would be quite fast.
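A minimal sketch of the group idea, in plain Java with no Lucene calls. The class name, the method name, the use of cents, and the choice to treat each lower bound as inclusive are all assumptions for illustration; the boundaries are just the example groups above.

```java
public class PriceGroup {
    // Hypothetical bucket boundaries in cents, following the example groups:
    // "A" = $0 up to (not including) $10, "B" = $10 up to (not including) $100,
    // "C" = $100 and above. Inclusive lower bounds are an assumption.
    private static final long TEN_DOLLARS = 1_000L;
    private static final long HUNDRED_DOLLARS = 10_000L;

    /** Returns the group term to index alongside a price given in cents. */
    public static String groupFor(long priceInCents) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("price must not be negative");
        }
        if (priceInCents < TEN_DOLLARS) {
            return "A";
        }
        if (priceInCents < HUNDRED_DOLLARS) {
            return "B";
        }
        return "C";
    }

    public static void main(String[] args) {
        System.out.println(groupFor(599));    // $5.99   -> A
        System.out.println(groupFor(1_000));  // $10.00  -> B
        System.out.println(groupFor(25_000)); // $250.00 -> C
    }
}
```

At index time the returned letter would go into an extra untokenized field on each item. A criterion like "price of $10 or more" then becomes a query on the two terms "B" and "C" rather than a range over every distinct price.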
The drawback is that you need to know at index-time what the groups are.

A custom range Filter is another option - it could be created at runtime, kept around, and only recreated when the index is modified. Look at the built-in DateFilter for an example to work with. This is a more pleasant option than doing a RangeQuery when the number of terms in the range is large.

> Order by attributes.
>
> We need the ability to order the document results set
> by a pre-defined set of numeric attributes and would
> like the ability to order on alphabetic attributes as
> well.

This is an area where Lucene falls short. My best suggestion is to do the sorting yourself, which would require getting at all the documents in Hits, which for a large collection would be unreasonable. There are tricks that can be played with boosting during indexing where you can tier the boosts of a field in order - but this is really only a hint to the scorer to factor the order into the equation, and there are many other factors. I'm afraid there is no easy solution here that I'm aware of.

> I have resources for code development and consider it
> to be in Ariba's best interest to contribute any code
> that we write in this area with the entire community.
> Our time frame is to develop a prototype in the next
> couple of months for proof of concept and
> benchmarking.

Excellent! We hope that we can get Lucene under the covers of your products - please continue to post to us with more questions and hopefully eventually code improvements!

	Erik