From: "Chuck Williams"
To: "Lucene Users List", "Venkatraju"
Subject: RE: Lucene - index fields design question
Date: Tue, 16 Nov 2004 11:29:45 -0800

I do most of these same things and made these relevant design decisions:

1. Use a combination of query expansion to search across multiple fields and field concatenation to create document fields that combine separate object fields.
I use multiple fields only when it is important to weight them differently. E.g., in my case the separate fields are combined into just title and body document fields for general term searching. I expand queries (with my own expander after parsing) by rewriting queries against the "default" field into an OR across title and body, with title boosted higher than body.

2. One problem with the above concerns scoring (and this is also one of the reasons to use concatenation rather than query expansion as much as possible). Lucene's BooleanQuery uses sum-based scoring for ORs, which is further multiplied by the coord() adjustment (settable in the Similarity). This causes ORs to behave very poorly for the field-expansion case. E.g., if the query is foo bar, and you expand each term into title and body in the simplest way to produce title:foo^4 body:foo title:bar^4 body:bar, then a document with foo in title and bar in body will get the same score as one with foo in both title and body. That is clearly not desired: the first document matches both query terms, the second only one.

There are at least 3 different solutions to this problem discussed on this list. I wrote my own MaxDisjunctionQuery just to handle this case: it uses max instead of sum for this kind of OR query, and it does not use coord(). So use MaxDisjunctionQuery for the ORs of the same term (or other query) across multiple fields, and a regular BooleanQuery to OR together the different terms or other queries. Paul Elschot wrote a more general DisjunctionQuery that can be configured to do the same thing. Doug Cutting came up with a solution that does not require a new Query class; his solution expands the query in a certain way and specializes certain existing methods. You should be able to find these solutions by searching the archive (e.g., search for MaxDisjunctionQuery and DisjunctionQuery and read the threads). Code is posted in one form or another.

3. RangeQuery is the way to do your date ranges, or any other ranges. The encodings need to be lexicographic, not integer.
E.g., as strings 10 precedes 2, so pad with leading 0's (02 precedes 10). If you need negatives or floats, you need additional considerations to ensure consistency with lexicographic order (invert the order of negatives and use a sign representation such that the positive sign indicator sorts after the negative sign indicator; floats require nothing special so long as the integer portion is fixed length). Dates encode naturally. I add extra fields, such as those used for range searches, to the Lucene documents in addition to title and body. There are numerous messages on the list that discuss details of this, and there is a link to the web site that goes through a complete example, including showing how to specialize the query parser if you want users entering RangeQuery's in Lucene syntax (either way you have to lexicographically encode both the queries and the document fields you index).

If you have more specific questions or cannot find the references, please just ask.

Good luck,

Chuck

> -----Original Message-----
> From: Venkatraju [mailto:venkatraju@gmail.com]
> Sent: Tuesday, November 16, 2004 9:51 AM
> To: lucene-user
> Subject: Lucene - index fields design question
>
> Hi,
>
> I am a new user of Lucene, so please point me to
> documentation/archives if these issues have been covered before.
>
> I plan to use Lucene in an application with the following (fairly
> standard) requirements:
> - Index documents that contain a title, author, date and content
> - It is fairly common to search for some text across all the fields
> - Matches in the title field should be given more weightage over
> matches in the content field
> - Provide an option to restrict search to documents within a date range
>
> Given these requirements, what is a good index design with search speed
> in mind?
> Documents will have fields "title", "author", "date" and "content".
> Should I make title and author part of the "content" as well so that
> "search across all fields" will just become a search in "content"
> field? If so, how do I give more weightage to matches in "title"
> field?
>
> The other option would be to expand a simple query to include searches
> across all fields.
> Ex.: Expand "abcd" to "title:abcd^4 OR content:abcd". Also, should the
> boost for title field be applied in the query or is it better to
> provide a boost to the title field during indexing (is that possible)?
> Which of these options will work and be more efficient?
>
> For date range limited search, can field values be integers? If not,
> encoding the date as "YYYYMMDDHHMM" and then using a filter or a
> RangeQuery - is that the way to do this?
>
> Thanks,
> Venkat
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
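
As a rough illustration of points 2 and 3 in the reply above, here is a toy sketch in Python (chosen only for brevity; it does not use any Lucene API). The first half models the foo bar example with a deliberately simplified score, where a clause is worth its boost if the term occurs in the field and 0 otherwise; real Lucene scoring also involves tf, idf and length norms, so the numbers are illustrative only. The second half shows zero-padded and date encodings whose string order matches their natural order. All names here (BOOSTS, clause_score, sum_score, max_score, encode_int, encode_date) are hypothetical helpers invented for this sketch.

```python
from datetime import datetime

# Hypothetical boosts, mirroring title^4 in the example query.
BOOSTS = {"title": 4.0, "body": 1.0}


def clause_score(doc, field, term):
    """Toy score: the field's boost if the term occurs in it, else 0."""
    return BOOSTS[field] if term in doc.get(field, "").split() else 0.0


def sum_score(doc, terms):
    """Flat OR over all field:term clauses: sum, times coord (matched/total)."""
    scores = [clause_score(doc, f, t) for t in terms for f in BOOSTS]
    matched = sum(1 for s in scores if s > 0)
    return sum(scores) * matched / len(scores)


def max_score(doc, terms):
    """Max across fields for each term (no coord there), then a summed OR
    over the terms with coord counted per term."""
    per_term = [max(clause_score(doc, f, t) for f in BOOSTS) for t in terms]
    matched = sum(1 for s in per_term if s > 0)
    return sum(per_term) * matched / len(per_term)


def encode_int(n, width=10):
    """Zero-pad a non-negative integer so string order equals numeric order.
    Negatives and floats need the extra care described in point 3."""
    if n < 0 or len(str(n)) > width:
        raise ValueError("value out of range for this sketch")
    return str(n).zfill(width)


def encode_date(dt):
    """Encode a datetime as YYYYMMDDHHMM, which already sorts correctly."""
    return dt.strftime("%Y%m%d%H%M")


if __name__ == "__main__":
    terms = ["foo", "bar"]
    both_terms = {"title": "foo", "body": "bar"}  # matches foo and bar
    one_term = {"title": "foo", "body": "foo"}    # matches foo only
    print(sum_score(both_terms, terms), sum_score(one_term, terms))  # 2.5 2.5
    print(max_score(both_terms, terms), max_score(one_term, terms))  # 5.0 2.0
    print(sorted(["10", "2"]))                          # ['10', '2']
    print(sorted(encode_int(n) for n in (10, 2)))       # padded: 2 before 10
    print(encode_date(datetime(2004, 11, 16, 11, 29)))  # 200411161129
```

In this toy model the flat sum cannot tell the two documents apart, while taking the max per term ranks the document that matches both terms higher. The same padded encoding has to be applied both to the indexed field values and to the range bounds in the query.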