From: Erik Hatcher
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: Re: identifier field as keyword or unindexed
Date: Wed, 9 Mar 2005 14:43:58 -0500
Message-Id: <8560cb322eee463457f307300710e4f5@ehatchersolutions.com>

On Mar 9, 2005, at 10:09 AM, javier muguruza wrote:

> (I sent this to the old list, I don't know whether it reached the
> list... just in case, I repost it)
>
> Hi all,
>
> We index our documents in the following way:
>
> doc = new Document();
> // mailid
> doc.add(Field.UnIndexed("mid",mid));
> //body
> doc.add(Field.UnStored("body", textb));
>
> mid is a unique identifier, and body contains long pieces of text to
> be indexed.
>
> And later make searches on the body field, the mid allows us to find a
> file on the filesystem with a compressed (and digitally signed)
> version of the original body indexed.
> Our way to work in a query in our app is this:
> 1. first we make a search in a db (for many different reasons) that
> returns a number (from 0 to thousands) of mid
> 2. we use lucene to search for some text in many indexes, this returns
> a second list of mid
> 3. we return the result as the intersection of both lists.
>
> This is working fine right now, but wonder whether we are not using
> lucene to the fullest, cause we could also store mid as a keyword
> (instead of unindexed), and add the condition (AND mid==[any mid from
> our step 1]) to the lucene query we run. My questions are:
>
> 1. Is there a limit in the number of conditions I can add to a query??
> Sometimes we have 10 mids, other times we have thousands of them so we
> would have to add: AND (mid:mid1 OR mid:mid2 ... OR mid:mid10000).
> Probably there is a limit, and we could only apply the mid conditions
> when the number of mids returned by step 1 is smaller than that limit?

BooleanQuery has a built-in limit of 1,024 clauses, so it would only be
useful when there is a small number of mids. Consider using a Filter
though. There are some built-in ones, but maybe a custom one is best.

> 2. As the mid is a unique identifier (I guess lucene does not care
> about that right?)

Right, Lucene doesn't care about field/term uniqueness.
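
The clause limit in the reply above suggests a simple guard: add the mid conditions only when they fit, and otherwise fall back to a Filter or to the existing list intersection. The following is a minimal sketch of that guard, not Lucene code. MidClausePlanner, fitsInQuery and buildQuery are hypothetical names, the limit is treated as inclusive (an assumption), and the mids are assumed to be simple tokens that need no query-parser escaping.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Lucene: decides whether the mids fit in one query. */
final class MidClausePlanner {
    /** The BooleanQuery clause limit quoted in the reply (assumed inclusive here). */
    static final int MAX_CLAUSES = 1024;

    /** True when every mid can become its own OR clause without exceeding the limit. */
    static boolean fitsInQuery(List<String> mids) {
        return !mids.isEmpty() && mids.size() <= MAX_CLAUSES;
    }

    /**
     * Builds "+(textQuery) +(mid:a OR mid:b ...)" in query-parser syntax.
     * Returns null when the mids do not fit, so the caller can fall back
     * to a Filter or to intersecting the two mid lists as before.
     */
    static String buildQuery(String textQuery, List<String> mids) {
        if (!fitsInQuery(mids)) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append("+(").append(textQuery).append(") +(");
        for (int i = 0; i < mids.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append("mid:").append(mids.get(i));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // prints: +(body:invoice) +(mid:m1 OR mid:m2)
        System.out.println(buildQuery("body:invoice", Arrays.asList("m1", "m2")));
    }
}
```

The mid group is kept in its own parenthesized clause because the limit applies to the clauses of a single boolean group, so only the number of mids needs checking.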
> , and the condition on the mid would be ANDed to the
> text query conditions, will it be faster for lucene to look first in
> the mid field and not do the text lookup if the mid condition is not
> fulfilled? I don't know whether I am clear enough... Will I get some
> benefit on the queries by adding some additional conditions or the
> cost of adding another field to index will not pay off? Maybe it
> depends on the number of documents? Maybe it would be best to set mid
> as a keyword just in case, and add it as conditions later if the
> searches take too long?

I doubt you'd even notice the difference. There is little cost to
adding the additional field, and it looks like you'd benefit from having
mid as a Keyword.

Also, with a Filter, you could use it to bounce to your relational
database to constrain results based on a set of mids. Filters are
designed to be used for multiple queries and cached - keep that in mind
and maybe it'll work out well in your scenario.

Erik

>
> thanks for any thought on that
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
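
The Filter suggestion above boils down to a set of allowed document numbers that is computed once from the database mids and then reused across text queries. The sketch below models only that idea with java.util.BitSet and uses no Lucene classes. MidFilterSketch, allowedDocs and keepAllowed are hypothetical names, and the midByDoc array is an assumed stand-in for reading the mid Keyword field of each document from the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a custom filter; it does not use the Lucene API. */
final class MidFilterSketch {

    /**
     * Sets one bit per document whose mid is in the allowed set.
     * midByDoc[i] is the mid of document number i (an assumed stand-in
     * for looking up the indexed mid field).
     */
    static BitSet allowedDocs(String[] midByDoc, Set<String> allowedMids) {
        BitSet bits = new BitSet(midByDoc.length);
        for (int doc = 0; doc < midByDoc.length; doc++) {
            if (allowedMids.contains(midByDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Keeps only the text-search hits whose bit is set; the BitSet can be cached and reused. */
    static List<Integer> keepAllowed(int[] textHits, BitSet allowed) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int doc : textHits) {
            if (allowed.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] midByDoc = {"m1", "m2", "m3", "m4"};
        // mids returned by the database step; "m9" matches no document
        Set<String> fromDb = new HashSet<String>(Arrays.asList("m2", "m4", "m9"));
        BitSet allowed = allowedDocs(midByDoc, fromDb);
        // documents 0, 1 and 3 matched the text query; prints [1, 3]
        System.out.println(keepAllowed(new int[] {0, 1, 3}, allowed));
    }
}
```

Unlike the OR-clause approach, the size of the mid set does not run into the clause limit here, and the same BitSet serves every text query issued against the same database result.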