Date: Sun, 9 Oct 2005 17:42:41 -0700 (PDT)
From: Chris Hostetter
To: java-dev@lucene.apache.org
Subject: Re: Adding information to an index

: I'm looking to store some additional information in a Lucene index
: and I'm looking for an advise on how to implement the
: functionality.
: Specifically, I'm planning to store 1) collection frequency count for
: each term, 2) actual document length for each document (yes, I looked
: at the norm factor, I'm still considering how to adapt it...) 3)
: collection size (total number of terms) for each field 4) vocabulary
: size (number of unique terms) for each field. All this info can be
: computed on the fly, but I would prefer to generate it at the
: indexing time and store somewhere.

Unless I'm misunderstanding your terminology, it seems like all of this
information is either already stored in the index, or easy to add using
the existing API.

#1 - Searchable.docFreq(Term):int

#2 - add as a new field per document.

#3 & #4 ...

...these are a little trickier. You can easily get both by iterating over
IndexReader.terms(), but if you specifically want to store the data in the
index, I would first add all of your documents, then use the TermEnum to
compute the information and put it all as stored fields in a single
"metadata" document with no indexed fields (or at least: none in common
with your regular data).

Now you've precomputed everything you want to know, and it's easily
available at query time.

-Hoss
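
To pin down the terminology in the quoted question, here is a minimal plain-Java sketch of the four statistics for a single field. It does not use Lucene at all: the class FieldStats and every method name in it are hypothetical, and documents are assumed to arrive as already-tokenized lists of strings. It also tracks document frequency (the number of documents containing a term, which is what docFreq reports) next to collection frequency (the total number of occurrences), since the two can differ when a term repeats within a document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: per-field statistics over tokenized documents. */
final class FieldStats {
    // term -> total occurrences across all documents (collection frequency, #1)
    private final Map<String, Integer> collectionFreq = new HashMap<>();
    // term -> number of documents containing the term (document frequency)
    private final Map<String, Integer> docFreq = new HashMap<>();
    // token count of each document, indexed by insertion order (#2)
    private final List<Integer> docLengths = new ArrayList<>();
    // total number of tokens seen in this field (#3)
    private long collectionSize = 0;

    /** Adds one document, given as the list of tokens in this field. */
    void addDocument(List<String> tokens) {
        docLengths.add(tokens.size());
        collectionSize += tokens.size();
        Set<String> seenInThisDoc = new HashSet<>();
        for (String term : tokens) {
            collectionFreq.merge(term, 1, Integer::sum);
            // Count each term at most once per document for document frequency.
            if (seenInThisDoc.add(term)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
    }

    int collectionFrequency(String term) {
        return collectionFreq.getOrDefault(term, 0);
    }

    int documentFrequency(String term) {
        return docFreq.getOrDefault(term, 0);
    }

    int documentLength(int docId) {
        return docLengths.get(docId);
    }

    long collectionSize() {
        return collectionSize;
    }

    /** Number of unique terms in this field (#4). */
    int vocabularySize() {
        return collectionFreq.size();
    }
}
```

With one document tokenized as "lucene index lucene" and a second as "index term", the collection frequency of "lucene" is 2 while its document frequency is 1. Values computed this way after indexing could then be written out as stored fields of the single "metadata" document suggested above.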