From: Gerret Apelt
Date: Thu, 30 Oct 2003 15:44:17 +1300
To: Lucene Users List
Subject: Re: term counts during indexing

Peter Keegan wrote:

>Is there a simple and efficient way of determining the number of tokens added
>to a document after adding each
>field ('Document.add'), as a result of the actions of the Analyzer,
>without having to re-parse the field?

Peter -- you can ask the Document instance.

    import java.util.Enumeration;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = getDocumentInstanceFromSomewhere();
    int termCount = 0;
    Enumeration fields = doc.fields();
    while (fields.hasMoreElements()) {
        Field field = (Field) fields.nextElement();
        String fieldName = field.name();
        // getValues() returns the stored values held under this field name
        String[] fieldTerms = doc.getValues(fieldName);
        termCount += fieldTerms.length;
    }
    System.out.println("The fields of the document together contain "
        + termCount + " terms.");

Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == true.
   If the field isn't stored in the index, then you don't have a choice
   but to go back to the source document.

[not sure on the following, so please correct me if in error:]
Remember that unStored fields are indexed, so you can query on them, but
the field terms themselves are not stored in the index. Therefore you
cannot count them by asking Lucene. A Lucene field instance also has no
way to reference the source of the terms that are added to it. The field
doesn't care where its terms came from. So if field.isStored() == false,
then Lucene cannot tell you how many terms that particular field
contains. In this case you'll have to write your own code that analyzes
the original data source.

>Alternatively, is there a way to determine the number of tokens added after
>adding the document to the index ('IndexWriter.addDocument')?

Whether you want the termCount for a document before or after you add
the document to the index doesn't matter, so the answer is "see above".

cheers,
Gerret
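
For the unstored case, where the advice above is to analyze the original data yourself, here is a minimal sketch of what that could look like. It does not use Lucene at all: the class name TokenCounter and the method countTokens are made up for illustration, and the splitting rule (break on anything that is not a letter or digit) is only an assumed stand-in for whatever Analyzer is used at indexing time. A real Analyzer may lowercase, drop stop words, or stem, so the counts could differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class TokenCounter {

    // Counts tokens in one field's raw text.
    // Assumption: a token is a maximal run of letters and digits.
    static int countTokens(String text) {
        int count = 0;
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first element when the text
            // starts with a delimiter, so skip empty strings.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Sums the token counts over all fields (field name -> raw text).
    static int countTokens(Map<String, String> fields) {
        int total = 0;
        for (String text : fields.values()) {
            total += countTokens(text);
        }
        return total;
    }

    public static void main(String[] args) {
        // Raw text kept alongside the document before it is indexed.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Term counts during indexing");
        fields.put("body", "Count the tokens yourself, before calling addDocument.");

        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + ": " + countTokens(e.getValue()) + " tokens");
        }
        System.out.println("total: " + countTokens(fields) + " tokens");
    }
}
```

Because the count is taken from the raw text, it can be done either before or after the document is added to the index, as long as the original data is still at hand.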