Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <3FA07B01.8020300@cs.waikato.ac.nz>
Date: Thu, 30 Oct 2003 15:44:17 +1300
From: Gerret Apelt <ga11@cs.waikato.ac.nz>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031023
MIME-Version: 1.0
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: term counts during indexing
References: <20031030000918.4592.qmail@web12703.mail.yahoo.com>
 <004a01c39e7b$0bd8cef0$ce00a8c0@victor>
 <02ed01c39e7c$c53b5880$02a8a8c0@peter>
In-Reply-To: <02ed01c39e7c$c53b5880$02a8a8c0@peter>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Peter Keegan wrote:

>Is there a simple and efficient way of determining the number of tokens
>added to a document after adding each field ('Document.add'), as a result
>of the actions of the Analyzer, without having to re-parse the field?

Peter --

You can ask the Document instance:

Document doc = getDocumentInstanceFromSomewhere();
int termCount = 0;
Enumeration fields = doc.fields();  // java.util.Enumeration
while (fields.hasMoreElements()) {
    Field field = (Field) fields.nextElement();
    String fieldName = field.name();
    // getValues() returns one string per value added under this name
    String[] fieldTerms = doc.getValues(fieldName);
    termCount += fieldTerms.length;
}
System.out.println("The fields of the document together contain "
    + termCount + " terms.");

Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == true.
If the field isn't stored in the index, then you have no choice but
to go back to the original document.

[I'm not sure about the following, so please correct me if I'm wrong:]
Remember that UnStored fields are indexed, so you can query on them, but
the field terms themselves are not stored in the index. Therefore you
cannot count them by asking Lucene. A Lucene field instance also has no
way to reference the source of the terms that are added to it. The field
doesn't care where its terms came from. So if field.isStored() == false,
then for that particular field Lucene cannot tell you how many terms are
in it. You'll have to write your own code that analyzes the original
data source in this case.
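
A minimal sketch of that fallback, in plain Java and not tested against
Lucene: count the tokens of each field's source text yourself before you
hand it to the Document. The class FieldTokenCounter and its methods are
hypothetical names, and the split-on-non-alphanumerics rule is only a
stand-in for whatever your Analyzer actually does (no stop words, no
stemming), so the numbers will match the index only if you swap in the
same tokenization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldTokenCounter {

    // Hypothetical stand-in for an Analyzer: split the text on runs of
    // characters that are neither letters nor digits and count the pieces.
    static int countTokens(String text) {
        int count = 0;
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {  // split() can yield a leading empty string
                count++;
            }
        }
        return count;
    }

    // Token count per field name, kept in insertion order.
    static Map<String, Integer> countPerField(Map<String, String> fieldText) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, String> e : fieldText.entrySet()) {
            counts.put(e.getKey(), countTokens(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Example field names and texts are made up for illustration.
        Map<String, String> fieldText = new LinkedHashMap<String, String>();
        fieldText.put("title", "Term counts during indexing");
        fieldText.put("body", "Is there a simple way to count tokens?");

        int total = 0;
        for (Map.Entry<String, Integer> e : countPerField(fieldText).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " tokens");
            total += e.getValue();
        }
        System.out.println("total: " + total + " tokens");
    }
}
```

Because the counting happens on the source text, it works the same way
whether or not the field ends up stored in the index.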

>Alternatively, is there a way to determine the number of tokens added after
>adding the document to the index ('IndexWriter.addDocument')?

It doesn't matter whether you want the term count for a document before
or after you add it to the index, so the answer is "see above".

cheers,
Gerret
