lucene-java-user mailing list archives

From "Peter Keegan" <peter.kee...@charter.net>
Subject Re: term counts during indexing
Date Thu, 30 Oct 2003 16:52:36 GMT
As I understand it, the field text is being tokenized by the analyzer when
IndexWriter.addDocument is called. At this point, the tokens are indexed
and/or stored. Would it be possible for 'addDocument' to save and make the
_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
the Document or IndexWriter object? I guess I may be turning this into a
feature request :)
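
In the meantime, one workaround would be to count tokens on the application side just before calling addDocument. Below is a minimal sketch of that idea. It assumes a plain whitespace splitter as a stand-in for the real Analyzer, so the numbers will only match the index if the analyzer in use tokenizes the same way. FieldTokenCounter and its methods are hypothetical names, not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper, NOT Lucene code: keeps a per-field token count
// alongside the document that is being built.
class FieldTokenCounter {
    // field name -> number of tokens seen for that field
    private final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();

    // Assumption: whitespace splitting stands in for the Analyzer. A real
    // analyzer may lowercase, drop stop words, or split differently.
    int countField(String fieldName, String text) {
        int n = new StringTokenizer(text).countTokens();
        Integer prev = counts.get(fieldName);
        counts.put(fieldName, (prev == null ? 0 : prev.intValue()) + n);
        return n;
    }

    // Token count recorded for one field (0 if the field was never counted).
    int count(String fieldName) {
        Integer c = counts.get(fieldName);
        return c == null ? 0 : c.intValue();
    }

    // Token count summed over all fields of the document.
    int total() {
        int sum = 0;
        for (Integer c : counts.values()) {
            sum += c.intValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        FieldTokenCounter counter = new FieldTokenCounter();
        counter.countField("title", "term counts during indexing");
        counter.countField("body", "the field text is tokenized by the analyzer");
        System.out.println("title=" + counter.count("title")
                + " body=" + counter.count("body")
                + " total=" + counter.total());
    }
}
```

The drawback is that the text is tokenized twice, once here and once inside addDocument, which is the re-parsing I was hoping to avoid.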

Also, I can't find this method from the code snippet provided by Gerret (I'm
using v1.2):
> String[] fieldTerms = doc.getValues(fieldName);


Thanks,
Peter

----- Original Message ----- 
From: "Gerret Apelt" <ga11@cs.waikato.ac.nz>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, October 29, 2003 9:44 PM
Subject: Re: term counts during indexing


> Peter Keegan wrote:
>
> >Is there a simple and efficient way of determining the number of tokens
> >added to a document after adding each field ('Document.add'), as a result
> >of the actions of the Analyzer, without having to re-parse the field?
>
> Peter --
>
> you can ask the Document instance.
>
> Document doc = getDocumentInstanceFromSomewhere();
> int termCount = 0;
> Enumeration fields = doc.fields();
> while (fields.hasMoreElements()) {
>     Field field = (Field)fields.nextElement();
>     String fieldName = field.name();
>     String[] fieldTerms = doc.getValues(fieldName);
>     termCount += fieldTerms.length;
> }
> System.out.println("The fields of the document together contain "
>     + termCount + " terms.");
>
> Note that
> 1) I haven't tried to compile this code, so I'm not sure if it works
> 2) this will only work for those fields where field.isStored() == true.
> If the field isn't stored in the index, then you don't have a choice but
> to go back to the document.
>
> [not sure on the following, so please correct me if in error:] Remember
> that unStored fields are indexed, so you can query on them, but the
> field terms themselves are not stored in the index. Therefore you cannot
> count them by asking Lucene. A Lucene field instance also has no way to
> reference the source of the terms that are added to it. The field
> doesn't care where its terms came from. So if field.isStored() == false,
> then for that particular field Lucene cannot tell you how many terms are
> in it. You'll have to write your own code that analyzes the original
> data source in this case.
>
> >Alternatively, is there a way to determine the number of tokens added
> >after adding the document to the index ('IndexWriter.addDocument')?
> >
> >
> Whether you want the termCount for a document before or after you add
> the document to the index doesn't matter, so the answer is "see above".
>
> cheers,
> Gerret
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>



