Peter Keegan wrote:
>Is there a simple and efficient way of determining the number of tokens added
>to a document after adding each field ('Document.add'), as a result of the
>actions of the Analyzer, without having to re-parse the field?
Peter --
you can ask the Document instance.
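import java.util.Enumeration;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = getDocumentInstanceFromSomewhere();
int termCount = 0;
Enumeration fields = doc.fields();
while (fields.hasMoreElements()) {
    Field field = (Field) fields.nextElement();
    String fieldName = field.name();
    // getValues returns the string values of the fields with this name; may be null
    String[] fieldTerms = doc.getValues(fieldName);
    if (fieldTerms != null) {
        termCount += fieldTerms.length;
    }
}
System.out.println("The fields of the document together contain "
    + termCount + " terms.");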
Note that:
1) I haven't tried to compile this code, so I'm not sure if it works.
2) This will only work for those fields where field.isStored() == true.
If the field isn't stored in the index, then you don't have a choice but
to go back to the original document source.
[not sure on the following, so please correct me if in error:] Remember
that unStored fields are indexed, so you can query on them, but the
field terms themselves are not stored in the index. Therefore you cannot
count them by asking Lucene. A Lucene field instance also has no way to
reference the source of the terms that are added to it. The field
doesn't care where its terms came from. So if field.isStored() == false,
then for that particular field Lucene cannot tell you how many terms are
in it. You'll have to write your own code that analyzes the original
data source in this case.
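For the unstored case, counting the tokens yourself could look roughly like
the sketch below. It is plain Java with no Lucene dependency:
FieldTokenCounter, countTokens and countDocumentTokens are hypothetical names,
and the split on non-alphanumeric characters is only a stand-in for whatever
Analyzer you actually index with. A real analyzer may drop stop words or stem,
so its counts can differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts tokens in raw field text.
final class FieldTokenCounter {

    // Stand-in for an Analyzer (assumption): a token is any run of letters
    // or digits; everything else is treated as a separator.
    static int countTokens(String text) {
        if (text == null) {
            return 0;
        }
        int count = 0;
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // split can yield an empty leading element; skip it
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Sums the token counts over all fields (name -> original text) of one document.
    static int countDocumentTokens(Map<String, String> fields) {
        int total = 0;
        for (String text : fields.values()) {
            total += countTokens(text);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Counting tokens");
        fields.put("body", "Lucene does not store the terms of unstored fields.");
        System.out.println("The fields of the document together contain "
                + countDocumentTokens(fields) + " tokens.");
    }
}
```

To get counts that match what ends up in the index, you would replace the
split with the same analysis you use at indexing time.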
>Alternatively, is there a way to determine the number of tokens added after
>adding the document to the index ('IndexWriter.addDocument')?
>
>
Whether you want the termCount for a document before or after you add
the document to the index doesn't matter, so the answer is "see above".
cheers,
Gerret