lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: TermFrequencies vector limits?
Date Mon, 21 Nov 2005 20:44:14 GMT

: " By default, no more than 10,000 terms will be
: indexed for a field."
:
: Given your note, then the docs do not mean that no
: more than 10,000 terms will be indexed, but that some
: smaller number of terms will be indexed and only the
: first 10,000 occurrences will be tallied.

It means that by default only the first 10,000 terms of a document are
indexed.  If only 10,000 terms are indexed for any doc, then the sum of
the term frequencies for any single doc should never be more than 10,000.

Your confusion seems to be that you are thinking about indexing "terms"
along with their TermFrequencies - so for you the sentence "Why, Why, Why
oh Why did you go?" has 5 terms -- but in the context of analysis and
maxFieldLength it has a sequence of 8 terms.

If you setMaxFieldLength to 4, then only the first 4 terms of that
sequence would be indexed, and you would wind up seeing just two distinct
terms in your index: "Why" with a termFreq of 3 and "oh" with a termFreq
of 1.
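
To make the counting concrete, here is a rough simulation of that
truncation in plain Java. It is only a sketch, not Lucene's code: the
class TermLimitSketch and the method countTerms are hypothetical names,
and splitting on non-letters is an assumed stand-in for whatever
Analyzer you actually use (no lowercasing, no stop words).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not a Lucene API.
class TermLimitSketch {

    // Counts term frequencies over at most the first maxFieldLength
    // tokens of the text, mimicking the limit described above.
    static Map<String, Integer> countTerms(String text, int maxFieldLength) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        int position = 0;
        // Assumed tokenization: split on runs of non-letter characters.
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue; // skip an empty leading piece, if any
            }
            if (position >= maxFieldLength) {
                break; // everything past the limit is ignored
            }
            freqs.merge(token, 1, Integer::sum);
            position++;
        }
        return freqs;
    }

    public static void main(String[] args) {
        String text = "Why, Why, Why oh Why did you go?";
        // Default-sized limit: all 8 tokens are counted, 5 distinct terms.
        System.out.println(countTerms(text, 10000)); // {Why=4, oh=1, did=1, you=1, go=1}
        // Limit of 4: only "Why Why Why oh" is counted.
        System.out.println(countTerms(text, 4));     // {Why=3, oh=1}
    }
}
```

In both cases the frequencies sum to the number of tokens actually
consumed (8 and 4), which is why the sum for a single doc can never
exceed the limit.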



-Hoss

