lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jochen Frey" <jochen_f...@yahoo.com>
Subject Sentence Endings: IndexWriter.maxFieldLength and Token.setPositionIncrement()
Date Fri, 19 Dec 2003 23:08:36 GMT
Hi!

I hope this is the right forum for this post.

I was wondering if other people would consider this a bug (it might be a
feature and I am missing the point of it):
 
.The default IndexWriter.maxFieldLength is 10,000.
.The point of maxFieldLength is to limit memory usage.
.The current position (which is compared against maxFieldLength) is
essentially determined by the sum of the PositionIncrements of all Tokens
added to the index.

Why does this matter? If you have setPositionIncrement(1000) for sentence
ending tokens, only the first 10 sentences of your document will be indexed,
the rest will not be searchable (since position will be greater than
10,000).

Why I think this is a bug: If you skip 1000 positions, no memory is required
by the DocumentWriter for the empty 999 positions, thus not using
maxFieldLength to limit memory but simply available positions.

I suggest that there be a counter in DocumentWriter, that counts the actual
number of tokens in the postingTable (probably in
DocumentWriter.addPosition), so that maxFieldLength is compared against the
number actual entries, not the number of actual entries and the number
skipped entries.

Best,
	Jochen

PS: Please let me know if this is the wrong forum for this so I'll post to
the right one next time.



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message