lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 18833] - maxFieldLength design flaw: large documents silently truncated
Date Tue, 08 Apr 2003 20:34:13 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18833>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

------- Additional Comments From alex@apache.org  2003-04-08 20:34 -------
Yeah, that "for some classes of use" is a killer, especially for a general-
purpose library like a search engine.  I totally buy your reasoning that you 
shouldn't break existing crawlers.  

At the same time it's disturbing that it's silent.  It caused the team I was 
working with to spend many hours isolating and tracking down the bug until 
someone carefully re-read the documentation for all the Lucene classes we were 
using...

Even if you don't change the implementation for 1.3, it would be excellent to 
document it more clearly in both the field and the addDocument method.

To the Javadoc for IndexWriter.maxFieldLength, I would add "Note that this 
effectively truncates large documents, excluding from the index terms that 
occur late in the document.  If you know your source documents are large, be 
sure to set this value high enough to accommodate the expected size.  If you 
set it to Integer.MAX_VALUE, then the only limit is available memory, so you 
should be prepared for an OutOfMemoryError."

To the Javadoc for IndexWriter.addDocument, I would add "If the document 
contains more than {@link #maxFieldLength} terms for a given field, the 
remainder are discarded."
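
To make the behaviour concrete, here is a minimal sketch of per-field 
truncation in plain Java.  It is not Lucene code: the class 
TruncationSketch, the method indexedTerms, and the whitespace tokenization 
are all hypothetical stand-ins, chosen only to show how terms past the 
limit vanish without any signal to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an indexer; this is NOT Lucene's implementation. */
final class TruncationSketch {

    /**
     * Returns the terms that would be kept if at most maxFieldLength
     * terms were indexed for a single field. Tokenization is a plain
     * whitespace split, which is an assumption made for illustration.
     */
    static List<String> indexedTerms(String fieldText, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String term : fieldText.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty input produces a single empty token
            }
            if (kept.size() >= maxFieldLength) {
                break; // everything after this point is dropped silently
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        // "delta" is past the limit of 3, so it never reaches the index.
        System.out.println(indexedTerms("alpha beta gamma delta", 3));
        // prints [alpha, beta, gamma]
    }
}
```

A search for "delta" against such an index finds nothing, and nothing in 
the return value tells the caller why.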

Also, truncation could be logged when it happens.  Would it be appropriate 
for Lucene to support (or include) Commons Logging?
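
As a rough illustration of what such a warning might look like, the sketch 
below uses java.util.logging from the JDK as a stand-in for Commons 
Logging.  The class TruncationWarningSketch, the method discardedTerms, and 
the message wording are hypothetical, and the whitespace term count is an 
assumption; none of this reflects how Lucene counts terms internally.

```java
import java.util.logging.Logger;

/** Hypothetical sketch of a truncation warning; NOT part of Lucene. */
final class TruncationWarningSketch {

    private static final Logger LOG =
            Logger.getLogger(TruncationWarningSketch.class.getName());

    /**
     * Returns how many terms of a field would be discarded under the
     * given limit, and logs a warning when that number is non-zero.
     * Terms are counted by a plain whitespace split (an assumption).
     */
    static int discardedTerms(String fieldName, String fieldText, int maxFieldLength) {
        String trimmed = fieldText.trim();
        int total = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        int discarded = Math.max(0, total - maxFieldLength);
        if (discarded > 0) {
            LOG.warning("Field '" + fieldName + "' truncated: " + discarded
                    + " of " + total + " terms not indexed (limit "
                    + maxFieldLength + ")");
        }
        return discarded;
    }

    public static void main(String[] args) {
        // Five terms against a limit of three: two are reported as lost.
        discardedTerms("body", "one two three four five", 3);
    }
}
```

A single warning per truncated field would have pointed us at the cause 
right away, without changing the behaviour existing crawlers rely on.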

Thanks for your attention!

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

