lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 18833] - maxFieldLength design flaw: large documents silently truncated
Date Tue, 08 Apr 2003 20:00:12 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18833>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

------- Additional Comments From cutting@apache.org  2003-04-08 20:00 -------
Silent truncation is fairly common in search engines.  For example, Google
silently truncates pages whose HTML is longer than 100kB, around the same point
where Lucene truncates.  Without such a limit, crawlers and file system walkers
would attempt to index things like gigantic log files and binaries.

I do see your point, though: for some classes of use, where the set of documents
is tightly controlled and every single word must be indexed,
this is a problem.  The workaround is simple, although perhaps not obvious.
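
The comment does not spell the workaround out.  As a rough illustration only,
assuming it means raising the maxFieldLength setting named in the bug title,
the sketch below models a per-field term cap in plain Java.  It is not Lucene's
code: the class FieldLengthSketch, the method indexedTerms and the 10,000-term
default are hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical model of a per-field term cap; this is NOT Lucene's code.
class FieldLengthSketch {
    // Assumed default for illustration; check the value in your Lucene version.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10000;

    // Returns the terms that would be indexed: whitespace-separated tokens,
    // taken in order until the cap is reached.
    static List<String> indexedTerms(String text, int maxFieldLength) {
        List<String> terms = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens() && terms.size() < maxFieldLength) {
            terms.add(tokens.nextToken());
        }
        // Tokens past the cap are dropped here without any error or warning.
        return terms;
    }

    public static void main(String[] args) {
        // Build a document whose last word lies beyond the assumed default cap.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 10001; i++) {
            doc.append("filler ");
        }
        doc.append("needle");
        String text = doc.toString();

        // With the default cap, the final word never reaches the index.
        System.out.println(indexedTerms(text, DEFAULT_MAX_FIELD_LENGTH).contains("needle"));
        // With the cap effectively removed, every word is indexed.
        System.out.println(indexedTerms(text, Integer.MAX_VALUE).contains("needle"));
    }
}
```

In this model the only change needed to index every word is passing a larger
limit, which is presumably what makes the real workaround simple but easy to
miss: nothing fails when the cap is hit, so there is no signal that a setting
needs changing.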

My concern with changing the default is that it would break all those folks who
depend on the current setting to keep their indexing from blowing up.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

