Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: 8 Apr 2003 20:00:12 -0000
Message-ID: <20030408200012.3468.qmail@nagoya.betaversion.org>
From: bugzilla@apache.org
To: lucene-dev@jakarta.apache.org
Cc: 
Subject: DO NOT REPLY [Bug 18833]  -
    maxFieldLength design flaw: large documents silently truncated

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18833

maxFieldLength design flaw: large documents silently truncated


------- Additional Comments From cutting@apache.org  2003-04-08 20:00 -------
This is fairly common in search engines.  For example, Google silently truncates
pages whose HTML is longer than 100kB, around the same point where Lucene
truncates.  The limit exists because crawlers and file system walkers would
otherwise attempt to index things like gigantic log files, binaries, etc.

I see your point though that for some classes of use, when the set of documents
is tightly controlled and it is a requirement that every single word is indexed,
this is a problem.  The workaround is simple, although perhaps not obvious:
raise IndexWriter's maxFieldLength (for example to Integer.MAX_VALUE) before
adding documents.
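
To make the truncation concrete, here is a minimal sketch in plain Java.  It is
not Lucene code: the class FieldLimitSketch and the method indexTerms are
hypothetical names that only model a per-field cap, under the assumption that
the cap counts terms rather than bytes.  It shows that terms past the cap are
dropped without any error, and that a very large cap keeps all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class FieldLimitSketch {

    // Hypothetical model of a per-field term cap; this is NOT Lucene's code.
    // Splits on whitespace and stops collecting once the cap is reached.
    static List<String> indexTerms(String text, int maxFieldLength) {
        List<String> terms = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens() && terms.size() < maxFieldLength) {
            terms.add(tokens.nextToken());
        }
        // Remaining tokens are simply ignored: no exception, no warning.
        return terms;
    }

    public static void main(String[] args) {
        // Build a document of 25,000 distinct whitespace-separated words.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 25000; i++) {
            doc.append('w').append(i).append(' ');
        }
        String text = doc.toString();

        // With a cap of 10,000 terms the tail of the document is lost.
        System.out.println(indexTerms(text, 10000).size());

        // With an effectively unlimited cap every term is kept.
        System.out.println(indexTerms(text, Integer.MAX_VALUE).size());
    }
}
```

The trade-off is the one described above: an unlimited cap indexes every word,
but it also removes the guard against very large inputs.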

My concern with changing the default is that it would break all those folks who
depend on the current setting to keep their indexing from blowing up.
