Date: 8 Apr 2003 20:00:12 -0000
Message-ID: <20030408200012.3468.qmail@nagoya.betaversion.org>
From: bugzilla@apache.org
To: lucene-dev@jakarta.apache.org
Subject: DO NOT REPLY [Bug 18833] - maxFieldLength design flaw: large documents silently truncated
List-Id: "Lucene Developers List"

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18833

maxFieldLength design flaw: large documents silently truncated

------- Additional Comments From cutting@apache.org 2003-04-08 20:00 -------

This is fairly common in search engines.
For example, Google silently truncates pages whose HTML is longer than 100 kB, around the same point where Lucene truncates. The problem is that crawlers and file system walkers would otherwise attempt to index things like gigantic log files, binaries, etc.

I see your point, though: for some classes of use, where the set of documents is tightly controlled and every single word must be indexed, this is a problem. The workaround is simple, although perhaps not obvious. My concern with changing the default is that it would break things for all those folks who depend on the current setting to keep their indexing from blowing up.
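The comment does not spell the workaround out. Presumably it is to raise the limit named in the bug title (maxFieldLength) for collections where every word must be indexed. The following is a minimal, self-contained sketch of how a per-field term cap behaves and how raising it removes the truncation. It is not Lucene code: the class FieldLimitSketch, the method indexField, and the default of 10000 terms are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an indexer with a per-field term limit. Not Lucene code. */
class FieldLimitSketch {

    /** Illustrative default cap; an assumption, not a value stated in this thread. */
    static final int DEFAULT_MAX_FIELD_LENGTH = 10000;

    /** Outcome of indexing one field: the terms kept and the number dropped. */
    static final class Result {
        final List<String> terms;
        final int dropped;

        Result(List<String> terms, int dropped) {
            this.terms = terms;
            this.dropped = dropped;
        }

        boolean truncated() {
            return dropped > 0;
        }
    }

    /** Splits text on whitespace and keeps at most maxFieldLength terms. */
    static Result indexField(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        int dropped = 0;
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (kept.size() < maxFieldLength) {
                kept.add(term);
            } else {
                dropped++; // counted here so the caller can detect truncation
            }
        }
        return new Result(kept, dropped);
    }

    public static void main(String[] args) {
        String doc = "alpha beta gamma delta epsilon";

        // With a low cap, the tail of the document never reaches the index.
        Result capped = indexField(doc, 3);
        System.out.println(capped.terms + " dropped=" + capped.dropped);
        // prints: [alpha, beta, gamma] dropped=2

        // Raising the cap indexes every term, at the cost of unbounded work
        // on very large inputs such as log files or binaries.
        Result full = indexField(doc, Integer.MAX_VALUE);
        System.out.println(full.terms + " dropped=" + full.dropped);
        // prints: [alpha, beta, gamma, delta, epsilon] dropped=0
    }
}
```

Unlike the behaviour described in the bug, this sketch counts the dropped terms, so a caller with a tightly controlled document set could at least detect truncation while leaving the default cap in place for crawlers.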