Message-ID: <9ac0c6aa0712310954p7269d29rd3f3dfd4e5ea6bb0@mail.gmail.com>
Date: Mon, 31 Dec 2007 12:54:38 -0500
From: "Michael McCandless"
To: java-dev@lucene.apache.org
Subject: Re: DocumentsWriter.checkMaxTermLength issues
References: <2EE85DAC-AE41-490D-929A-7E929F737057@apache.org>
 <9ac0c6aa0712310253n4a649e8dse21b5ec94f86168@mail.gmail.com>
 <75455710-5470-472F-800D-201E4FED9F0F@apache.org>

I actually think indexing should try to be as robust as possible. You
could test like crazy and never hit a massive term, then go into
production (say, ship your app to lots of your customers' computers)
only to suddenly see this exception. In general it could be a long
time before you or your users "accidentally" see this.

So I'm thinking we should have the default behavior, in IndexWriter,
be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.

Mike

Yonik Seeley wrote:
> On Dec 31, 2007 12:25 PM, Grant Ingersoll wrote:
> > Sure, but I mean in the >16K (in other words, in the case where
> > DocsWriter fails, which presumably only DocsWriter knows about) case.
> > I want the option to ignore tokens larger than that instead of failing/
> > throwing an exception.
>
> I think the issue here is what the default behavior for IndexWriter should be.
>
> If configuration is required because something other than the default
> is desired, then one could use a TokenFilter to change the behavior
> rather than changing options on IndexWriter. Using a TokenFilter is
> much more flexible.
>
> > Imagine I am charged w/ indexing some data
> > that I don't know anything about (i.e. computer forensics), my goal
> > would be to index as much as possible in my first raw pass, so that I
> > can then begin to explore the dataset. Having it completely discard
> > the document is not a good thing, but throwing away some large binary
> > tokens would be acceptable (especially if I get warnings about said
> > tokens) and robust.
>
> -Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
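
For illustration only, here is a minimal sketch of the "skip immense terms and warn" behavior discussed in this thread. It is plain Java and deliberately does not use the Lucene API: the class `ImmenseTermSkipper`, its methods, and the configurable limit are all hypothetical names chosen for this example, and the limit is a parameter rather than the >16K threshold mentioned above. In Lucene itself the same idea would presumably live in a TokenFilter or in IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: drops over-long tokens
// instead of throwing, and remembers enough to warn about them.
final class ImmenseTermSkipper {
    private static final int PREFIX_CHARS = 30;   // assumed size of the warning excerpt

    private final int maxTermLength;              // assumed, caller-supplied limit
    private final List<String> skippedPrefixes = new ArrayList<>();

    ImmenseTermSkipper(int maxTermLength) {
        if (maxTermLength < 1) {
            throw new IllegalArgumentException("maxTermLength must be >= 1");
        }
        this.maxTermLength = maxTermLength;
    }

    /** Returns the tokens that fit; longer ones are recorded, never thrown on. */
    List<String> filter(Iterable<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() > maxTermLength) {
                // Keep only a short prefix so the warning data stays small.
                skippedPrefixes.add(token.substring(0, Math.min(PREFIX_CHARS, token.length())));
            } else {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Number of tokens skipped so far across all calls to filter(). */
    int skippedCount() {
        return skippedPrefixes.size();
    }

    /** Prefixes of the skipped tokens, suitable for a warning message. */
    List<String> skippedPrefixes() {
        return new ArrayList<>(skippedPrefixes);
    }
}
```

A caller would construct one skipper per document, pass its token sequence through `filter`, index the returned tokens, and log `skippedPrefixes()` if `skippedCount()` is non-zero. That keeps the rest of the document indexable while still surfacing the warning Grant asked for.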