Message-ID: <1903503449.123421267955427275.JavaMail.jira@brutus.apache.org>
Date: Sun, 7 Mar 2010 09:50:27 +0000 (UTC)
From: "Uwe Schindler (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2295) Create a MaxFieldLengthAnalyzer to wrap any other Analyzer and provide the same functionality as MaxFieldLength provided on IndexWriter
In-Reply-To: <1214009040.96971267788447119.JavaMail.jira@brutus.apache.org>
Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm

[ 
https://issues.apache.org/jira/browse/LUCENE-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842384#action_12842384 ]

Uwe Schindler commented on LUCENE-2295:
---------------------------------------

Further investigation showed that using this filter/analyzer behaves differently from the current setting in IndexWriter. IndexWriter applies the given MaxFieldLength as the maximum for all instances of the same field name combined. So if you add 100 fields "foo" (each with 1,000 terms) and keep the default of 10,000 tokens, DocInverter will index 10 of these field instances (10,000 terms in total) and the rest will be suppressed. If you use the filter, the limit is per TokenStream, so the above example will index all field instances and produce 100,000 terms.

But the current IndexWriter code has a bug, too: the check for too many terms is done after the first token of each input stream is indexed, so in the above example, IW will index 10,089 terms, because once the limit is reached, each remaining stream will still index one term. This could be fixed (if really needed, as the MaxFieldLength in IW should be deprecated) by moving the check up, so that IW does not even try to index the field or create the TokenStream.

I just wanted to add this difference here for further discussion.

> Create a MaxFieldLengthAnalyzer to wrap any other Analyzer and provide the same functionality as MaxFieldLength provided on IndexWriter
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2295
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2295
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Shai Erera
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2295.patch
>
>
> A spinoff from LUCENE-2294.
> Instead of asking the user to specify his requested MFL limit on IndexWriter, we can get rid of this setting entirely by providing an Analyzer which wraps any other Analyzer and its TokenStream with a TokenFilter that keeps track of the number of tokens produced and stops when the limit has been reached.
> This will remove any count tracking in IW's indexing, which is done even if I specified UNLIMITED for MFL.
> Let's try to do it for 3.1.
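
The difference between the two counting schemes described in the comment can be illustrated with a minimal sketch. It is a simplified plain-Java model and not Lucene's API: a token stream is represented by an Iterator<String>, and all names (FieldLengthLimitSketch, limit, indexWithPerStreamLimit, indexWithSharedLimit, sample) are hypothetical. The shared-budget variant assumes the limit is checked before a stream is opened, which is the fix suggested in the comment, so it does not reproduce the extra terms caused by the current IndexWriter behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Simplified model (assumption): a "token stream" is just an Iterator<String>. */
final class FieldLengthLimitSketch {

    /** Per-stream limit: wraps one stream and stops after maxTokens tokens. */
    static Iterator<String> limit(final Iterator<String> in, final int maxTokens) {
        return new Iterator<String>() {
            private int count = 0; // tokens handed out by this wrapper so far

            @Override
            public boolean hasNext() {
                return count < maxTokens && in.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                count++;
                return in.next();
            }
        };
    }

    /** Every field instance gets its own limiting wrapper, so each has a fresh budget. */
    static int indexWithPerStreamLimit(List<List<String>> instances, int maxTokens) {
        int indexed = 0;
        for (List<String> instance : instances) {
            Iterator<String> it = limit(instance.iterator(), maxTokens);
            while (it.hasNext()) {
                it.next();
                indexed++;
            }
        }
        return indexed;
    }

    /** One budget shared by all instances of a field name, checked before a stream is opened. */
    static int indexWithSharedLimit(List<List<String>> instances, int maxTokens) {
        int indexed = 0;
        for (List<String> instance : instances) {
            if (indexed >= maxTokens) {
                break; // budget exhausted: do not even create the stream
            }
            Iterator<String> it = instance.iterator();
            while (indexed < maxTokens && it.hasNext()) {
                it.next();
                indexed++;
            }
        }
        return indexed;
    }

    /** Builds the given number of field instances, each holding termsEach identical terms. */
    static List<List<String>> sample(int instances, int termsEach) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            out.add(Collections.nCopies(termsEach, "term"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> foo = sample(100, 1000); // 100 instances of "foo", 1,000 terms each
        System.out.println(indexWithSharedLimit(foo, 10000));    // prints 10000
        System.out.println(indexWithPerStreamLimit(foo, 10000)); // prints 100000
    }
}
```

With 100 instances of 1,000 terms each and a limit of 10,000, the shared budget stops after 10 instances, while the per-stream wrapper never reaches its limit on any single instance and lets all 100,000 terms through. An application that relies on the per-field-name behaviour would therefore need to keep its own count across instances when switching to a wrapping filter.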