Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
Message-ID: <1361544806.469601266954148868.JavaMail.jira@brutus.apache.org>
Date: Tue, 23 Feb 2010 19:42:28 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set instead of CharArraySet
In-Reply-To: <1238542819.449261266885330874.JavaMail.jira@brutus.apache.org>

[
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837410#action_12837410 ]

Robert Muir commented on LUCENE-2279:
-------------------------------------

reusableTokenStream() is called again for each document. If you don't implement it, the default is to defer to tokenStream(), which must create new instances of StopFilter, LowerCaseFilter, and whatever else you have going on in your analyzer.

Instead, if you implement reusableTokenStream(), you can keep a reference to these things, just reset() your token filters, and pass the reader to your tokenizer's reset(Reader) method.

Of course, for this to work, you must implement reset() correctly in any custom filters you have: if they keep some state, such as accumulated offsets, that state should be reset to what it would be if you had just created a new instance.

For an example, see StandardAnalyzer.

> eliminate pathological performance on StopFilter when using a Set instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>
> Passing a Set to a StopFilter instead of a CharArraySet results in a very slow filter.
> This is because for each document, Analyzer.tokenStream() is called, which ends up constructing the StopFilter (if used).
> And if a regular Set is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:
>
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
> {
>   super(input);
>   if (stopWords instanceof CharArraySet) {
>     this.stopWords = (CharArraySet)stopWords;
>   } else {
>     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>     this.stopWords.addAll(stopWords);
>   }
>   this.enablePositionIncrements = enablePositionIncrements;
>   init();
> }
>
> I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs Set, and there should be a JavaDoc warning on using the other variants of the StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
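
For readers of the archive, here is a minimal sketch of the per-document copy described in this issue. It uses only java.util collections and does not call Lucene: FastSet and SketchFilter are hypothetical stand-ins for CharArraySet and StopFilter, and the only thing modeled is the instanceof check from the quoted constructor. It is meant to illustrate why converting the stop set once up front avoids the repeated copy, not to reproduce Lucene's actual behavior or timings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins, not Lucene classes: FastSet plays the role of
// CharArraySet and SketchFilter plays the role of StopFilter.
class StopSetReuseSketch {

    /** Stand-in for the specialized set type the filter uses internally. */
    static final class FastSet extends HashSet<String> {
        FastSet(Set<String> words) {
            super(words);
        }
    }

    /** Number of times a filter had to copy its stop words. */
    static int copies = 0;

    /** Mirrors the instanceof check in the quoted constructor. */
    static final class SketchFilter {
        final FastSet stopWords;

        SketchFilter(Set<String> stopWords) {
            if (stopWords instanceof FastSet) {
                // Already the specialized type: reuse the same instance.
                this.stopWords = (FastSet) stopWords;
            } else {
                // Any other Set: every construction copies all elements.
                this.stopWords = new FastSet(stopWords);
                copies++;
            }
        }
    }

    /** Builds one filter per "document" and returns how many copies were made. */
    static int copiesFor(Set<String> stopWords, int documents) {
        copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchFilter(stopWords);
        }
        return copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new TreeSet<>(Arrays.asList("a", "an", "the"));
        Set<String> prepared = new FastSet(plain); // converted once, up front

        System.out.println("plain set copies:    " + copiesFor(plain, 1000));
        System.out.println("prepared set copies: " + copiesFor(prepared, 1000));
    }
}
```

With the plain set, the copy count equals the number of filters constructed; with the prepared set it stays at zero. Implementing reusableTokenStream(), as suggested in the comment above, goes one step further by not constructing a new filter per document at all.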