lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
Date Tue, 23 Feb 2010 19:42:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837410#action_12837410 ]

Robert Muir commented on LUCENE-2279:
-------------------------------------

reusableTokenStream() is called again for each document. If you don't implement it, the default
is to defer to tokenStream(), which must create new instances of StopFilter, LowerCaseFilter,
and whatever else your analyzer uses.

Instead, if you implement reusableTokenStream(), you can keep a reference to these objects,
call reset() on your token filters, and pass the new reader to your tokenizer's reset(Reader)
method.
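
A minimal sketch of the reuse pattern described above. It deliberately does not use the Lucene
classes: SketchTokenizer, SketchStopFilter, and SketchAnalyzer are hypothetical stand-ins, and
the instancesCreated counter exists only for illustration. The point is the shape of
reusableTokenStream(): build the chain once, then reset it for each new reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a tokenizer: splits the input on whitespace.
class SketchTokenizer {
    private Reader input;

    SketchTokenizer(Reader input) { this.input = input; }

    // Point the existing tokenizer at a new document.
    void reset(Reader input) { this.input = input; }

    // Returns the next token, or null at end of input.
    String next() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (!Character.isWhitespace(c)) {
                    sb.append((char) c);
                } else if (sb.length() > 0) {
                    break; // whitespace after a token ends the token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}

// Hypothetical stand-in for a stop filter; it copies the stop set on construction.
class SketchStopFilter {
    static int instancesCreated = 0; // illustration only: counts constructions

    private final SketchTokenizer input;
    private final Set<String> stopWords;

    SketchStopFilter(SketchTokenizer input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = new HashSet<String>(stopWords); // cost paid per construction
        instancesCreated++;
    }

    // This filter keeps no per-document state, so there is nothing to clear.
    void reset() { }

    // Returns the next token that is not a stop word, or null at end of input.
    String next() {
        String t;
        while ((t = input.next()) != null) {
            if (!stopWords.contains(t)) return t;
        }
        return null;
    }
}

class SketchAnalyzer {
    private final Set<String> stopWords;
    private SketchTokenizer tokenizer; // kept between documents
    private SketchStopFilter filter;   // kept between documents

    SketchAnalyzer(Set<String> stopWords) { this.stopWords = stopWords; }

    // Builds a new chain on every call.
    SketchStopFilter tokenStream(Reader reader) {
        return new SketchStopFilter(new SketchTokenizer(reader), stopWords);
    }

    // Builds the chain on the first call, then resets and reuses it.
    SketchStopFilter reusableTokenStream(Reader reader) {
        if (filter == null) {
            tokenizer = new SketchTokenizer(reader);
            filter = new SketchStopFilter(tokenizer, stopWords);
        } else {
            tokenizer.reset(reader);
            filter.reset();
        }
        return filter;
    }

    // Helper for the sketch: collects all tokens from a filter.
    static List<String> drain(SketchStopFilter f) {
        List<String> out = new ArrayList<String>();
        String t;
        while ((t = f.next()) != null) out.add(t);
        return out;
    }
}
```

In this sketch, calling reusableTokenStream() for any number of documents constructs the filter
once, while tokenStream() constructs (and copies the stop set) once per document.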

Of course, for this to work you must implement reset() correctly in any custom filters you
have: if they keep state, such as accumulated offsets, reset() must restore that state to
what it would be in a newly created filter.
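
To illustrate what "correctly" means here, a small sketch with a hypothetical filter that keeps
a running offset. OffsetTaggingFilter and its methods are invented for this example and are not
Lucene classes; the only point is that reset() has to put every piece of per-document state
back to its initial value, so that a reused filter behaves like a new one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter that tags each token with its starting character offset.
class OffsetTaggingFilter {
    private int offset = 0; // per-document state accumulated while consuming tokens

    // Returns "token@startOffset" and advances the running offset.
    String process(String token) {
        String tagged = token + "@" + offset;
        offset += token.length() + 1; // +1 for the single separating space
        return tagged;
    }

    // Restores the per-document state to its initial value.
    void reset() {
        offset = 0;
    }

    // Helper for the sketch: runs a space-separated document through the filter.
    static List<String> run(OffsetTaggingFilter f, String doc) {
        List<String> out = new ArrayList<String>();
        for (String t : doc.split(" ")) {
            out.add(f.process(t));
        }
        return out;
    }
}
```

If reset() were left empty, the second document run through a reused filter would start at the
offset where the first document ended, instead of at 0.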

For an example, see StandardAnalyzer.

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> This is because Analyzer.tokenStream() is called for each document, which ends up constructing
> the StopFilter (if used). If a regular Set<String> is passed to the StopFilter, all the elements
> of the set are copied to a CharArraySet, as we can see in its constructor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> I feel we should make the StopFilter signature specific, taking a CharArraySet rather than
> a Set, and there should be a Javadoc warning on the other variants of the StopFilter
> constructor, as they all result in a copy for each invocation of Analyzer.tokenStream().
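
The cost described in the report can be sketched without Lucene. PreparedStopSet and
CopyingFilter below are hypothetical stand-ins (for CharArraySet and StopFilter respectively),
and the copies counter is there only to make the effect visible. The constructor follows the
same instanceof branch as the quoted code: a set that is already of the specialized type is
reused, anything else is copied.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the specialized set type.
class PreparedStopSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;
    static int copies = 0; // illustration only: counts how often a set is copied

    PreparedStopSet(Collection<String> words) {
        super(words);
        copies++;
    }
}

// Hypothetical stand-in for a filter whose constructor mirrors the quoted branch.
class CopyingFilter {
    final PreparedStopSet stopWords;

    CopyingFilter(Set<String> stopWords) {
        if (stopWords instanceof PreparedStopSet) {
            this.stopWords = (PreparedStopSet) stopWords; // reused as-is, no copy
        } else {
            this.stopWords = new PreparedStopSet(stopWords); // copied on every construction
        }
    }
}
```

Under these assumptions, converting the stop words once up front and passing the converted set
to every filter avoids the per-document copy, even when a new filter is built for each document.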

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
