lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
Date Wed, 24 Feb 2010 11:54:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752
] 

Simon Willnauer commented on LUCENE-2279:
-----------------------------------------

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream?
I would totally agree with that, but I guess we cannot remove this method until Lucene 4.0,
which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence, but the problem with ReusableAnalyzerBase is that it
will break backwards compatibility if moved into Analyzer. It assumes both #reusableTokenStream
and #tokenStream to be final and introduces a new factory method. Yet, as an analyzer developer
you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases:
it requires you to write only half the code and gives you reusability of the TokenStream.

bq. I think Lucene/Solr/Nutch need to eventually get to this point
Huge +1 from my side. This could also unify the factory pattern Solr uses to build TokenStreams.
I would stop right here and ask to discuss it on the dev list - thoughts, Mike?!
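For context, the reuse contract ReusableAnalyzerBase enforces can be sketched without any Lucene dependencies. The names below (ReusableBase, Components, createComponents, reusableTokenStream) are illustrative stand-ins, not the actual Lucene API: the point is that the final template method handles per-thread caching, so a subclass only implements the single factory method - roughly half the code of overriding both #tokenStream and #reusableTokenStream by hand.

```java
import java.io.Reader;
import java.io.StringReader;

// Illustrative sketch of the template pattern behind ReusableAnalyzerBase.
abstract class ReusableBase {
    /** What the factory produces; reset() re-targets it at a new reader. */
    static class Components {
        Reader input;
        Components(Reader input) { this.input = input; }
        void reset(Reader input) { this.input = input; }
    }

    // Per-thread cache of the previously built components.
    private final ThreadLocal<Components> stored = new ThreadLocal<>();

    /** The single method a subclass implements. */
    protected abstract Components createComponents(Reader reader);

    /** Final template method: build once per thread, then reuse. */
    public final Components reusableTokenStream(Reader reader) {
        Components c = stored.get();
        if (c == null) {
            c = createComponents(reader);
            stored.set(c);
        } else {
            c.reset(reader);
        }
        return c;
    }
}

class DemoAnalyzer extends ReusableBase {
    int created = 0;

    @Override
    protected Components createComponents(Reader reader) {
        created++;
        return new Components(reader);
    }

    public static void main(String[] args) {
        DemoAnalyzer a = new DemoAnalyzer();
        a.reusableTokenStream(new StringReader("doc one"));
        a.reusableTokenStream(new StringReader("doc two"));
        System.out.println("factory calls: " + a.created); // 1, not 2
    }
}
```

Because the entry point is final, no subclass can accidentally bypass the cached path - which is exactly why folding this into Analyzer as-is would break backwards compatibility for existing subclasses that override #tokenStream.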



> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> This is because for each document, Analyzer.tokenStream() is called, which ends up calling
> the StopFilter (if used). And if a regular Set<String> is used in the StopFilter, all
> the elements of the set are copied to a CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> I feel we should make the StopFilter signature specific, i.e. accept a CharArraySet rather
> than a plain Set, and there should be a JavaDoc warning on the other variants of StopFilter,
> as they all result in a copy for each invocation of Analyzer.tokenStream().
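The cost the quoted ctor describes can be demonstrated without Lucene. In the sketch below, CharSet is a hypothetical stand-in for CharArraySet (i.e. the one set type the filter recognizes and does not copy), and adopt() mirrors the instanceof branch of the ctor above; the fix on the caller side is simply to build that set once and share the instance across documents:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

final class CopyOnceDemo {
    // Stand-in for CharArraySet: any set type the filter recognizes
    // and therefore adopts without copying.
    static final class CharSet extends HashSet<String> {
        CharSet(Collection<String> c) { super(c); }
    }

    static int copies = 0;

    // Mirrors the StopFilter ctor logic quoted above.
    static Set<String> adopt(Set<String> stopWords) {
        if (stopWords instanceof CharSet) {
            return stopWords;          // recognized type: no copy
        }
        copies++;                      // plain Set: full copy per call
        return new CharSet(stopWords);
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        for (int doc = 0; doc < 1000; doc++) adopt(plain);
        System.out.println("copies with plain Set: " + copies);      // 1000

        copies = 0;
        Set<String> shared = new CharSet(plain);   // built once, up front
        for (int doc = 0; doc < 1000; doc++) adopt(shared);
        System.out.println("copies with pre-built set: " + copies);  // 0
    }
}
```

With a plain Set the copy happens once per tokenStream() call, i.e. once per document; pre-building the recognized type reduces that to a single up-front copy.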

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

