lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
Date Wed, 24 Feb 2010 12:36:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759 ]

Robert Muir commented on LUCENE-2279:
-------------------------------------

bq. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor
of Analyzer in 99% of the cases; it requires you to write half of the code and gives
you reusability of the tokenStream

and the 1% of extremely advanced cases that can't reuse can just use TokenStreams directly when
indexing, i.e. the Analyzer class could be reusable by definition. we shouldn't let these obscure
cases slow down everyone else.

bq. It assumes both #reusableTokenStream and #tokenStream to be final

in my opinion all the core analyzers (you already fixed contrib) should be final. this is
another trap: if you subclass one of these analyzers and implement 'tokenStream', it's immediately
slow due to the backwards-compatibility code.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

if this is what we should do to remove the code duplication, then i am all for it. i still
don't quite understand how it gives us more freedom to break/change the APIs. i mean, however
we label this stuff, a break is a break to the user at the end of the day.

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> this is because for each document, Analyzer.tokenStream() is called, which ends up constructing
> the StopFilter (if used). And if a regular Set<String> is used in the StopFilter, all
> the elements of the set are copied to a CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> i feel we should make the StopFilter signature specific, i.e. take a CharArraySet rather
> than a Set, and there should be a JavaDoc warning on the other variants of the StopFilter,
> as they all result in a copy for each invocation of Analyzer.tokenStream().
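
The cost described above can be illustrated without Lucene on the classpath. The following is only a minimal sketch: FastSet and Filter are hypothetical stand-ins for CharArraySet and StopFilter (they are not Lucene classes), and the constructor merely mirrors the instanceof check quoted in the issue. It counts how many set copies are made when a filter is built once per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopSetReuseSketch {

    /** Hypothetical stand-in for CharArraySet; counts how often a copy is built. */
    static final class FastSet extends HashSet<String> {
        static int copies = 0;

        FastSet(Set<String> source) {
            super(source);
            copies++;
        }
    }

    /** Hypothetical stand-in for StopFilter, mirroring the instanceof check in the quoted ctor. */
    static final class Filter {
        final FastSet stopWords;

        Filter(Set<String> stopWords) {
            if (stopWords instanceof FastSet) {
                this.stopWords = (FastSet) stopWords;    // reused as-is, no copy
            } else {
                this.stopWords = new FastSet(stopWords); // full copy on every construction
            }
        }
    }

    /** Builds one filter per "document" and returns how many set copies that caused. */
    static int copiesFor(Set<String> stopWords, int documents) {
        int before = FastSet.copies;
        for (int i = 0; i < documents; i++) {
            new Filter(stopWords);
        }
        return FastSet.copies - before;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        FastSet prebuilt = new FastSet(plain); // converted once, up front

        System.out.println("plain Set copies:    " + copiesFor(plain, 1000));
        System.out.println("prebuilt set copies: " + copiesFor(prebuilt, 1000));
    }
}
```

With a plain Set, the sketch makes one copy per filter. With the prebuilt set, it makes none. In real code, the equivalent would presumably be to build the CharArraySet once (for example in the analyzer's constructor) and pass that same instance to every StopFilter.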

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

