lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
Date Wed, 24 Feb 2010 14:06:28 GMT


Michael McCandless commented on LUCENE-2279:

bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread.

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence, but the problem with ReusableAnalyzerBase is that it
would break bw compat if moved to Analyzer.

Right, this is why I was thinking that if we make a new analyzers package, it's a chance to
break/improve things.  We'd have a single abstract base class that only exposes the reuse API.

bq. in my opinion all the core analyzers (you already fixed contrib) should be final. 

I agree, and we should consistently take this approach w/ the new analyzers package...

bq. i still don't quite understand how it gives us more freedom to break/change the APIs,
i mean however we label this stuff, a break is a break to the user at the end of the day.

Because it'd be an entirely new package, we can create a new base Analyzer class (in that
package) that breaks/fixes things compared to Lucene's Analyzer class.

We'd eventually deprecate the analyzers/tokenizers/token filters in Lucene/Solr/Nutch in favor
of this new package, and users can switch over on their own schedule.
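
To make the "single abstract base class that only exposes reuse API" idea concrete, here is a hypothetical sketch, loosely modeled on ReusableAnalyzerBase. The `Components` class is a stand-in for the real Tokenizer-plus-filter chain, not a Lucene type, and the method names are illustrative only:

```java
// Stand-in for a Tokenizer plus its filter chain (not a Lucene type).
class Components {
    int resets;                     // counts reuses, for illustration only
    void reset() { resets++; }      // real code would reset the Tokenizer
}

// Hypothetical reuse-only base class: subclasses say how to build the
// chain; caching/reuse is implemented here exactly once.
abstract class ReuseOnlyAnalyzer {
    // per-thread cached chain, as reusableTokenStream implementations do
    private final ThreadLocal<Components> stored = new ThreadLocal<>();

    // the only extension point: creation, never reuse
    protected abstract Components createComponents(String fieldName);

    public final Components tokenStream(String fieldName) {
        Components c = stored.get();
        if (c == null) {
            c = createComponents(fieldName);   // first use on this thread
            stored.set(c);
        } else {
            c.reset();                         // every later call reuses it
        }
        return c;
    }
}
```

Because `tokenStream` is final, a subclass cannot accidentally opt out of reuse, which is the property that makes it safe to mark the concrete analyzers final as well.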

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>                 Key: LUCENE-2279
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
> passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> this is because for each document, Analyzer.tokenStream() is called, which ends up calling
> the StopFilter (if used). And if a regular Set<String> is used in the StopFilter, all
> the elements of the set are copied to a CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> i feel we should make the StopFilter signature specific, i.e. take a CharArraySet rather
> than a Set, and there should be a JavaDoc warning on the other variants of the StopFilter,
> as they all result in a copy for each invocation of Analyzer.tokenStream().
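
For context on why CharArraySet is worth insisting on at all: it stores its entries as char arrays, so a token's term buffer can be tested for membership directly, without allocating a String per token the way `HashSet<String>.contains(new String(buf, ...))` would. A rough stand-alone sketch of that idea (this is NOT Lucene's implementation; names and sizing are made up for illustration):

```java
// Hypothetical sketch of a CharArraySet-style stop set: an open-addressed
// table of char[] entries, queried straight from a token buffer.
class CharBufferSet {
    private final char[][] slots;   // open-addressed table of stored words

    CharBufferSet(String[] words) {
        // power-of-two table with plenty of slack; fine for a sketch
        slots = new char[Integer.highestOneBit(words.length * 4) << 1][];
        for (String w : words) {
            char[] chars = w.toCharArray();
            int slot = hash(chars, 0, chars.length) & (slots.length - 1);
            while (slots[slot] != null) {              // linear probing
                slot = (slot + 1) & (slots.length - 1);
            }
            slots[slot] = chars;
        }
    }

    // Membership test over a region of a token buffer: no allocation.
    boolean contains(char[] buf, int off, int len) {
        int slot = hash(buf, off, len) & (slots.length - 1);
        while (slots[slot] != null) {
            if (matches(slots[slot], buf, off, len)) return true;
            slot = (slot + 1) & (slots.length - 1);
        }
        return false;
    }

    private int hash(char[] buf, int off, int len) {
        int h = 0;                  // same formula as String.hashCode()
        for (int i = off; i < off + len; i++) h = 31 * h + buf[i];
        return h;
    }

    private boolean matches(char[] stored, char[] buf, int off, int len) {
        if (stored.length != len) return false;
        for (int i = 0; i < len; i++) {
            if (stored[i] != buf[off + i]) return false;
        }
        return true;
    }
}
```

Building the set once and calling `contains(termBuffer, 0, termLength)` per token keeps the hot loop allocation-free; the same reasoning is why passing an already-built CharArraySet to the StopFilter ctor above skips the per-tokenStream() copy the issue describes.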

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

