lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2100) Make contrib analyzers final
Date Tue, 17 May 2011 04:37:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034549#comment-13034549 ]

Robert Muir commented on LUCENE-2100:
-------------------------------------

Esmond: hi, what you are doing here is exactly the reason why we made it final.

When you subclass StandardAnalyzer in this way, the indexer can no longer reuse TokenStreams,
which makes analysis very slow and inefficient.

The easiest way to get your PorterStemAnalyzer is to just use EnglishAnalyzer, which already
applies Porter stemming on top of the standard tokenization chain.

Otherwise, if you really want to do it yourself, do it like this:
{noformat}
Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Source of the chain: the tokenizer reads directly from the Reader.
    Tokenizer tokenizer = new StandardTokenizer(...);
    // Each filter wraps the previous stream; arguments are elided here.
    TokenStream filteredStream = new StandardFilter(tokenizer, ...);
    filteredStream = new LowerCaseFilter(filteredStream, ...);
    filteredStream = new StopFilter(filteredStream, ...);
    filteredStream = new PorterStemFilter(filteredStream, ...);
    // Hand back both ends of the chain so the base class can reuse them.
    return new TokenStreamComponents(tokenizer, filteredStream);
  }
};
{noformat}

Please see LUCENE-3055 for more examples and a more thorough explanation.

The good news is if you implement your analyzer like this, you will see performance improvements!
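
To make the reuse point a bit more concrete, here is a toy sketch in plain Java. It is not Lucene code: ReuseSketch, Chain, ReusingAnalyzer and RebuildingAnalyzer are hypothetical names made up for illustration, and the "chain" just lowercases whitespace-separated tokens. It only shows the general idea, assuming the expensive part is constructing the tokenizer/filter chain: one analyzer builds the chain once per thread and resets it for each new text, the other builds a new chain for every text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model only: none of these types are Lucene classes. */
final class ReuseSketch {

    /** Hypothetical stand-in for a tokenizer plus its filters. */
    static final class Chain {
        private String input = "";

        /** Point the existing chain at new text instead of constructing a new chain. */
        void reset(String text) {
            this.input = text;
        }

        /** Split on whitespace and lowercase each token. */
        List<String> tokens() {
            List<String> out = new ArrayList<>();
            for (String t : input.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t.toLowerCase(Locale.ROOT));
                }
            }
            return out;
        }
    }

    /** Builds the chain once per thread, then only resets it. */
    static final class ReusingAnalyzer {
        int chainsBuilt = 0; // counts constructions, for the demo only

        private final ThreadLocal<Chain> cached = ThreadLocal.withInitial(() -> {
            chainsBuilt++;
            return new Chain();
        });

        List<String> analyze(String text) {
            Chain chain = cached.get();
            chain.reset(text);
            return chain.tokens();
        }
    }

    /** Builds a fresh chain for every text, like an analyzer that cannot reuse. */
    static final class RebuildingAnalyzer {
        int chainsBuilt = 0; // counts constructions, for the demo only

        List<String> analyze(String text) {
            chainsBuilt++;
            Chain chain = new Chain();
            chain.reset(text);
            return chain.tokens();
        }
    }

    public static void main(String[] args) {
        String[] docs = {"Hello World", "Reuse The Chain", "Lucene Analysis"};
        ReusingAnalyzer reusing = new ReusingAnalyzer();
        RebuildingAnalyzer rebuilding = new RebuildingAnalyzer();
        for (String doc : docs) {
            reusing.analyze(doc);
            rebuilding.analyze(doc);
        }
        System.out.println("reusing built " + reusing.chainsBuilt + " chain(s)");       // 1
        System.out.println("rebuilding built " + rebuilding.chainsBuilt + " chain(s)"); // 3
    }
}
```

Both toy analyzers produce the same tokens; the only difference is how many chains get constructed on one thread, which is the cost that reusing components is meant to avoid.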


> Make contrib analyzers final
> ----------------------------
>
>                 Key: LUCENE-2100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 2.9, 2.9.1, 3.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the Analyzers
> should ever be subclassed - users should build their own analyzers if a different combination
> of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
