lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <>
Subject [jira] [Updated] (LUCENE-5803) Add another AnalyzerWrapper class that does not have its own cache, so delegate-only wrappers don't create thread local resources several times
Date Thu, 03 Jul 2014 23:47:34 GMT


Uwe Schindler updated LUCENE-5803:

    Attachment: LUCENE-5803.patch


I added more Javadocs and tried to work around the annoying problem that a super constructor
call cannot reference {{this}}. It would be possible to do this by using the passed-in
Analyzer, but then we lose the check that throws the IllegalStateException.

We need this check, otherwise you would be able to corrupt your analyzers: If you wrap this
analyzer again with some other analyzer that uses the delegate's reuse strategy, e.g., {{new
ShingleAnalyzerWrapper(new PerFieldAnalyzerWrapper(....))}}, the ShingleAnalyzerWrapper would
reuse the PerFieldAnalyzerWrapper's strategy (which is private to the PerFieldAnalyzerWrapper)
and thereby inject illegal TokenStreamComponents into the inner analyzer's cache. So we must
disallow this.

This patch still lacks tests for this special case, and tests that verify everything works as intended.

Solr also uses PerFieldAnalyzerWrapper, so we will see the improvements in Solr, too (not only in
Elasticsearch). In fact, PER_FIELD_REUSE_STRATEGY is no longer used for pure per-field delegates.
We no longer have one TokenStream instance per field; we have one instance per delegate Analyzer.
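
To make the "one instance per delegate Analyzer" point concrete, here is a rough toy model. It does
not use Lucene at all: {{ToyAnalyzer}}, {{ToyComponents}} and both counting methods are hypothetical
stand-ins invented for this illustration. It only compares how many cached entries a per-field cache
holds with how many a per-delegate cache holds for the same field mapping.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

/** Toy model only: hypothetical stand-ins, not Lucene classes. */
public class ReuseCount {

    /** Stand-in for a delegate analyzer; only its identity matters here. */
    static final class ToyAnalyzer {
        final String name;
        ToyAnalyzer(String name) { this.name = name; }
    }

    /** Stand-in for a cached TokenStreamComponents instance. */
    static final class ToyComponents {
        final ToyAnalyzer owner;
        ToyComponents(ToyAnalyzer owner) { this.owner = owner; }
    }

    /** Per-field caching: one cached entry for every field name. */
    static int cachedPerField(Map<String, ToyAnalyzer> fieldToAnalyzer) {
        Map<String, ToyComponents> cache = new HashMap<>();
        for (Map.Entry<String, ToyAnalyzer> e : fieldToAnalyzer.entrySet()) {
            cache.computeIfAbsent(e.getKey(), f -> new ToyComponents(e.getValue()));
        }
        return cache.size();
    }

    /** Delegate-only caching: one cached entry for every distinct delegate. */
    static int cachedPerDelegate(Map<String, ToyAnalyzer> fieldToAnalyzer) {
        Map<ToyAnalyzer, ToyComponents> cache = new IdentityHashMap<>();
        for (ToyAnalyzer a : fieldToAnalyzer.values()) {
            cache.computeIfAbsent(a, ToyComponents::new);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        ToyAnalyzer standard = new ToyAnalyzer("standard");
        ToyAnalyzer keyword = new ToyAnalyzer("keyword");

        Map<String, ToyAnalyzer> fields = new HashMap<>();
        fields.put("title", standard);
        fields.put("body", standard);
        fields.put("author", standard);
        fields.put("id", keyword);
        fields.put("tag", keyword);

        System.out.println("per field:    " + cachedPerField(fields));    // 5
        System.out.println("per delegate: " + cachedPerDelegate(fields)); // 2
    }
}
```

In a real setup these entries live in thread-local storage, so the difference is multiplied by the
number of threads and by the number of wrappers that share the same delegates.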

> Add another AnalyzerWrapper class that does not have its own cache, so delegate-only
wrappers don't create thread local resources several times
> -----------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: LUCENE-5803
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 5.0, 4.10
>         Attachments: LUCENE-5803.patch
> This is a followup issue for the following Elasticsearch issue:
> Basically the problem is the following:
> - Elasticsearch has a pool of Analyzers that are used for analysis in several indexes
> - Each index uses a different PerFieldAnalyzerWrapper
> PerFieldAnalyzerWrapper uses PER_FIELD_REUSE_STRATEGY. Because of this it caches the
tokenstreams for every field. If there are many fields, this adds up to a lot. In addition, the
underlying analyzers may also cache tokenstreams, and other PerFieldAnalyzerWrappers do the same,
although the delegate Analyzer can always return the same components.
> We should add code similar to Elasticsearch's directly to Lucene: If the delegating Analyzer
just delegates per field or just wraps CharFilters around the Reader, there is no need to
cache the TokenStreamComponents a second time in the delegating Analyzer. This is only needed
if the delegating Analyzer adds additional TokenFilters (like ShingleAnalyzerWrapper).
> We should name this new class DelegatingAnalyzerWrapper extends AnalyzerWrapper. The
wrapComponents method must be final, because we are not allowed to add additional TokenFilters,
but unlike ES, we don't need to disallow wrapping with CharFilters.
> Internally this class uses a private ReuseStrategy that just delegates to the underlying
analyzer. It does not matter here if the strategy of the delegate is global or per field;
that is private to the delegate.
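
The delegating strategy described above, together with the IllegalStateException check, might look
roughly like the following toy model. This is not the patch and not Lucene's API: every type here
({{ToyAnalyzer}}, {{Leaf}}, {{DelegatingWrapper}}, the two-method {{ReuseStrategy}} interface, and the
single {{stored}} slot standing in for thread-local storage) is a simplified assumption made for
illustration. The sketch also sidesteps the super-constructor restriction by returning the strategy
from a method instead of passing it to a constructor.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: hypothetical stand-ins that mimic the idea, not Lucene's classes. */
public class DelegatingReuseSketch {

    static final class Components {
        final String createdBy;
        Components(String createdBy) { this.createdBy = createdBy; }
    }

    interface ReuseStrategy {
        Components get(ToyAnalyzer analyzer, String field);
        void set(ToyAnalyzer analyzer, String field, Components c);
    }

    static abstract class ToyAnalyzer {
        Components stored; // single cache slot (stands in for thread-local storage)
        int created = 0;   // how many Components this analyzer has built

        abstract ReuseStrategy strategy();
        abstract Components create(String field);

        /** Ask the strategy for cached components; build and cache them on a miss. */
        final Components components(String field) {
            Components c = strategy().get(this, field);
            if (c == null) {
                c = create(field);
                strategy().set(this, field, c);
            }
            return c;
        }
    }

    /** Global reuse: one cached instance per analyzer, regardless of field. */
    static final ReuseStrategy GLOBAL = new ReuseStrategy() {
        public Components get(ToyAnalyzer a, String field) { return a.stored; }
        public void set(ToyAnalyzer a, String field, Components c) { a.stored = c; }
    };

    static final class Leaf extends ToyAnalyzer {
        final String name;
        Leaf(String name) { this.name = name; }
        ReuseStrategy strategy() { return GLOBAL; }
        Components create(String field) { created++; return new Components(name); }
    }

    /** Delegate-only wrapper: has no cache of its own, uses the delegate's cache. */
    static final class DelegatingWrapper extends ToyAnalyzer {
        private final Map<String, ToyAnalyzer> perField;
        private final ToyAnalyzer fallback;

        private final ReuseStrategy delegating = new ReuseStrategy() {
            public Components get(ToyAnalyzer a, String field) {
                check(a);
                ToyAnalyzer d = wrapped(field);
                return d.strategy().get(d, field);
            }
            public void set(ToyAnalyzer a, String field, Components c) {
                check(a);
                ToyAnalyzer d = wrapped(field);
                d.strategy().set(d, field, c);
            }
        };

        DelegatingWrapper(Map<String, ToyAnalyzer> perField, ToyAnalyzer fallback) {
            this.perField = perField;
            this.fallback = fallback;
        }

        /** The strategy is private to this wrapper; reject any other caller. */
        private void check(ToyAnalyzer a) {
            if (a != this) {
                throw new IllegalStateException("strategy is private to its wrapper");
            }
        }

        ToyAnalyzer wrapped(String field) { return perField.getOrDefault(field, fallback); }
        ReuseStrategy strategy() { return delegating; }
        Components create(String field) { return wrapped(field).create(field); }
    }

    public static void main(String[] args) {
        Leaf standard = new Leaf("standard");
        Map<String, ToyAnalyzer> mapping = new HashMap<>();
        mapping.put("id", new Leaf("keyword"));
        DelegatingWrapper wrapper = new DelegatingWrapper(mapping, standard);

        // Two fields with the same delegate share one cached instance.
        System.out.println(wrapper.components("title") == wrapper.components("body")); // true
        System.out.println(standard.created); // 1
        try {
            wrapper.strategy().get(standard, "title"); // another analyzer borrowing the strategy
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the wrapper's own {{stored}} slot is never written, which mirrors the goal of not
caching the components a second time in the delegating analyzer.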
