lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1794) implement reusableTokenStream for all contrib analyzers
Date Fri, 14 Aug 2009 12:13:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743183#action_12743183 ]

Shai Erera commented on LUCENE-1794:
------------------------------------

We only need getTokenizer because TokenStream.reset() does not accept a Reader. If we could
introduce such a method on TokenStream, we wouldn't need to refer to Tokenizer directly.
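
To illustrate the idea, here is a minimal sketch of what a reset that accepts a Reader could look like. It does not use Lucene's classes: SketchStream, SketchWhitespaceSource and SketchLowerCaseFilter are hypothetical stand-ins, and the reset(Reader) signature is an assumption about the proposed method, not an existing API. The point is only that a filter can forward the new Reader down the chain, so the caller never has to reach for the underlying Tokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical stand-in for TokenStream, extended with the proposed reset(Reader).
abstract class SketchStream {
    // Returns the next token, or null once the input is exhausted.
    abstract String next();

    // Assumed signature: re-point the whole chain at a new Reader.
    abstract void reset(Reader reader);
}

// Source of the chain (the role a Tokenizer plays): splits on whitespace.
class SketchWhitespaceSource extends SketchStream {
    private Reader input;

    SketchWhitespaceSource(Reader input) {
        this.input = input;
    }

    @Override
    String next() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (sb.length() > 0) {
                        break; // end of the current token
                    }
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    @Override
    void reset(Reader reader) {
        this.input = reader; // only the source actually holds the Reader
    }
}

// Filter stage: lower-cases tokens and forwards reset(Reader) to its input.
class SketchLowerCaseFilter extends SketchStream {
    private final SketchStream in;

    SketchLowerCaseFilter(SketchStream in) {
        this.in = in;
    }

    @Override
    String next() {
        String token = in.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }

    @Override
    void reset(Reader reader) {
        in.reset(reader); // no reference to the source type is needed here
    }
}
```

With this shape, the reusing code only holds the outermost stream and calls reset(reader) on it; the Reader ends up at the source without a getTokenizer accessor.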

bq. do you have any ideas on the back compat issues?

Well it's a bit trickier ... today we call reusableTokenStream in our indexing code, and either
get a new instance, or a reused instance. We cannot change Analyzer's default behavior, which
returns a new instance (unless we're willing to break back-compat), because Analyzers that
did not override reusableTokenStream, may break if we start reusing the instance by default
(for example if they add two fields to a document w/ reusableTokenStream called twice).

Also, deprecating reusableTokenStream, defining a new method (say reuseTokenStream), and moving
to use it is not good either, since we want its default impl to reuse the token stream, and
impls that did not override it may break.

So how about if we create a new abstract ReusingAnalyzer which impls reusableTokenStream to
always reuse it. And we add Streams to Analyzer as a protected static class. That way, Analyzers
that don't care about reuse, can still extend Analyzer. Analyzers which care about reuse and
are fine w/ ReusingAnalyzer's impl, can move to extend it. And Analyzers that care about reuse
but want their reuse to be done differently can choose to extend ReusingAnalyzer, or Analyzer.
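
A rough sketch of that layout, to make the proposal concrete. It uses stand-in classes rather than Lucene's real ones: SketchAnalyzer, SketchReusingAnalyzer and SketchTokenStream are hypothetical names, the contents of Streams and the per-thread storage are assumptions, and reset(Reader) is the method suggested above rather than an existing API.

```java
import java.io.Reader;

// Hypothetical stand-in for TokenStream with the suggested reset(Reader).
abstract class SketchTokenStream {
    abstract void reset(Reader reader);
}

// Stand-in for Analyzer: the default behavior is unchanged (no reuse).
abstract class SketchAnalyzer {
    // The proposed protected static holder; its single field is an assumption.
    protected static class Streams {
        SketchTokenStream result;
    }

    // Assumed storage: one Streams holder per thread.
    private final ThreadLocal<Streams> previous = new ThreadLocal<>();

    protected Streams getPreviousStreams() {
        return previous.get();
    }

    protected void setPreviousStreams(Streams streams) {
        previous.set(streams);
    }

    public abstract SketchTokenStream tokenStream(String fieldName, Reader reader);

    // Default keeps returning a new instance, so existing subclasses see no change.
    public SketchTokenStream reusableTokenStream(String fieldName, Reader reader) {
        return tokenStream(fieldName, reader);
    }
}

// Stand-in for the proposed ReusingAnalyzer: always reuses the saved stream.
abstract class SketchReusingAnalyzer extends SketchAnalyzer {
    @Override
    public SketchTokenStream reusableTokenStream(String fieldName, Reader reader) {
        Streams streams = getPreviousStreams();
        if (streams == null) {
            // First call on this thread: build the chain once and remember it.
            streams = new Streams();
            streams.result = tokenStream(fieldName, reader);
            setPreviousStreams(streams);
        } else {
            // Later calls: point the existing chain at the new Reader.
            streams.result.reset(reader);
        }
        return streams.result;
    }
}
```

An analyzer that extends SketchAnalyzer directly gets a fresh stream on every call, as today; one that extends SketchReusingAnalyzer gets the same instance back on the same thread, reset to the new Reader.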

Back-compat wise, we're safe since:
# Existing Lucene Analyzers that reuse can be changed to extend ReusingAnalyzer.
# Existing Analyzers (outside Lucene code) either override reusableTokenStream or do not; either
way their behavior is unchanged, and therefore they won't break.
# Our indexing code will still call reusableTokenStream, no change here.
# Any code out there which traverses an Analyzer by calling reusableTokenStream does not need
to change anything.

I think that'd work?

> implement reusableTokenStream for all contrib analyzers
> -------------------------------------------------------
>
>                 Key: LUCENE-1794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch
>
>
> most contrib analyzers do not have an impl for reusableTokenStream
> regardless of how expensive the back compat reflection is for indexing speed, I think we should do this to mitigate any performance costs. hey, overall it might even be an improvement!
> the back compat code for non-final analyzers is already in place so this is easy money in my opinion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

