lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1906) Problem with CharStream and Tokenizers with custom reset(Reader) method
Date Thu, 10 Sep 2009 18:46:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753762#action_12753762 ]

Michael McCandless commented on LUCENE-1906:
--------------------------------------------

bq. A recompile is only needed in rare cases (if you override Scorers and so on)

Or if you implement Searchable (an interface that we've added methods to), or implement Weight
(an interface that we changed to an abstract class), or write your own MergePolicy (an IndexWriter
instance is now required by the ctor).  I agree these are way-expert things.

{quote}
bq. In my opinion, e.g. external language Tokenizer-Packages (as Michael Busch calls them)
without source code would not work. This example is always brought by Michael.

Excellent point. Hadn't seen it before or didn't remember it.
{quote}

OK I agree, this does make me nervous, too.

OK, you've convinced me, Uwe: I think we should in fact restore input to be a Reader, not a CharStream.
I think the potential performance hit is the lesser evil here.

Maybe for 3.0 we can declare that this will become a CharStream?

> Problem with CharStream and Tokenizers with custom reset(Reader) method
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1906
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1906
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Blocker
>             Fix For: 2.9
>
>         Attachments: backwards-break.patch, LUCENE-1906.patch, LUCENE-1906.patch, LUCENE-1906_contrib.patch
>
>
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious backwards-compatibility problem with Tokenizers that do
> not override reset(CharStream).
> The problem is that, e.g., CharTokenizer only overrides reset(Reader):
> {code}
>   public void reset(Reader input) throws IOException {
>     super.reset(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>   }
> {code}
> If you reset such a Tokenizer with a CharStream (rather than a plain Reader), this
> method is never called, which breaks the whole Tokenizer.
> As CharStream extends Reader, I propose to remove this reset(CharStream)
> method and simply do an instanceof check to detect whether the supplied Reader
> is not a CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and wraps as needed, the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
> As this instanceof check is always done in CharReader.get(), why not remove
> ctor(CharStream) and reset(CharStream) completely?
> Any thoughts?
> I would like to fix this somehow before RC4, I'm sorry :(
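
For readers who want to see the overload problem from the description in isolation, here is a minimal sketch. It does not use the real Lucene classes: SketchCharStream, SketchCharReader, TwoOverloadBase, LegacyTokenizer, SingleResetBase and FixedTokenizer are hypothetical stand-ins invented for illustration, and the starting value 7 for bufferIndex is just a marker for stale state. The only behavior assumed from the issue is that CharStream extends Reader and that CharReader.get() does an instanceof check before wrapping.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ResetOverloadSketch {

    /** Hypothetical stand-in for CharStream: a Reader that can also correct offsets. */
    static abstract class SketchCharStream extends Reader {
        abstract int correctOffset(int currentOff);
    }

    /** Hypothetical stand-in for CharReader: wraps a plain Reader unless it already is a char stream. */
    static final class SketchCharReader extends SketchCharStream {
        private final Reader in;
        private SketchCharReader(Reader in) { this.in = in; }

        static SketchCharStream get(Reader input) {
            return input instanceof SketchCharStream
                    ? (SketchCharStream) input
                    : new SketchCharReader(input);
        }
        @Override int correctOffset(int currentOff) { return currentOff; }
        @Override public int read(char[] cbuf, int off, int len) throws IOException {
            return in.read(cbuf, off, len);
        }
        @Override public void close() throws IOException { in.close(); }
    }

    /** Problematic shape: the base class offers two reset overloads. */
    static class TwoOverloadBase {
        protected SketchCharStream input;
        public void reset(Reader input) throws IOException { this.input = SketchCharReader.get(input); }
        public void reset(SketchCharStream input) throws IOException { this.input = input; }
    }

    /** Overrides only reset(Reader), like the CharTokenizer snippet in the issue. */
    static class LegacyTokenizer extends TwoOverloadBase {
        int bufferIndex = 7; // pretend this is stale state from a previous document
        @Override public void reset(Reader input) throws IOException {
            super.reset(input);
            bufferIndex = 0;
        }
    }

    /** Proposed shape: a single reset(Reader) that wraps via instanceof. */
    static class SingleResetBase {
        protected SketchCharStream input;
        public void reset(Reader input) throws IOException { this.input = SketchCharReader.get(input); }
    }

    static class FixedTokenizer extends SingleResetBase {
        int bufferIndex = 7;
        @Override public void reset(Reader input) throws IOException {
            super.reset(input);
            bufferIndex = 0;
        }
    }

    /** Overloads are chosen at compile time, so the subclass override is skipped here. */
    static int staleIndexAfterCharStreamReset() {
        try {
            LegacyTokenizer t = new LegacyTokenizer();
            SketchCharStream cs = SketchCharReader.get(new StringReader("abc"));
            t.reset(cs); // binds to reset(SketchCharStream), not the overridden reset(Reader)
            return t.bufferIndex;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** With only reset(Reader) available, the same call reaches the override. */
    static int indexAfterFixedReset() {
        try {
            FixedTokenizer t = new FixedTokenizer();
            SketchCharStream cs = SketchCharReader.get(new StringReader("abc"));
            t.reset(cs);
            return t.bufferIndex;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("two overloads: bufferIndex = " + staleIndexAfterCharStreamReset());
        System.out.println("single reset:  bufferIndex = " + indexAfterFixedReset());
    }
}
```

In this sketch the first call leaves bufferIndex at 7 because the compiler picks the more specific overload from the static argument type, while the second resets it to 0. The sketch only illustrates the proposal in the description; it says nothing about the performance trade-off discussed in the comment above.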

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

