lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-1906) Problem with CharStream and Tokenizers with custom reset(Reader) method
Date Thu, 10 Sep 2009 17:43:57 GMT


Mark Miller commented on LUCENE-1906:

bq. A cache for what? I do not understand  The problem is the change of the member field.

Sorry - I meant to say reflection. I mean check if the instance just overrides reset(Reader),
and if so, call it with the CharReader from reset(Reader) (and cache the reflection results)
- perhaps it gets more complicated though - I havn't thought on it - just throwing it out.
I've glossed over the code involved but I'm still at a pretty high level with it, as I trust
there is enough people that already know what they are doing in this area. I was never the
kid in a group racing to have the math answer first ;) 

bq. Correct, this is always the problem with this change. The offsets are wrong if one legacy
Tokenizer does not correct the offset (either through Tokenizer.correctOffset() or CharStream.correctOffset()).

Okay - I took it as - this is 100% backwards compatible, except for this exception where it
is not. But if thats always been the case, it is back compat and it sounds fine to me - just
got caught up in the wording.

So if its good, lets ship it! It will take me a bit of time to ensure all the tests and package
up, and transfer over again (man does all that maven stuff make transfer take forever - too
many little files)

> Problem with CharStream and Tokenizers with custom reset(Reader) method
> -----------------------------------------------------------------------
>                 Key: LUCENE-1906
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Blocker
>             Fix For: 2.9
>         Attachments: backwards-break.patch, LUCENE-1906.patch
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious problem with backwards compatibility and other Tokenizers, that do
> not override reset(CharStream).
> The problem is, that e.g. CharTokenizer only overrides reset(Reader):
> {code}
>   public void reset(Reader input) throws IOException {
>     super.reset(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>   }
> {code}
> If you reset such a Tokenizer with another CharStream (not a Reader), this
> method will never be called and breaking the whole Tokenizer.
> As CharStream extends Reader, I propose to remove this reset(CharStream
> method) and simply do an instanceof check to detect if the supplied Reader
> is no CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and warps as needed the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
> As this instanceof check is always done in CharReader.get() why not remove
> ctor(CharStream) and reset(CharStream) completely?
> Any thoughts?
> I would like to fix this somehow before RC4, I'm, sorry :(

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message