lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
Date Tue, 20 Nov 2007 20:44:43 GMT


Michael McCandless commented on LUCENE-1063:

Looking at the test, this would not have worked before token-reuse either.... I don't yet
see how we are breaking backward compatibility.
Callers of next() could change the Token, so caching your own copy that you already passed
on to someone else was never valid.

You're right: before token reuse a filter could change the String
termText (and other fields) and mess up a cached copy held by a
TokenStream earlier in the chain.

But, our core filters now use the reuse API (for better performance),
so if you are using a TokenStream that does caching followed by one of
these core filters we will now mess up the cached copy, right?

Oh, duh: I just checked 2.2 and in fact the LowerCaseFilter,
PorterStemFilter, ISOLatin1AccentFilter all directly alter termText
rather than making a new token.

So actually this issue is pre-existing!

And then I guess we are not breaking backwards compatibility by
further propagating it.

But I think this is still a bug?

Hmm, I guess the semantics of the next() API is and has been to allow
you to arbitrarily modify the token after you receive it ("forwards
reuse") but not re-use the token on the next call to next ("backwards
reuse").  If we take that approach then the bug is in those
TokenStreams that cache tokens without "protecting" their private copy
when next() is called?

> Token re-use API breaks back compatibility in certain TokenStream chains
> ------------------------------------------------------------------------
>                 Key: LUCENE-1063
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.3
>         Attachments: LUCENE-1063.patch
> In scrutinizing the new Token re-use API during this thread:
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>      to change all aspects of the provided Token, meaning the caller
>      must do all persisting of Token that it needs before calling
>      next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>      Token however it wants.  Eg the LowerCaseFilter is allowed to
>      downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message