lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-1801) Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes() first
Date Thu, 13 Aug 2009 02:57:14 GMT


Michael Busch commented on LUCENE-1801:

Patch looks good, Uwe!

When I change Token.clear() to also set the offsets to 0 and the type to DEFAULT_TYPE,
'test-core', 'test-contrib' and 'test-tag' all still pass.
I think we could make that change and add a note to the backwards-compatibility section
of CHANGES.txt? I think it is the right behavior to reset everything in Tokenizer.

Also, the comment in Token.clear() suggests that the only reason offset and type are not cleared
is that tokenizers usually overwrite them anyway; so we're not changing the suggested behavior,
and I doubt that people are really relying on the fact that offsets and type are currently
not cleared.

So in summary, if we:
- change all tokenizers to call clearAttributes() first in incrementToken(),
- remove clear() from Attribute and leave it in AttributeImpl,
- change Token.clear() to reset all members and add a comment about that in CHANGES.txt,

then everything seems good. Or is there still a problem that I'm missing?
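To illustrate the pattern under discussion, here is a minimal self-contained sketch. The class names (ClearFirstSketch, TokenAttr, WhitespaceSource) are hypothetical stand-ins, not Lucene's actual classes; it only mimics the proposed contract that the token source resets all attribute state at the start of each incrementToken() call.

```java
public class ClearFirstSketch {
    // Hypothetical stand-in for an AttributeImpl such as Token.
    static class TokenAttr {
        String term = "";
        int startOffset, endOffset;
        int positionIncrement = 1;
        String type = "word";

        // Proposed Token.clear(): reset *all* members, including
        // offsets and type, to their defaults.
        void clear() {
            term = "";
            startOffset = 0;
            endOffset = 0;
            positionIncrement = 1;
            type = "word";
        }
    }

    // Hypothetical stand-in for a Tokenizer.
    static class WhitespaceSource {
        final TokenAttr attr = new TokenAttr();
        private final String[] words;
        private int i, pos;

        WhitespaceSource(String text) { words = text.split("\\s+"); }

        boolean incrementToken() {
            if (i >= words.length) return false;
            attr.clear();              // the point at issue: clear everything first
            attr.term = words[i];      // then set only what this source produces
            attr.startOffset = pos;
            attr.endOffset = pos + words[i].length();
            pos += words[i].length() + 1;
            i++;
            return true;
        }
    }

    public static void main(String[] args) {
        WhitespaceSource src = new WhitespaceSource("hello world");
        src.incrementToken();
        // Simulate a downstream filter leaving stale state behind:
        src.attr.positionIncrement = 0;
        src.attr.type = "synonym";
        src.incrementToken();
        // Because the source clears first, stale state does not leak
        // into the next token.
        System.out.println(src.attr.term + " "
            + src.attr.positionIncrement + " " + src.attr.type); // world 1 word
    }
}
```

If the source skipped the clear() call, the second token would still carry positionIncrement 0 and type "synonym" from the simulated filter.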

> Tokenizers (which are the source of Tokens) should call AttributeSource.clearAttributes()
> -----------------------------------------------------------------------------------------------
>                 Key: LUCENE-1801
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Task
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>         Attachments: LUCENE-1801.patch, LUCENE-1801.patch
> This is a followup for LUCENE-1796:
> {quote}
> Token.clear() used to be called by the consumer... but then it was switched to the producer
here: LUCENE-1101 
> I don't know if all of the Tokenizers in lucene were ever changed, but in any case it
looks like at least some of these bugs were introduced with the switch to the attribute API
- for example StandardTokenizer did clear its reusableToken... and now it doesn't.
> {quote}
> As an alternative to changing all core/contrib Tokenizers to call clearAttributes() first,
we could do this in the indexer, but that would add overhead for old token streams that already
clear their reusable token themselves. This issue should also update the Javadocs to clearly state
that the source TokenStream (normally the Tokenizer) should clear *all* attributes.
If it does not, and e.g. the positionIncrement is changed to 0 by some TokenFilter but
the filter does not change it back to 1, the TokenStream would stay at 0. If the TokenFilter
instead called PositionIncrementAttribute.clear() (because it is responsible), that could also break
the TokenStream, because clear() is a general method for the whole attribute instance. If
e.g. Token is used as the AttributeImpl, a call to clear() would also clear offsets and termLength,
which is not wanted. So the source of the tokenization should reset the attributes to their defaults.
> LUCENE-1796 removed the iterator creation cost, so clearAttributes() should run fast, but
it is still an additional cost during tokenization. Since clearing was not done consistently before,
this causes a small speed degradation, but that has nothing to do with the new TokenStream API.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
