lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <>
Subject [jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
Date Wed, 17 Jun 2009 05:56:07 GMT


Uwe Schindler commented on LUCENE-1693:

bq. is your patch backwards-compatible if a user has a TokenStream or -Filter which returns
a custom subclass of Token? And then another one in the chain casts to that subclass? Note
that Token is not final. Also not sure how common this scenario is, just came to my mind.

This is no problem; let me explain:
Returning something other than Token from next() only makes sense if the *direct* consumer
can handle it. A TokenStream that returns such tokens must therefore be wrapped by a TokenFilter
that knows how to handle them. If any other TokenFilter sat in between, there would be no
guarantee that it also returns the special Token. With this direct relation, the filter calls
the producer's method directly, without wrapping (because both implement the old API). At the
point where the indexer consumes the TokenFilter on top of the chain, the custom Token subclass
is no longer of interest and can safely be copied into a conventional Token (which is done
because nextToken != attributeInstance).
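The direct-pairing constraint described above can be sketched in isolation. Note these are hypothetical stand-in classes (Token, PosToken, PosTokenizer, PosFilter are illustrations, not Lucene's real API): a custom Token subclass is only safe to cast to when the consumer wraps its producer directly.

```java
// Minimal sketch of the old-API pairing constraint; hypothetical stand-ins,
// not Lucene's real classes.
class Token {
    String term;
    Token(String term) { this.term = term; }
    // The indexer can copy a custom subclass into a plain Token like this:
    Token toPlainToken() { return new Token(term); }
}

// A custom Token subclass carrying extra payload.
class PosToken extends Token {
    String partOfSpeech;
    PosToken(String term, String pos) { super(term); this.partOfSpeech = pos; }
}

// A producer that emits the custom subclass via the old next() API.
class PosTokenizer {
    private final String[][] data = { { "run", "VERB" } };
    private int i = 0;
    Token next() {
        return i < data.length ? new PosToken(data[i][0], data[i++][1]) : null;
    }
}

// A *direct* consumer of PosTokenizer may cast safely, because no other
// filter sits in between that could return a plain Token instead.
class PosFilter {
    private final PosTokenizer input;
    PosFilter(PosTokenizer input) { this.input = input; }
    String nextPos() {
        Token t = input.next();
        // Direct relation: the cast is safe here, and only here.
        return t == null ? null : ((PosToken) t).partOfSpeech;
    }
}
```

If any generic filter were inserted between PosTokenizer and PosFilter, it could return a plain Token and the cast would fail, which is exactly why the subclass must be consumed by its direct wrapper.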

bq. Also, can a user still use the AttributeFactory to use something else but Token?

As noted in my patch description: this is not possible. One can add additional attributes
and use them in one's chain (even with old filters in between, which only handle the Token
instance). TokenStream and TokenFilter create a Token instance in initialize() and call
addAttributeImpl() with it. After doing this, they check that all attributes are backed by
a Token subclass, call getAttribute(TermAttribute) and cast the result to Token (which should
then be the same instance).

One could change this behaviour by overriding initialize() in one of one's classes, but then
another TokenStream/Filter in the chain, when it initializes, will see that one of the instances
is *not* a Token and will throw a RuntimeException. I tried everything; to be able to handle
both paths (old -> new API, new -> old API), TokenStream and TokenFilter must have
a Token instance. In 3.0 or later this restriction can be removed, and we will only use the
factory to initialize the attributes.
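The consistency check described above can be sketched roughly like this. All names here (TermAttr, OffsetAttr, TokenImpl, AttributeHolder) are hypothetical simplifications, not the actual patch code: each stream registers one Token under all attribute interfaces and then verifies that every registered implementation is that same Token type, throwing a RuntimeException otherwise.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker interfaces standing in for the attribute interfaces.
interface TermAttr { }
interface OffsetAttr { }

// Stand-in for Token, which implements all attribute interfaces at once.
class TokenImpl implements TermAttr, OffsetAttr { }

class AttributeHolder {
    // interface -> implementation, as in the attribute map described above
    final Map<Class<?>, Object> attributes = new LinkedHashMap<>();

    // Rough equivalent of initialize(): register one Token instance for all
    // known interfaces, then verify nothing else snuck into the map.
    void initialize() {
        TokenImpl token = new TokenImpl();
        attributes.putIfAbsent(TermAttr.class, token);
        attributes.putIfAbsent(OffsetAttr.class, token);
        for (Object impl : attributes.values()) {
            if (!(impl instanceof TokenImpl)) {
                throw new RuntimeException(
                    "Attribute is not backed by a Token instance: " + impl.getClass());
            }
        }
    }
}
```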

In my opinion, this is not a problem, because one can still add custom attributes to the
chain and, best of all, mix old and new TokenStreams in one chain as one wants. The lost
flexibility in replacing the instances backing the token attributes is, in my opinion, not
important (and one instance initializes faster than five).
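The consumer-side fallback that makes this mixing possible can be sketched as follows. Stream, OldStyleStream, and Consumer are simplified stand-ins, not the actual Lucene 2.9 classes: the consumer tries incrementToken() once and, if the stream only implements the old API, falls back to next().

```java
// Simplified stand-ins for the dual-API consumption logic; not Lucene's code.
abstract class Stream {
    // New API: returns false at end of stream; old-only streams throw.
    boolean incrementToken() { throw new UnsupportedOperationException(); }
    // Old API: returns null at end of stream.
    String next() { return null; }
}

class OldStyleStream extends Stream {
    private final String[] terms = { "old", "api" };
    private int i = 0;
    @Override
    String next() { return i < terms.length ? terms[i++] : null; }
}

class Consumer {
    // Try the new API first; on UnsupportedOperationException,
    // drain the stream via the old API instead.
    static int countTokens(Stream s) {
        int n = 0;
        try {
            while (s.incrementToken()) n++;
        } catch (UnsupportedOperationException e) {
            while (s.next() != null) n++;
        }
        return n;
    }
}
```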

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>                 Key: LUCENE-1693
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1693.patch, lucene-1693.patch
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> Cloning performance can be greatly improved if multiple AttributeImpl
> instances are not used in one TokenStream. A user can, e.g., simply
> add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then
> serialization should benefit from the new API even more significantly
> than cloning does. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 
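The reflection walk described for addAttributeImpl(AttributeImpl) in the issue text above can be sketched in isolation. This is a hypothetical re-creation under simplified assumptions (Attribute, TermAttribute, OffsetAttribute, AttributeRegistry are stand-ins, not the patch's code, and only directly implemented interfaces are walked): climb the class hierarchy, collect every implemented interface that extends the Attribute marker, and map each one to the passed-in instance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker and attribute interfaces, standing in for Lucene's.
interface Attribute { }
interface TermAttribute extends Attribute { }
interface OffsetAttribute extends Attribute { }
interface Unrelated { } // must NOT be picked up by the walk

class CombinedAttributeImpl implements TermAttribute, OffsetAttribute, Unrelated { }

class AttributeRegistry {
    final Map<Class<? extends Attribute>, Object> map = new LinkedHashMap<>();

    // Walk up the class hierarchy of impl and register it under every
    // implemented interface that extends the Attribute marker interface.
    @SuppressWarnings("unchecked")
    void addAttributeImpl(Object impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (Attribute.class.isAssignableFrom(iface) && iface != Attribute.class) {
                    map.put((Class<? extends Attribute>) iface, impl);
                }
            }
        }
    }
}
```

Under this scheme a single object implementing several attribute interfaces ends up registered once per interface, which is what lets one Token-like instance back the whole attribute map.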

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

