lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements
Date Wed, 15 Jul 2009 06:35:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731295#action_12731295
] 

Uwe Schindler edited comment on LUCENE-1693 at 7/14/09 11:34 PM:
-----------------------------------------------------------------

bq. However, if people are using next() in a chain with core streams/filters, then every single
token will now be cloned, possibly multiple times, right?

That was a similar problem in 2.4, whenever you call next() to consume a stream, every token
is a new instance. Only cloning WAS not needed but its now needed for BW compatibility ("full
private copy").

In my opinion, this is neglectible, as indexing speed is important, and the indexer always
uses incrementToken(). If somebody written an own query parser that uses next() to consume
speed is not really important. And it is deprecated and in the docs (even in 2.4) stands:
"but will be slower than calling next(Token)".

There is one case, when it also affects you (during indexing). If you have an old-style tokenfilter
that calls next() on the next stream that is new-api, it would clone. In my opinion, the speed
is about the same like before:
- the 2.4 code created a new uninitialized token instance and the filter then filled it with
data, you have to initialize the char arrays and so on.
- here the token from the TokenWrapper-Attribute is reused (no allocation costs for arrays
and so on), but you have to clone the Token (to be full private).

bq. So if people are using streams/filters that implement next(Token) I think the performance
should be comparable - even though there's also a (hopefully small) performance hit to expect
because of more method calls and if checks.

I found no performance hit, it is about same speed. The varieties between tests is bigger
than a measureable performance impact. The other sensitive thing (TokenWrapper): The wrapping
using TokenWrapper was in the original indexing code, too (this BackwardsCompatibilityStream
private class).

The if checks are all on final variables, so can be optimized away by the JVM. The method
calls are inlined, as far as I have seen.

bq. It's failing right now, because it's testing a lot of different combinations. I need to
check if all of those different tests are actually valid, because we're saying you can't use
Tee/Sink with the new API anymore. 

Have you seen my backwards compatibility test, too? It is a copy of yours (with some variation)?
The Lucene24* classes were removed, because Tee/SinkTokenizer werde reverted to their original
2.4 status in the patch (only implement old API).
As far as I see (not yet tried out), you try to test new-style-API streams with the old Tee/Sink
tokenizer, that is deprecated. You were not able to do this before 2.9 (no new API) and so
the bw problem is not there. If you rewrite your streams with new API, you should use TeeSinkTokenizer,
too.

      was (Author: thetaphi):
    bq. However, if people are using next() in a chain with core streams/filters, then every
single token will now be cloned, possibly multiple times, right?

That was a similar problem in 2.4, whenever you call next() to consume a stream, every token
is a new instance. Only cloning WAS not needed but its now needed for BW compatibility ("full
private copy").

In my opinion, this is neglectible, as indexing speed is important, and the indexer always
uses incrementToken(). If somebody written an own query parser that uses next() to consume
speed is not really important. And it is deprecated and in the docs (even in 2.4) stands:
"but will be slower than calling next(Token)".

bq. So if people are using streams/filters that implement next(Token) I think the performance
should be comparable - even though there's also a (hopefully small) performance hit to expect
because of more method calls and if checks.

I found no performance hit, it is about same speed. The varieties between tests is bigger
than a measureable performance impact. The other sensitive thing (TokenWrapper): The wrapping
using TokenWrapper was in the original indexing code, too (this BackwardsCompatibilityStream
private class).

bq. It's failing right now, because it's testing a lot of different combinations. I need to
check if all of those different tests are actually valid, because we're saying you can't use
Tee/Sink with the new API anymore. 

Have you seen my backwards compatibility test, too? It is a copy of yours (with some variation)?
The Lucene24* classes were removed, because Tee/SinkTokenizer werde reverted to their original
2.4 status in the patch (only implement old API).
As far as I see (not yet tried out), you try to test new-style-API streams with the old Tee/Sink
tokenizer, that is deprecated. You were not able to do this before 2.9 (no new API) and so
the bw problem is not there. If you rewrite your streams with new API, you should use TeeSinkTokenizer,
too.
  
> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java,
TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> The cloning performance can be greatly improved if not multiple
> AttributeImpl instances are used in one TokenStream. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even significantly more from the new API
> than cloning. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message