lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Busch (JIRA)" <>
Subject [jira] Created: (LUCENE-1693) AttributeSource/TokenStream API improvements
Date Tue, 16 Jun 2009 09:47:07 GMT
AttributeSource/TokenStream API improvements

                 Key: LUCENE-1693
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Michael Busch
            Assignee: Michael Busch
            Priority: Minor
             Fix For: 2.9

This patch makes the following improvements to AttributeSource and

- removes the set/getUseNewAPI() methods (including the standard
  ones). Instead by default incrementToken() throws a subclass of
  UnsupportedOperationException. The indexer tries to call
  incrementToken() initially once to see if the exception is thrown;
  if so, it falls back to the old API.

- introduces interfaces for all Attributes. The corresponding
  implementations have the postfix 'Impl', e.g. TermAttribute and
  TermAttributeImpl. AttributeSource now has a factory for creating
  the Attribute instances; the default implementation looks for
  implementing classes with the postfix 'Impl'. Token now implements
  all 6 TokenAttribute interfaces.

- new method added to AttributeSource:
  addAttributeImpl(AttributeImpl). Using reflection it walks up in the
  class hierarchy of the passed in object and finds all interfaces
  that the class or superclasses implement and that extend the
  Attribute interface. It then adds the interface->instance mappings
  to the attribute map for each of the found interfaces.

- AttributeImpl now has a default implementation of toString that uses
  reflection to print out the values of the attributes in a default
  formatting. This makes it a bit easier to implement AttributeImpl,
  because toString() was declared abstract before.

- Cloning is now done much more efficiently in
  captureState. The method figures out which unique AttributeImpl
  instances are contained as values in the attributes map, because
  those are the ones that need to be cloned. It creates a single
  linked list that supports deep cloning (in the inner class
  AttributeSource.State). AttributeSource keeps track of when this
  state changes, i.e. whenever new attributes are added to the
  AttributeSource. Only in that case will captureState recompute the
  state, otherwise it will simply clone the precomputed state and
  return the clone. restoreState(AttributeSource.State) walks the
  linked list and uses the copyTo() method of AttributeImpl to copy
  all values over into the attribute that the source stream
  (e.g. SinkTokenizer) uses. 

The cloning performance can be greatly improved if not multiple
AttributeImpl instances are used in one TokenStream. A user can
e.g. simply add a Token instance to the stream instead of the individual
attributes. Or the user could implement a subclass of AttributeImpl that
implements exactly the Attribute interfaces needed. I think this
should be considered an expert API (addAttributeImpl), as this manual
optimization is only needed if cloning performance is crucial. I ran
some quick performance tests using Tee/Sink tokenizers (which do
cloning) and the performance was roughly 20% faster with the new
API. I'll run some more performance tests and post more numbers then.

Note also that when we add serialization to the Attributes, e.g. for
supporting storing serialized TokenStreams in the index, then the
serialization should benefit even significantly more from the new API
than cloning. 

Also, the TokenStream API does not change, except for the removal 
of the set/getUseNewAPI methods. So the patches in LUCENE-1460
should still work.

All core tests pass, however, I need to update all the documentation
and also add some unit tests for the new AttributeSource
functionality. So this patch is not ready to commit yet, but I wanted
to post it already for some feedback. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message