lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1422) New TokenStream API
Date Wed, 29 Oct 2008 11:51:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643472#action_12643472 ]

Michael McCandless commented on LUCENE-1422:
--------------------------------------------


I like this new "idempotent" approach!

I reviewed the patch.  It looks good!  Some comments:

  * It seems like AttributeSource is fairly generic -- maybe it should
    go into util?  We could eventually use the same approach for
    extensibility in the index writing/reading APIs?

  * Performance -- have you compared before/after simple tokenization
    speeds?  Eg, tokenize all docs in Wikipedia w/ different
    analyzers.  contrib/benchmark makes this easy now.  (See the
    timing sketch after this list.)

  * I assume we would statically default TokenStream.useNewAPI to
    true?

  * You are using a Token (perThreadlocalToken) and a
    BackwardsCompatibilityStream when indexing a NOT_ANALYZED Field,
    in DocInverterPerField -- can you fix that code to use the
    non-deprecated new API?  Eg maybe make a SingleTokenTokenStream
    or something that's re-used for this purpose?  (See the
    SingleTokenTokenStream sketch after this list.)

  * Once we can set useNewAPI per TokenStream instance, what happens
    if someone builds up an analyzer chain that mixes old & new
    TokenStream/Filters?  Are all permutations OK?

  * Are DocInverter.BackwardsCompatibilityStream and
    TokenStreamTestUtils.BackwardsCompatibleStream basically the same
    thing?  If so, can we merge them?  It seems like others may need
    this (when mixing old/new analyzer chains -- the question above).

  * Can you add back-compat-future-freedom warnings to these new APIs?
    (Ie "these are new & subject to change")

  * Do we really need the separate subclass TokenStreamState?  Should
    we absorb its methods into AttributeSource?

  * In DocInverterPerField you check if the attrSource has posIncr &
    offset attrs, instead of just looking them up and getting the
    default instance if it wasn't already there.  This then requires
    you to default posIncr & offsets in two places --
    DocInverterPerField & each of the attr classes.  Wouldn't it be
    cleaner to always look them up, and let the default instance of
    each be the one place that holds the default?  (See the lookup
    fragment after this list.)

  * Some files are missing the copyright header.  Also, the comments at
    the top of TermAttribute.java & TypeAttribute.java should be fixed
    (they are about pos incr now).  Also there is a "must should"
    in the javadocs in TokenStream.java.

  * On the new TokenStream.start method: is a TokenStream allowed to
    support only a single start invocation, meaning it cannot repeat
    itself?  Just like we are grappling with on DocIdSet.iterator(),
    it'd be good to nail down these semantics up front.

  * Many unit tests had to be changed with this patch.  Was this
    entirely due to avoiding the newly deprecated APIs, or are there
    any tests that fail if they are not changed?  (This is a simple
    check to run, actually -- only apply your core changes, then run
    all tests).  If it's the former (which it should be), maybe we
    should leave a few tests using the deprecated APIs to ensure we
    don't break them before 3.0?
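
To be concrete on the performance question: a rough hand-rolled
timing sketch like the one below is all I mean (class name and sample
docs are placeholders; a real run would use contrib/benchmark over
the Wikipedia dump).  Run it once against the old Token-based API,
as written here, and once against the new attribute API:

    // Rough timing sketch only -- stand-in for a real
    // contrib/benchmark run over Wikipedia.  Uses the old (soon to be
    // deprecated) Token-based API; the comparison point is this same
    // loop rewritten against the new attribute API.
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenizeSpeed {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // placeholder docs; a real test would stream the Wikipedia dump
        String[] docs = new String[10000];
        for (int i = 0; i < docs.length; i++) {
          docs[i] = "now is the time for all good men to come to the aid";
        }
        long tokenCount = 0;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < docs.length; i++) {
          TokenStream ts = analyzer.tokenStream("body", new StringReader(docs[i]));
          final Token reusable = new Token();
          // count tokens; next(Token) re-uses the passed-in instance
          while (ts.next(reusable) != null) {
            tokenCount++;
          }
          ts.close();
        }
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println(tokenCount + " tokens in " + elapsed + " ms");
      }
    }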
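
For the NOT_ANALYZED case, here's roughly the kind of re-usable
stream I have in mind.  A sketch only -- it assumes the draft API
from the patch (addAttribute, start, nextToken), so the exact
names/signatures may not match what we finally commit:

    // Sketch only: a reusable stream that emits exactly one token per
    // field value, for NOT_ANALYZED fields.  Method names follow the
    // draft API in the patch and may change.
    public final class SingleTokenTokenStream extends TokenStream {
      private final TermAttribute termAtt =
          (TermAttribute) addAttribute(TermAttribute.class);
      private String value;
      private boolean done;

      // DocInverterPerField would re-use one instance per thread and
      // just set the next un-analyzed field value here
      public void setValue(String value) {
        this.value = value;
        this.done = false;
      }

      public void start() {
        done = false;
      }

      public boolean nextToken() {
        if (done) {
          return false;
        }
        // expose the single value as the term, then signal exhaustion
        termAtt.setTermBuffer(value);
        done = true;
        return true;
      }
    }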
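
And on the DocInverterPerField defaults question, I mean something
like this fragment (again assuming the draft lookup API, where
addAttribute returns the already-registered instance, or creates one
that carries the defaults):

    // Fragment only, inside DocInverterPerField's per-field setup:
    // always look the attributes up.  If the producer never added
    // them, the freshly created instances carry the defaults
    // (posIncr == 1, offsets == 0), so no second copy of the defaults
    // needs to live in DocInverterPerField.
    PositionIncrementAttribute posIncrAtt = (PositionIncrementAttribute)
        attrSource.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAtt = (OffsetAttribute)
        attrSource.addAttribute(OffsetAttribute.class);

    // per token, just read -- no hasAttribute checks needed:
    int posIncr = posIncrAtt.getPositionIncrement();
    int startOffset = offsetAtt.startOffset();
    int endOffset = offsetAtt.endOffset();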


> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422.patch, lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e.g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough that someone could
> implement a custom caching filter that knows all attributes
> the token can have; then the caching should be just as
> fast as with the old API.
> This patch is not nearly ready; there are lots of things
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashCode and equals methods missing in Attributes and
>   AttributedToken
>   
> I wanted to submit it already for brave people to give me 
> early feedback before I spend more time working on this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

