lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1422) New TokenStream API
Date Thu, 13 Nov 2008 17:48:46 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647346#action_12647346
] 

Michael Busch commented on LUCENE-1422:
---------------------------------------

{quote}
Looks good! 
{quote}
Thanks for reviewing ... again!

{quote}
I'm a little confused by 'start' and 'initialize' in TokenStream.
{quote}
Good question. The only reason why I added two separate methods here was to enforce that TokenFilter#start()
always calls input.start() first. I think otherwise this will be a pitfall for people who
want to implement their own filters and override start(). (happened to me when I modified
the tests a couple of times)
On the other hand I agree that Three methods (start(), initialize(), reset()) are confusing.
I guess we can deprecate reset() in the future and have start() rewind the stream if supported
by the stream. (I think you mentioned a boolean method could return true if rewind is supported?)
So I'd be ok with collapsing start/initialize into one method if you prefer that. I'd just
have to add an explicit warning to the javadocs in TokenFilter.

{quote}
I'm seeing this failure (in test I just committed this AM). I think
it's OK, because the new API is enabled for all tests and I'm using
the old API with that analyzer?
{quote}
Yeah, I've to modify the new test to use the new API.

{quote}
- BackCompatTokenStream is calling attributes.put directly but all
others use super's addAttribute.
- Why is BackCompatTokenStream overriding so many methods? EG
has/get/addAttribute - won't super do the same thing?
{quote}
I had a different implementation of BCTS, so this is left-over. Of course you're right, I
can simplify that class.

{quote}
Maybe add reasons to some of the asserts, eg StopFilter has
"assert termAtt != null", so maybe append to that ": initialize()
wasn't called".
{quote}
Yeah good idea, will do.

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422-take4.patch, lucene-1422-take5.patch, lucene-1422.patch,
lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e. g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read of modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses itself also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add less attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough so that someone could
> implement a custom caching filter that knows all attributes
> the token can have, then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready, there are lot's of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   commiting to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashcode and equals methods missing in Attributes and 
>   AttributedToken
>   
> I wanted to submit it already for brave people to give me 
> early feedback before I spend more time working on this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message