Message-ID: <898114378.1224622244325.JavaMail.jira@brutus>
Date: Tue, 21 Oct 2008 13:50:44 -0700 (PDT)
From: "Michael Busch (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1422) New TokenStream API
In-Reply-To: <693332296.1223974185363.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641583#action_12641583 ]

Michael Busch commented on LUCENE-1422:
---------------------------------------

{quote}
Strictly speaking, it does break backward compatibility.
{quote}

Yes, I agree. But my take here is that the package-private methods are expert methods for which we don't have to guarantee backwards compatibility the same way we do for public and protected APIs (i.e. only break compatibility in a version change X.Y->(X+1).0). Of course, we have to update all contribs.

> New TokenStream API
> -------------------
>
>                 Key: LUCENE-1422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1422
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1422.patch, lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
>
>
> This is a very early version of the new TokenStream API that
> we started to discuss here:
> http://www.gossamer-threads.com/lists/lucene/java-dev/66227
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic
> from Token. In addition, it has a lazily-initialized map of
> Class -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e.g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The
> tokenizer and also all filters can add Attributes to the
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
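
The attribute mechanism described in the quoted issue text above can be illustrated with a minimal sketch. It is not the code from the attached patches. The class names AttributedToken, Attribute and PositionIncrementAttribute and the methods prototypeToken() and nextToken() come from the issue text; everything else (the addAttribute() signature, ToyTokenizer, ToyStopFilter, the consume() helper, and the use of Java 8 features) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Illustrative sketch only: names follow the issue text, bodies are assumptions.
class TokenStreamSketch {
    static abstract class Attribute {}

    static class PositionIncrementAttribute extends Attribute {
        int positionIncrement = 1;
    }

    static class AttributedToken {
        String term = ""; // stands in for the termBuffer logic
        private Map<Class<? extends Attribute>, Attribute> attributes; // created on first use

        // Hypothetical method: returns the attribute of this type, adding it if absent.
        <T extends Attribute> T addAttribute(Class<T> type, Supplier<T> factory) {
            if (attributes == null) {
                attributes = new HashMap<>();
            }
            Attribute attribute = attributes.get(type);
            if (attribute == null) {
                attribute = factory.get();
                attributes.put(type, attribute);
            }
            return type.cast(attribute); // the only cast, done once at setup time
        }
    }

    static abstract class TokenStream {
        abstract AttributedToken prototypeToken(); // must be called before nextToken()
        abstract boolean nextToken();
    }

    // Toy tokenizer (hypothetical): owns the single token instance that is reused.
    static class ToyTokenizer extends TokenStream {
        private final AttributedToken token = new AttributedToken();
        private final String[] words;
        private int next = 0;

        ToyTokenizer(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override AttributedToken prototypeToken() { return token; }

        @Override boolean nextToken() {
            if (next >= words.length) return false;
            token.term = words[next++];
            return true;
        }
    }

    // Toy filter (hypothetical): drops stop words and records the gap they leave.
    static class ToyStopFilter extends TokenStream {
        private final TokenStream input;
        private final Set<String> stopWords;
        private AttributedToken token;
        private PositionIncrementAttribute posIncr; // typed reference kept by the filter

        ToyStopFilter(TokenStream input, Set<String> stopWords) {
            this.input = input;
            this.stopWords = stopWords;
        }

        @Override AttributedToken prototypeToken() {
            token = input.prototypeToken();
            posIncr = token.addAttribute(PositionIncrementAttribute.class,
                                         PositionIncrementAttribute::new);
            return token;
        }

        @Override boolean nextToken() {
            int skipped = 0;
            while (input.nextToken()) {
                if (!stopWords.contains(token.term)) {
                    posIncr.positionIncrement = 1 + skipped;
                    return true;
                }
                skipped++;
            }
            return false;
        }
    }

    // The consumer asks for the prototype token once, then only calls nextToken().
    static List<String> consume(TokenStream stream) {
        AttributedToken token = stream.prototypeToken();
        PositionIncrementAttribute posIncr = token.addAttribute(
                PositionIncrementAttribute.class, PositionIncrementAttribute::new);
        List<String> out = new ArrayList<>();
        while (stream.nextToken()) {
            out.add(token.term + "+" + posIncr.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        TokenStream stream = new ToyStopFilter(
                new ToyTokenizer("the quick fox"), Collections.singleton("the"));
        System.out.println(consume(stream)); // prints [quick+2, fox+1]
    }
}
```

In this sketch the filter and the consumer both obtain the same PositionIncrementAttribute instance during setup, so the per-token loop reads and writes plain fields without any lookup or cast.
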
> The reason to not merge Token and TokenStream into one class
> is that we might have caching (or tee/sink) filters in the
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a
> copyFrom(AttributedToken) method, which is needed for the
> caching. Both methods have to iterate over the list of
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to
> down-cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also, the new API is flexible enough that someone could
> implement a custom caching filter that knows all attributes
> the token can have; then the caching should be just as
> fast as with the old API.
> This patch is not nearly ready; there are lots of things
> missing:
> - unit tests
> - change DocumentsWriter to use new API
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use
>   new API
> - javadocs incorrect or missing
> - hashcode and equals methods missing in Attributes and
>   AttributedToken
>
> I wanted to submit it already for brave people to give me
> early feedback before I spend more time working on this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
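
The caching behaviour described in the quoted issue text (a deep clone to store tokens, and copyFrom to restore them, with a down-cast inside each attribute) can be sketched roughly as follows. This is an assumption-laden illustration, not the NewCachingTokenStream from the patch: the deepClone() and cacheAndReplay() names, the public attributes map, and the field layout are all hypothetical. Only AttributedToken, Attribute, PositionIncrementAttribute and the two copyFrom methods are named in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names follow the issue text, bodies are assumptions.
class CachingSketch {
    static abstract class Attribute implements Cloneable {
        abstract void copyFrom(Attribute other);

        @Override
        public Attribute clone() {
            try {
                return (Attribute) super.clone(); // field-wise copy is enough for this sketch
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    static class PositionIncrementAttribute extends Attribute {
        int positionIncrement = 1;

        @Override
        void copyFrom(Attribute other) {
            // The down-cast to the actual type that the issue text mentions.
            PositionIncrementAttribute source = (PositionIncrementAttribute) other;
            positionIncrement = source.positionIncrement;
        }
    }

    static class AttributedToken {
        String term = "";
        final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

        // Deep clone: iterates over all attributes and clones each one.
        AttributedToken deepClone() {
            AttributedToken copy = new AttributedToken();
            copy.term = term;
            for (Map.Entry<Class<? extends Attribute>, Attribute> e : attributes.entrySet()) {
                copy.attributes.put(e.getKey(), e.getValue().clone());
            }
            return copy;
        }

        // Copies state into the existing attribute instances so that references
        // held by filters and consumers stay valid.
        void copyFrom(AttributedToken other) {
            term = other.term;
            for (Map.Entry<Class<? extends Attribute>, Attribute> e : other.attributes.entrySet()) {
                Attribute mine = attributes.get(e.getKey());
                if (mine == null) {
                    attributes.put(e.getKey(), e.getValue().clone());
                } else {
                    mine.copyFrom(e.getValue());
                }
            }
        }
    }

    // Simulates a stream that reuses one token, caches clones, then replays them.
    static List<String> cacheAndReplay(String[] terms) {
        AttributedToken shared = new AttributedToken();
        PositionIncrementAttribute posIncr = new PositionIncrementAttribute();
        shared.attributes.put(PositionIncrementAttribute.class, posIncr);

        List<AttributedToken> cache = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            shared.term = terms[i];
            posIncr.positionIncrement = i + 1;
            cache.add(shared.deepClone()); // snapshot, independent of later mutation
        }

        List<String> replayed = new ArrayList<>();
        for (AttributedToken cached : cache) {
            shared.copyFrom(cached); // restores state into the shared token
            replayed.add(shared.term + "+" + posIncr.positionIncrement);
        }
        return replayed;
    }

    public static void main(String[] args) {
        // prints [quick+1, fox+2]
        System.out.println(cacheAndReplay(new String[] {"quick", "fox"}));
    }
}
```

Both deepClone() and copyFrom() walk every attribute on the token, which is consistent with the issue's observation that caching cost grows with the number of attributes attached.
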