Message-ID: <1123816533.1250832854886.JavaMail.jira@brutus>
Date: Thu, 20 Aug 2009 22:34:14 -0700 (PDT)
From: "Uwe Schindler (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
In-Reply-To: <537690999.1250789895167.JavaMail.jira@brutus>

[
https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745803#action_12745803 ]

Uwe Schindler commented on LUCENE-1826:
---------------------------------------

bq. without the Tokenizer.reset(Reader, AttributeSource), i won't be able to reuse Tokenizer instances (will have to create a fresh one each time)

This is not possible by design. The AttributeSource cannot be changed: it is fixed when the object is constructed, which is why it appears in the ctor and nowhere else. For filters, the attributes come from the input token stream.

bq. Is the reflection penalty on the new TokenStream stuff incurred per root AttributeSource?, or per TokenFilter/TokenStream?

The reflection penalty is paid once per class, because there is a static cache of "known" classes, so every AttributeImpl class is inspected only once, when a new AttributeSource such as a TokenStream is created. There is an additional reflection cost when new attributes are added, but that is also paid only once per AttributeImpl class. Since the last changes to TokenStream, reflection is therefore no longer a penalty. The only remaining cost is the extra work of constructing a TokenStream (filling the LinkedHashMaps), which is why you should reuse TokenStream chains.

bq. that is, if i pass the same AttributeSource to 10 TokenStreams, is the reflection cost the same as if i passed it to just one?

No change! The reflection cost is the same either way.

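To illustrate the "one-time per class" point, here is a minimal sketch of a static per-class cache. It is not Lucene's code: `ReflectionCacheSketch`, `interfacesOf`, `TermImpl`, `TokenImpl` and the counter are hypothetical names made up for this example, and the real cache may be organised quite differently. It only shows why repeated use of the same implementation class does not repeat the reflective inspection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model only; every name here is hypothetical, not a Lucene class.
final class ReflectionCacheSketch {

    // Stand-ins for attribute interfaces and their implementation classes.
    interface TermAttr {}
    interface OffsetAttr {}
    static final class TermImpl implements TermAttr {}
    static final class TokenImpl implements TermAttr, OffsetAttr {}

    // Static cache of "known" classes: implementation class -> interfaces it provides.
    private static final Map<Class<?>, List<Class<?>>> KNOWN = new ConcurrentHashMap<>();

    // Counts how often the reflective inspection really runs.
    static final AtomicInteger INSPECTIONS = new AtomicInteger();

    // Returns the cached interface list, inspecting the class only on first use.
    static List<Class<?>> interfacesOf(Class<?> implClass) {
        return KNOWN.computeIfAbsent(implClass, c -> {
            INSPECTIONS.incrementAndGet();
            List<Class<?>> found = new ArrayList<>();
            for (Class<?> iface : c.getInterfaces()) {
                found.add(iface);
            }
            return found;
        });
    }

    public static void main(String[] args) {
        // Simulate creating ten "streams" that each use the same two impl classes.
        for (int i = 0; i < 10; i++) {
            interfacesOf(TermImpl.class);
            interfacesOf(TokenImpl.class);
        }
        // prints: inspections: 2
        System.out.println("inspections: " + INSPECTIONS.get());
    }
}
```

In this model the remaining per-instance work is whatever each new object does with the cached result, which mirrors the advice above to reuse TokenStream chains rather than rebuild them.
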
> All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1826
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>
> I have a TokenStream implementation that joins together multiple sub TokenStreams (i then do additional filtering on top of this, so i can't just have the indexer do the merging)
> in 2.4, this worked fine.
> once one sub stream was exhausted, i just started using the next stream
> however, in 2.9, this is very difficult, and requires copying Term buffers for every token being aggregated
> however, if all the sub TokenStreams share the same AttributeSource, and my "concat" TokenStream shares the same AttributeSource, this goes back to being very simple (and very efficient)
> So for example, i would like to see the following constructor added to StandardTokenizer:
> {code}
> public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
>   super(source);
>   ...
> }
> {code}
> would likewise want similar constructors added to all Tokenizer sub classes provided by lucene
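
The "concat" idea in the issue description can be sketched with a simplified model. This is not Lucene's API: `SharedState`, `Stream`, `WhitespaceStream`, `ConcatStream` and `collect` are hypothetical stand-ins invented for the example. The assumption is that every sub-stream and the concatenating stream are handed the same state object at construction, so moving to the next sub-stream needs no copying of term buffers between streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative model only; hypothetical stand-ins, not Lucene's AttributeSource/TokenStream.
final class SharedStateConcatSketch {

    // Stand-in for a shared attribute source: one reusable term buffer.
    static final class SharedState {
        final StringBuilder term = new StringBuilder();
    }

    // Minimal stream contract: advance to the next token, or return false when exhausted.
    interface Stream {
        boolean incrementToken();
    }

    // Sub-stream that splits on whitespace and writes each token into the shared buffer.
    static final class WhitespaceStream implements Stream {
        private final SharedState state;
        private final Iterator<String> words;

        WhitespaceStream(SharedState state, String input) {
            this.state = state; // fixed at construction, never swapped later
            String trimmed = input.trim();
            this.words = trimmed.isEmpty()
                    ? Collections.<String>emptyIterator()
                    : Arrays.asList(trimmed.split("\\s+")).iterator();
        }

        @Override
        public boolean incrementToken() {
            if (!words.hasNext()) {
                return false;
            }
            state.term.setLength(0);
            state.term.append(words.next());
            return true;
        }
    }

    // Concatenation: when one sub-stream is exhausted, continue with the next one.
    static final class ConcatStream implements Stream {
        private final Iterator<Stream> subs;
        private Stream current;

        ConcatStream(List<Stream> subStreams) {
            this.subs = subStreams.iterator();
        }

        @Override
        public boolean incrementToken() {
            while (true) {
                if (current != null && current.incrementToken()) {
                    return true; // the token is already in the shared state
                }
                if (!subs.hasNext()) {
                    return false;
                }
                current = subs.next();
            }
        }
    }

    // Demo helper: snapshots each token as a String only so the result can be inspected.
    static List<String> collect(String... inputs) {
        SharedState state = new SharedState();
        List<Stream> subs = new ArrayList<>();
        for (String in : inputs) {
            subs.add(new WhitespaceStream(state, in));
        }
        Stream concat = new ConcatStream(subs);
        List<String> out = new ArrayList<>();
        while (concat.incrementToken()) {
            out.add(state.term.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: [quick, brown, fox]
        System.out.println(collect("quick brown", "", "fox"));
    }
}
```

Because the consumer reads the same `SharedState` that the sub-streams write to, `ConcatStream` itself never touches the term text; it only decides which sub-stream advances next.
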