From: "Yonik Seeley"
To: java-dev@lucene.apache.org
Date: Fri, 20 Jul 2007 14:52:42 -0400
Subject: Re: Token termBuffer issues
In-Reply-To: <1184891270.16597.1201090809@webmail.messagingengine.com>

On 7/19/07, Michael McCandless wrote:
> "Yonik Seeley" wrote:
> > I had previously missed the changes to Token that add support for
> > using an array (termBuffer):
> >
> > +  // For better indexing speed, use termBuffer (and
> > +  // termBufferOffset/termBufferLength) instead of termText
> > +  // to save new'ing a String per token
> > +  char[] termBuffer;
> > +  int termBufferOffset;
> > +  int termBufferLength;
> >
> > While I think this approach would have been best to start off with
> > rather than String,
> > I'm concerned that it will do little more than add overhead at this
> > point, resulting in slower code, not faster.
> >
> > - If any tokenizer or token filter tries setting the termBuffer, any
> > downstream components would need to check for both.
> > It could be made backward compatible by constructing a string on
> > demand, but that will really slow things down, unless the whole
> > chain is converted to only using the char[] somehow.
>
> Good point: if your analyzer/tokenizer produces char[] tokens then
> your downstream filters would have to accept char[] tokens.
>
> I think on-demand constructing a String (and saving it as termText)
> would be an OK solution? Why would that be slower than having to make
> a String in the first place (if we didn't have the char[] API)? It's
> at least graceful degradation.

It's the rule rather than the exception though. Pretty much
everything is based on String.

> > - It doesn't look like the indexing code currently pays any attention
> > to the char[], right?
>
> It does, in DocumentsWriter.addPosition().

Ah, thanks.

> > - What if both the String and char[] are set? A filter that doesn't
> > know better sets the String... this doesn't clear the char[]
> > currently, should it?
>
> Currently the char[] wins, but good point: seems like each setter
> should null out the other one?

Certainly the String setter should null the char[] (that's the only
way to keep back compatibility), and probably vice-versa.

Note that there are many existing filters that directly access and
manipulate the package protected String termText. These will need to
be changed.

-Yonik
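
To make the setter discussion concrete, here is a minimal sketch of the
"each setter nulls out the other" idea, combined with building the String
on demand. It is only an illustration under stated assumptions:
SketchToken, setTermText, setTermBuffer, termText() and hasTermBuffer()
are hypothetical names, not the actual Token API, and only the four field
names come from the patch quoted above.

```java
// Hypothetical sketch; SketchToken is NOT Lucene's Token class.
final class SketchToken {
    private String termText;       // legacy String representation
    private char[] termBuffer;     // newer char[] representation
    private int termBufferOffset;
    private int termBufferLength;

    /** String setter: clears the char[] so a stale buffer cannot "win". */
    void setTermText(String text) {
        termText = text;
        termBuffer = null;
        termBufferOffset = 0;
        termBufferLength = 0;
    }

    /** char[] setter: clears the cached String for the same reason. */
    void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;
    }

    /** Builds the String on demand and caches it for later callers. */
    String termText() {
        if (termText == null && termBuffer != null) {
            termText = new String(termBuffer, termBufferOffset, termBufferLength);
        }
        return termText;
    }

    /** True while a char[] representation is set. */
    boolean hasTermBuffer() {
        return termBuffer != null;
    }
}
```

With this shape, a String-only filter downstream of a char[] tokenizer
still works, but pays for one String per token, which is the overhead
concern raised above. The cached String also goes stale if a caller
mutates the array in place without calling the setter again, and filters
that touch the termText field directly would bypass the setters entirely.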