Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
Message-ID: <1919387115.1203615802007.JavaMail.jira@brutus>
Date: Thu, 21 Feb 2008 09:43:22 -0800 (PST)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1181) Token reuse is not ideal for avoiding array copies
In-Reply-To: <682434614.1203384154542.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[ https://issues.apache.org/jira/browse/LUCENE-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12571112#action_12571112 ]

Michael McCandless commented on LUCENE-1181:
--------------------------------------------

{quote}
1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the original term text even though it's about to be overwritten.
{quote}

True, although the cost should be negligible in practice, since that copy only occurs if the term buffer isn't already big enough. Only a very small number of reallocations should occur when a single Token is shared.

{quote}
2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually setting the term buffer.
{quote}

I thought about holding a reference to what was passed in, but it made me nervous because 1) this may cause a lot of excess reallocations (i.e. if you keep setting a smaller buffer than downstream filters need), and 2) it makes things sneakier, since filters downstream are allowed to alter that buffer directly. It felt safer to have a single "private" buffer in Token.

Maybe one possible workaround is to use two Tokens (one temp, one real) and swap which one you are working on for your Unicode normalization on every token?

> Token reuse is not ideal for avoiding array copies
> --------------------------------------------------
>
>                 Key: LUCENE-1181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1181
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Trejkaz
>
> The way the Token API is currently written results in two unnecessary array copies which could be avoided by changing the way it works.
> 1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the original term text even though it's about to be overwritten.
> #1 should be trivially fixable by introducing a private resizeTermBuffer(int,boolean) where the new boolean parameter specifies whether the existing term data gets copied over or not.
> 2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually setting the term buffer.
> Setting aside the fact that the setTermBuffer method is misleadingly named, consider a token filter which performs Unicode normalisation on each token.
> How it has to be implemented at present:
> once:
>   - create a reusable char[] for storing the normalisation result
> every token:
>   - use getTermBuffer() and getTermLength() to get the buffer and relevant length
>   - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer size.)
>   - setTermBuffer(char[],int,int) - this does an extra copy.
> The following sequence would be much better:
> once:
>   - create a reusable char[] for storing the normalisation result
> every token:
>   - use getTermBuffer() and getTermLength() to get the buffer and relevant length
>   - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer size.)
>   - setTermBuffer(char[],int,int) sets our buffer by reference
>   - set the term buffer which used to be in the Token such that it becomes our new temp buffer.
> The latter sequence results in no copying with the exception of the normalisation itself, which is unavoidable.
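
For illustration only, here is a minimal Java sketch of the two ideas discussed in this thread: a resize that can skip copying the old term text, and a by-reference buffer swap between a filter and its token. SketchToken, NormalizingFilterSketch, and swapTermBuffer are hypothetical names made up for this sketch; they are not Lucene's Token API, and the growth policy is an assumption. It also goes through java.text.Normalizer, which allocates a String, so it only shows the buffer hand-off, not a fully copy-free normaliser.

```java
import java.nio.CharBuffer;
import java.text.Normalizer;

/** Hypothetical stand-in for a reusable token; NOT org.apache.lucene.analysis.Token. */
final class SketchToken {
    private char[] buffer = new char[10];
    private int length;

    char[] termBuffer() { return buffer; }
    int termLength() { return length; }

    /** Copying setter: the old text is about to be overwritten, so it is not preserved on growth. */
    void setTermBuffer(char[] src, int offset, int len) {
        resize(len, false);
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    /** Grows the buffer if needed; existing chars are copied over only when preserve is true. */
    private void resize(int newSize, boolean preserve) {
        if (buffer.length >= newSize) {
            return;
        }
        char[] bigger = new char[Math.max(newSize, buffer.length * 2)];
        if (preserve) {
            System.arraycopy(buffer, 0, bigger, 0, length);
        }
        buffer = bigger;
    }

    /** By-reference setter (hypothetical): adopts the caller's array and hands back the old one. */
    char[] swapTermBuffer(char[] replacement, int len) {
        char[] old = buffer;
        buffer = replacement;
        length = len;
        return old;
    }
}

/** Hypothetical filter that normalises each term to NFC and swaps buffers with the token. */
final class NormalizingFilterSketch {
    private char[] scratch = new char[10];

    void normalize(SketchToken token) {
        // Normalizer takes a CharSequence and returns a String (one allocation per token).
        String normalized = Normalizer.normalize(
                CharBuffer.wrap(token.termBuffer(), 0, token.termLength()),
                Normalizer.Form.NFC);
        int len = normalized.length();
        if (scratch.length < len) {
            scratch = new char[len];
        }
        normalized.getChars(0, len, scratch, 0);
        // The token adopts scratch; its previous buffer becomes the next scratch array.
        scratch = token.swapTermBuffer(scratch, len);
    }
}
```

Note that the swap illustrates both concerns raised in the comment above: the token can end up holding a smaller array than downstream filters need, and the filter keeps a reference to an array the token used to own.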