From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Date: Tue, 20 Nov 2007 11:51:43 -0800 (PST)
Message-ID: <20336622.1195588303053.JavaMail.jira@brutus>
In-Reply-To: <3525244.1195570123149.JavaMail.jira@brutus>
Subject: [jira] Updated: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

     [ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1063:
---------------------------------------

    Attachment: LUCENE-1063.patch

Attached patch w/ unit test showing the issue, plus the fix.

The fix was actually simpler than I thought: we don't have to make a new
Token(); instead we just have to copy over the fields to the Token that was
passed in. So the performance hit is less than I thought it'd be (a copy
instead of new/GC).

I also strengthened the javadocs on the reuse & non-reuse APIs.

All tests pass.

> Token re-use API breaks back compatibility in certain TokenStream chains
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-1063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1063
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>         Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
>
> I realized we now have a back-compatibility break when mixing re-use and
> non-re-use TokenStreams.
>
> The new "reuse" next(Token) API actually allows two different aspects of
> re-use:
>
> 1) "Backwards re-use": the subsequent call to next(Token) is allowed to
>    change all aspects of the provided Token, meaning the caller must do
>    all persisting of the Token that it needs before calling next(Token)
>    again.
>
> 2) "Forwards re-use": the caller is allowed to modify the returned Token
>    however it wants, e.g. the LowerCaseFilter is allowed to downcase the
>    characters in place in the char[] termBuffer.
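For illustration, here is a minimal sketch of a filter that relies on the
"forwards re-use" in 2): it downcases the returned token's term buffer in
place. The class name is invented for this example, and it assumes the
upstream stream keeps its term in the char[] termBuffer as the 2.3-era
reuse-oriented tokenizers do:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical filter, for illustration only: it pulls tokens through
    // the reuse next(Token) API and modifies the returned Token in place,
    // which is only safe if the producer did not hand back a cached copy.
    public class SketchLowerCaseFilter extends TokenFilter {

      public SketchLowerCaseFilter(TokenStream input) {
        super(input);
      }

      public Token next(Token result) throws IOException {
        Token t = input.next(result);
        if (t == null) {
          return null;
        }
        // "Forwards re-use": downcase directly in the char[] termBuffer.
        char[] buffer = t.termBuffer();
        int length = t.termLength();
        for (int i = 0; i < length; i++) {
          buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return t;
      }
    }

If the stream feeding such a filter is a "non-reuse" tokenizer that returns
a cached Token, the in-place downcasing silently corrupts that cache, which
is exactly the break described next.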
> The forwards re-use case can break backwards compatibility now. E.g.: if
> a TokenStream X providing only the "non-reuse" next() API is followed by
> a TokenFilter Y using the "reuse" next(Token) API to pull the tokens,
> then the default implementation in TokenStream.java for next(Token) will
> kick in.
>
> That default implementation just returns the provided "private copy"
> Token returned by next(). But, because of 2) above, this is not legal: if
> the TokenFilter Y modifies the char[] termBuffer (say), it is actually
> modifying the cached copy potentially being stored by X.
>
> I think the opposite case is handled correctly.
>
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method in
> TokenStream. The downside is a small performance hit; however, that hit
> only happens at the boundary between a non-reuse and a re-use tokenizer.
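A rough sketch of what that copy-over approach could look like (this is not
the attached patch; the class name is invented and the exact set of copied
fields is an assumption based on the 2.3-era Token API):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;

    // Hypothetical bridge between the non-reuse and reuse APIs, for
    // illustration only: instead of handing back the producer's private
    // Token, copy its fields into the caller-supplied Token so that
    // downstream in-place modification cannot corrupt any cached copy.
    public abstract class SketchTokenStream {

      // Non-reuse API: subclasses return a "private copy" Token.
      public abstract Token next() throws IOException;

      // Reuse API: default implementation bridges to next() by copying
      // fields rather than returning the private Token itself.
      public Token next(Token result) throws IOException {
        Token t = next();
        if (t == null) {
          return null;
        }
        // Field list assumed from the 2.3-era Token class.
        result.setTermBuffer(t.termBuffer(), 0, t.termLength());
        result.setStartOffset(t.startOffset());
        result.setEndOffset(t.endOffset());
        result.setType(t.type());
        result.setPositionIncrement(t.getPositionIncrement());
        result.setPayload(t.getPayload());
        return result;
      }
    }

The cost is one field-by-field copy per token, and it is paid only where a
re-use consumer sits directly on top of a non-reuse producer.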