Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
Message-ID: <1277987536.1252607097493.JavaMail.jira@brutus>
Date: Thu, 10 Sep 2009 11:24:57 -0700 (PDT)
From: "Uwe Schindler (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1906) Problem with CharStream and Tokenizers with custom reset(Reader) method
In-Reply-To: <361159201.1252596839904.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[
https://issues.apache.org/jira/browse/LUCENE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753750#action_12753750 ]

Uwe Schindler commented on LUCENE-1906:
---------------------------------------

bq. Yes, it's relatively fast, but it's per-token too.

It is once per token. But you do not need to wrap the input Reader in a CharReader if you do not want to use CharFilters. If you wrap the Reader in a CharReader, every call to the Reader goes through it, and you have a larger overhead (one additional method call per char read, if you tokenize using Reader.read()!).

bq. Hmmm, I had missed that 2.9 required a recompile. In that case it doesn't seem like there is any additional back compat breakage and thus the correct fix would be Uwe's first patch?

A recompile is only needed in rare cases (if you override Scorers and so on). If you do not use Lucene in any very special way, it works without recompiling. In my opinion, e.g., external language Tokenizer packages (as Michael Busch calls them) without source code would not work. This example is always brought up by Michael.

> Problem with CharStream and Tokenizers with custom reset(Reader) method
> -----------------------------------------------------------------------
>
> Key: LUCENE-1906
> URL: https://issues.apache.org/jira/browse/LUCENE-1906
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.9
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Priority: Blocker
> Fix For: 2.9
>
> Attachments: backwards-break.patch, LUCENE-1906.patch, LUCENE-1906.patch, LUCENE-1906_contrib.patch
>
>
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious problem with backwards compatibility and other Tokenizers that do
> not override reset(CharStream).
> The problem is that, e.g.,
> CharTokenizer only overrides reset(Reader):
> {code}
> public void reset(Reader input) throws IOException {
>   super.reset(input);
>   bufferIndex = 0;
>   offset = 0;
>   dataLen = 0;
> }
> {code}
> If you reset such a Tokenizer with a CharStream (rather than a plain Reader), this
> method will never be called, breaking the whole Tokenizer.
> As CharStream extends Reader, I propose to remove this reset(CharStream)
> method and simply do an instanceof check to detect whether the supplied Reader
> is not a CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and wraps as needed, the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
> As this instanceof check is always done in CharReader.get(), why not remove
> ctor(CharStream) and reset(CharStream) completely?
> Any thoughts?
> I would like to fix this somehow before RC4, I'm sorry :(
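
To make the overload problem from the issue description concrete, here is a minimal, self-contained sketch. It does not use Lucene itself: SketchCharStream, SketchCharReader and all the tokenizer classes are hypothetical stand-ins invented for illustration, and resetCount is only a marker that the subclass override ran. It assumes the proposed fix is a single reset(Reader) that wraps via an instanceof check, as suggested above; it is not the committed patch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ResetOverloadSketch {

    /** Hypothetical stand-in for CharStream: a Reader that can correct offsets. */
    static abstract class SketchCharStream extends Reader {
        abstract int correctOffset(int currentOff);
    }

    /** Hypothetical stand-in for CharReader: wraps a plain Reader. */
    static final class SketchCharReader extends SketchCharStream {
        private final Reader in;

        private SketchCharReader(Reader in) { this.in = in; }

        /** Wraps only when the supplied Reader is not already a SketchCharStream. */
        static SketchCharStream get(Reader input) {
            return input instanceof SketchCharStream
                    ? (SketchCharStream) input
                    : new SketchCharReader(input);
        }

        @Override int correctOffset(int currentOff) { return currentOff; }

        @Override public int read(char[] cbuf, int off, int len) throws IOException {
            return in.read(cbuf, off, len);
        }

        @Override public void close() throws IOException { in.close(); }
    }

    /** Base class with two overloads, mirroring the design the issue criticizes. */
    static class TwoOverloadTokenizer {
        protected SketchCharStream input;
        public void reset(Reader input) { this.input = SketchCharReader.get(input); }
        public void reset(SketchCharStream input) { this.input = input; }
    }

    /** Overrides only reset(Reader), as CharTokenizer does in the issue. */
    static class BrokenTokenizer extends TwoOverloadTokenizer {
        int resetCount = 0;
        @Override public void reset(Reader input) {
            super.reset(input);
            resetCount++; // stands in for clearing bufferIndex, offset, dataLen
        }
    }

    /** Assumed fix: one reset(Reader) that wraps via the instanceof check. */
    static class SingleResetTokenizer {
        protected SketchCharStream input;
        public void reset(Reader input) { this.input = SketchCharReader.get(input); }
    }

    static class FixedTokenizer extends SingleResetTokenizer {
        int resetCount = 0;
        @Override public void reset(Reader input) {
            super.reset(input);
            resetCount++;
        }
    }

    public static void main(String[] args) {
        SketchCharStream stream = SketchCharReader.get(new StringReader("some text"));

        // The static type is SketchCharStream, so the compiler selects the
        // reset(SketchCharStream) overload and the subclass override is skipped.
        BrokenTokenizer broken = new BrokenTokenizer();
        broken.reset(stream);

        // With a single reset(Reader) there is nothing else to select.
        FixedTokenizer fixed = new FixedTokenizer();
        fixed.reset(stream);

        System.out.println("broken resetCount = " + broken.resetCount); // 0
        System.out.println("fixed resetCount = " + fixed.resetCount);   // 1
    }
}
```

The point of the sketch is that Java picks the overload at compile time from the static argument type, so a subclass that overrides only reset(Reader) is silently bypassed whenever the caller holds a CharStream-typed reference.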