From: "Uwe Schindler"
To:
References: <4A7AE537.2080705@gmail.com> <8f0ad1f30908060758gd6adca5q797c460e9eeedf5f@mail.gmail.com>
 <8f0ad1f30908060814s5fba9308x78cc9ff98e754cd8@mail.gmail.com> <4A7AFC2B.8040706@gmail.com>
Subject: RE: Issue with Solr TokenFilter and the new TokenStream API
Date: Thu, 6 Aug 2009 18:16:30 +0200
In-reply-to: <4A7AFC2B.8040706@gmail.com>

Thanks, we are always here to help :-)

> Test passes with this patch - thanks a lot Robert ! I was going to ask
> you to create a solr issue, but I see you already have, thanks!
>
> No need to create a test I think - put in the new Lucene jars and it
> fails, so likely thats good enough. Though it is spooky that the test
> passed without the new jars

See LUCENE-1762, I think this problem comes from there. I would strongly
suggest creating a test case with better lists of terms of different
lengths and so on.

> so perhaps a more targeted test is
> warranted after all.

More tests are always better :-) When I create tests locally to check
something (even strange ones), I often simply add them to Lucene's
test cases.
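
To illustrate what a test with terms of different lengths would catch, here
is a minimal sketch of the failure mode discussed in this thread. It does not
use any Lucene or Solr classes: TermLengthSketch, isKeepWord, and the plain
HashSet standing in for the keep set are hypothetical names made up for this
example. The only assumption is that the term buffer is reused and can be
longer than the current term, so buffer.length includes stale characters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TermLengthSketch {

    // Hypothetical stand-in for a keep-word lookup: checks the characters
    // buffer[offset .. offset + length) against a plain set of strings.
    static boolean isKeepWord(Set<String> keep, char[] buffer, int offset, int length) {
        return keep.contains(new String(buffer, offset, length));
    }

    public static void main(String[] args) {
        Set<String> keep = new HashSet<>(Arrays.asList("the", "and"));

        // Simulate a reused term buffer: a longer previous term is
        // overwritten by a shorter one, leaving trailing characters behind.
        char[] termBuffer = new char[8];
        "kitten".getChars(0, 6, termBuffer, 0);
        "the".getChars(0, 3, termBuffer, 0);
        int termBufferLength = 3; // valid length of the current term

        // Using the valid term length matches the keep word.
        System.out.println(isKeepWord(keep, termBuffer, 0, termBufferLength)); // prints "true"

        // Using the array capacity drags in the stale tail and misses.
        System.out.println(isKeepWord(keep, termBuffer, 0, termBuffer.length)); // prints "false"
    }
}
```

A test that feeds a long term followed by a shorter keep word exercises
exactly this case, which a list of equal-length terms would not.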

> - Mark
>
> Robert Muir wrote:
> > Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> > ===================================================================
> > --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
> > +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
> > @@ -209,7 +209,7 @@
> >          //make a backup in case we exceed the word count
> >          System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
> >        }
> > -      if (termBuffer.length < factory.maxTokenLength) {
> > +      if (termBufferLength < factory.maxTokenLength) {
> >          int wordCount = 0;
> >
> >          int lastWordStart = 0;
> > @@ -226,8 +226,8 @@
> >          }
> >
> >          // process the last word
> > -        if (lastWordStart < termBuffer.length) {
> > -          factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
> > +        if (lastWordStart < termBufferLength) {
> > +          factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
> >          }
> >
> >          if (wordCount > factory.maxWordCount) {
> >
> >
> > On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir wrote:
> >
> >> Mark, I looked at this and think it might be unrelated to tokenstreams.
> >>
> >> I think the length argument being provided to processWord(char[]
> >> buffer, int offset, int length, int wordCount) in that filter might be
> >> incorrectly calculated.
> >> This is the method that checks the keep list.
> >>
> >> (There is trailing trash on the end of tokens, even with the previous
> >> version of lucene in Solr).
> >> It just so happens the tokens with trailing trash were ones that were
> >> keep words in the previous version, so the test didnt fail.
> >>
> >> different tokens have trailing trash in the current version
> >> (specifically some of the "the" tokens), so its failing now.
> >>
> >>
> >> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller wrote:
> >>
> >>> I think there is an issue here, but I didn't follow the TokenStream
> >>> improvements very closely.
> >>>
> >>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
> >>> with keep words - it then checks (with the old TokenStream API) each token
> >>> (char array) to see if it should keep it. I think because of the cloning
> >>> going on in next, this breaks and you can't match anything in the keep set.
> >>> Does that make sense?
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
>
> --
> - Mark
>
> http://www.lucidimagination.com