From: Robert Muir
Date: Thu, 1 Nov 2012 19:45:27 -0400
Subject: Re: tokenizer's tokens
To: java-user@lucene.apache.org
In-Reply-To: <5093065F.3040800@getrailo.org>

This is intentional: your code has a bug. You need to call reset() on the
tokenizer before the first call to incrementToken(). See step 2 of the
TokenStream contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

On Thu, Nov 1, 2012 at 7:31 PM, Igal @ getRailo.org wrote:
> I'm trying to write a very simple method to show the different tokens that
> come out of a tokenizer. When I call WhitespaceTokenizer's (or
> LetterTokenizer's) incrementToken() method, though, I get an
> ArrayIndexOutOfBoundsException (see below).
>
> Any ideas?
>
> P.S. If I use StandardTokenizer it works.
>
> java.lang.ArrayIndexOutOfBoundsException: -1
>     at java.lang.Character.codePointAtImpl(Character.java:4739)
>     at java.lang.Character.codePointAt(Character.java:4702)
>     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
>     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
>     at test.Test1.tokenize(Test1.java:46)
>     at test.Test1.main(Test1.java:139)
>
> class Test1 {
>
>     static Version v = Version.LUCENE_40;
>
>     static void tokenize( String s ) throws IOException {
>
>         Reader r = new StringReader( s );
>         Tokenizer t = new WhitespaceTokenizer( v, r );
>         CharTermAttribute attrTerm = t.getAttribute( CharTermAttribute.class );
>
>         while ( t.incrementToken() ) {
>             String term = attrTerm.toString();
>             System.out.println( term );
>         }
>     }
>
>     public static void main( String[] args ) throws IOException {
>
>         String[] text = {
>             "The quick brown fox jumps over the lazy dog",
>             "Only the fool would take trouble to verify that his sentence was composed of ten a's, three b's, four c's, four d's, forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but not least, a single!",
>         };
>
>         for ( String s : text )
>             tokenize( s );
>     }
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
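
For reference, the fix described in the reply might look roughly like the
following. This is only a sketch, assuming Lucene 4.0 with the
analyzers-common module on the classpath. The class name TokenDump and the
List-returning tokenize() helper are illustrative and do not come from the
original post. It follows the consumer workflow from the TokenStream javadoc:
call reset(), call incrementToken() until it returns false, then call end()
and close().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class (not from the original post).
class TokenDump {

    static final Version V = Version.LUCENE_40;

    // Collects the terms produced by WhitespaceTokenizer for the given text.
    static List<String> tokenize(String s) throws IOException {
        List<String> terms = new ArrayList<String>();
        Reader r = new StringReader(s);
        Tokenizer t = new WhitespaceTokenizer(V, r);
        try {
            // Keep a local reference to the attribute we want to read.
            CharTermAttribute termAttr = t.addAttribute(CharTermAttribute.class);

            // The call missing from the original code: reset() must come
            // before the first incrementToken().
            t.reset();

            while (t.incrementToken()) {
                terms.add(termAttr.toString());
            }

            // Signal end-of-stream so the tokenizer can finalize its state.
            t.end();
        } finally {
            // Release the underlying Reader.
            t.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        for (String term : tokenize("The quick brown fox jumps over the lazy dog")) {
            System.out.println(term);
        }
    }
}
```

The same pattern should apply to LetterTokenizer, which the original post
reports failing in the same way.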