lucene-java-user mailing list archives

From "Igal @" <>
Subject tokenizer's tokens
Date Thu, 01 Nov 2012 23:31:43 GMT
I'm trying to write a very simple method to show the different tokens
that come out of a tokenizer.  When I call the incrementToken() method
of WhitespaceTokenizer (or LetterTokenizer), though, I get an
ArrayIndexOutOfBoundsException (see below).

Any ideas?

P.S.  If I use StandardTokenizer, it works.

java.lang.ArrayIndexOutOfBoundsException: -1
     at java.lang.Character.codePointAtImpl(
     at java.lang.Character.codePointAt(
     at test.Test1.tokenize(
     at test.Test1.main(

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

class Test1 {

     static Version v = Version.LUCENE_40;

     // Prints each token the tokenizer produces for s, one per line.
     static void tokenize( String s ) throws IOException {

         Reader r = new StringReader( s );

         Tokenizer t = new WhitespaceTokenizer( v, r );

         CharTermAttribute attrTerm = t.getAttribute( CharTermAttribute.class );

         // the exception is thrown by this call to incrementToken()
         while ( t.incrementToken() ) {

             String term = attrTerm.toString();

             System.out.println( term );
         }
     }

     public static void main( String[] args ) throws IOException {

         String[] text = {

             "The quick brown fox jumps over the lazy dog",
             "Only the fool would take trouble to verify that his "
                 + "sentence was composed of ten a's, three b's, four c's, four d's, "
                 + "forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two "
                 + "k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, "
                 + "sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight "
                 + "w's, four x's, eleven y's, twenty-seven commas, twenty-three "
                 + "apostrophes, seven hyphens and, last but not least, a single!"
         };

         for ( String s : text )
             tokenize( s );
     }
}
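
For reference, here is a rough sketch of the output I expect, written with only the JDK rather than Lucene. It assumes that splitting on runs of whitespace approximates what WhitespaceTokenizer would emit for plain ASCII input. The names ExpectedTokens and splitOnWhitespace are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class ExpectedTokens {

    // Hypothetical helper, not a Lucene API: splits on runs of whitespace,
    // which is assumed to approximate WhitespaceTokenizer for simple input.
    public static List<String> splitOnWhitespace(String s) {
        List<String> tokens = new ArrayList<String>();
        for (String part : s.trim().split("\\s+")) {
            // split() yields a single empty string for blank input; skip it
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line, as the tokenize() method is meant to do
        for (String term : splitOnWhitespace("The quick brown fox jumps over the lazy dog")) {
            System.out.println(term);
        }
    }
}
```

For the first test sentence this should give nine terms, one word per line, which is what I would like to see from the tokenizer as well.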


