lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: tokenizer's tokens
Date Thu, 01 Nov 2012 23:45:27 GMT
This is intentional: the exception comes from a bug in your code, not in the tokenizer.

You need to call reset() before the first call to incrementToken(). See step 2 of the TokenStream contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
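
For illustration, here is a minimal sketch of the consumer workflow described in that contract, assuming Lucene 4.0 (lucene-core and lucene-analyzers-common) is on the classpath. The class name TokenizeDemo and the choice to return the terms as a list are my own, not taken from the original code. The call that matters is reset() before the loop; end() and close() complete the contract.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical demo class; only the Lucene calls come from the 4.0 API.
public class TokenizeDemo {

    static List<String> tokenize(String s) throws IOException {
        List<String> terms = new ArrayList<String>();
        Tokenizer t = new WhitespaceTokenizer(Version.LUCENE_40, new StringReader(s));
        try {
            // Register the term attribute before consuming the stream.
            CharTermAttribute termAttr = t.addAttribute(CharTermAttribute.class);

            // The step missing from the original code: reset() must be
            // called before the first incrementToken().
            t.reset();

            while (t.incrementToken()) {
                terms.add(termAttr.toString());
            }

            // Signal end-of-stream once incrementToken() has returned false.
            t.end();
        } finally {
            // Release the underlying Reader.
            t.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        for (String term : tokenize("The quick brown fox jumps over the lazy dog")) {
            System.out.println(term);
        }
    }
}
```

The only change the original tokenize() strictly needs is the reset() call; the try/finally and end() are there so the sketch follows the remaining steps of the contract as well.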

On Thu, Nov 1, 2012 at 7:31 PM, Igal @ getRailo.org <igal@getrailo.org> wrote:
> I'm trying to write a very simple method to show the different tokens that
> come out of a tokenizer.  When I call WhitespaceTokenizer's (or
> LetterTokenizer's) incrementToken() method, though, I get an
> ArrayIndexOutOfBoundsException (see below).
>
> any ideas?
>
> p.s.  if I use StandardTokenizer it works.
>
>
> java.lang.ArrayIndexOutOfBoundsException: -1
>     at java.lang.Character.codePointAtImpl(Character.java:4739)
>     at java.lang.Character.codePointAt(Character.java:4702)
>     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
>     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
>     at test.Test1.tokenize(Test1.java:46)
>     at test.Test1.main(Test1.java:139)
>
>
> class Test1 {
>
>     static Version v = Version.LUCENE_40;
>
>
>     static void tokenize( String s ) throws IOException {
>
>         Reader r = new StringReader( s );
>
>         Tokenizer t = new WhitespaceTokenizer( v, r );
>
>         CharTermAttribute   attrTerm = t.getAttribute( CharTermAttribute.class );
>
>         while ( t.incrementToken() ) {
>
>             String term = attrTerm.toString();
>
>             System.out.println( term );
>         }
>     }
>
>
>     public static void main( String[] args ) throws IOException {
>
>         String[] text = {
>
>             "The quick brown fox jumps over the lazy dog",
>             "Only the fool would take trouble to verify that his sentence "
>             + "was composed of ten a's, three b's, four c's, four d's, forty-six e's, "
>             + "sixteen f's, four g's, thirteen h's, fifteen i's, two k's, nine l's, four "
>             + "m's, twenty-five n's, twenty-four o's, five p's, sixteen r's, forty-one s's, "
>             + "thirty-seven t's, ten u's, eight v's, eight w's, four x's, eleven y's, "
>             + "twenty-seven commas, twenty-three apostrophes, seven hyphens and, last but "
>             + "not least, a single!",
>
>         };
>
>         for ( String s : text )
>             tokenize( s );
>
>     }
>
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


