lucene-java-user mailing list archives

From Dan Armbrust
Subject WhiteSpace Tokenizer question
Date Tue, 23 Aug 2005 15:32:13 GMT
I wrote a slightly modified version of the WhitespaceTokenizer that 
allows me to treat other characters as whitespace.  My thought was that 
this would be an easy way to make it tokenize on characters such as "-".

My tokenizer looks like this:

public class CustomWhiteSpaceTokenizer extends CharTokenizer
{
    protected boolean isTokenChar(char c)
    {
        // Characters in whiteSpaceChars_ are treated like whitespace.
        if (Character.isWhitespace(c) || whiteSpaceChars_.contains(new Character(c)))
        {
            return false;
        }
        return true;
    }

    <snip other stuff>
}
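
For anyone who wants to check the splitting logic outside of Lucene, here is a minimal stand-alone sketch of what the tokenizer above is meant to do. It is not the snipped code: the class name DelimiterSplitSketch and the extraDelimiters parameter are made up for illustration, and a plain string loop stands in for CharTokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not use any Lucene classes.
class DelimiterSplitSketch {

    // Mirrors the isTokenChar test from the message: whitespace and any
    // configured extra character end the current token.
    static boolean isTokenChar(char c, Set<Character> extraDelimiters) {
        return !(Character.isWhitespace(c) || extraDelimiters.contains(c));
    }

    // Collects maximal runs of token characters and drops the delimiters.
    static List<String> tokenize(String text, Set<Character> extraDelimiters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraDelimiters)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> extra = new HashSet<>(Arrays.asList('-'));
        System.out.println(tokenize("john  a", extra));
        System.out.println(tokenize("john--a", extra));
    }
}
```

On its own, this logic yields the same two tokens (john, a) for both inputs, so the tokenizer by itself does not distinguish a space from a hyphen.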

When I use my Analyzer, which uses this tokenizer, in the QueryParser 
with the character "-" defined as whitespace, the following query gets 
parsed like this:

"title:(john  a) body:(john  a) " -> (title:john title:a) (body:john body:a)

which is what I expect.  But then the following query:

"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"

is not what I want; I expected the same parse as the first query.  I 
can't figure out why it behaves differently on these characters (space 
vs hyphen) when I am specifying both of them as non-token characters.
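
One possible explanation, sketched below without Lucene, is that two stages are involved. The sketch assumes that the QueryParser grammar splits the clause on real whitespace before the analyzer ever sees the text, and that a single chunk which analyzes to more than one token is turned into a phrase. Both points are assumptions about the parser, not something verified against the trunk, and ParseOrderSketch, analyze and parse are made-up names rather than Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two stages; none of these names are Lucene APIs.
class ParseOrderSketch {

    // Stage 2: the custom analysis, treating whitespace and '-' as delimiters.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.split("[\\s-]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 1: split the clause on real whitespace only, then analyze each
    // chunk. One token becomes a term; several tokens become a phrase.
    static String parse(String field, String clause) {
        List<String> parts = new ArrayList<>();
        for (String chunk : clause.trim().split("\\s+")) {
            List<String> tokens = analyze(chunk);
            if (tokens.size() == 1) {
                parts.add(field + ":" + tokens.get(0));
            } else if (tokens.size() > 1) {
                parts.add(field + ":\"" + String.join(" ", tokens) + "\"");
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(parse("title", "john  a"));
        System.out.println(parse("title", "john--a"));
    }
}
```

Under those assumptions the model gives title:john title:a for the first input and title:"john a" for the second, which matches the two parses shown above: the space is consumed by the parser, while the hyphens only disappear inside the analyzer.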

This is with the svn trunk as of yesterday.
Any help appreciated,



Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
