lucene-java-user mailing list archives

From "Vanlerberghe, Luc"
Subject RE: WhiteSpace Tokenizer question
Date Tue, 23 Aug 2005 16:26:37 GMT
The query string is first parsed by QueryParser, and whatever it believes
to be a single term is then passed on to your analyzer.  QueryParser only
considers space, tab, \n and \r to be white space (see QueryParser.jj).

QueryParser itself is not aware that '-' should be treated as white
space, so in your second example it treats john--a as a single term.
Your analyzer then converts this single term into two tokens, and
QueryParser treats those two tokens as a phrase query.
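
To make the two stages concrete, here is a toy model in plain Java.  It
is not Lucene code: the class and method names are made up for
illustration, and it ignores all real query syntax (operators, field
groups, quoting).  It only mimics the order of operations described
above: split on space, tab, \n and \r first, then run each chunk through
an analyzer that also breaks on '-'.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these names are hypothetical and not part of Lucene.
class QueryParseSketch {

    // Stand-in for the custom analyzer: breaks on whitespace and on '-'.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.split("[\\s-]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the parser stage: it only splits on space, tab, \n, \r,
    // then hands each chunk to the analyzer.
    static String parse(String field, String text) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : text.split("[ \\t\\n\\r]+")) {
            List<String> tokens = analyze(chunk);
            if (tokens.isEmpty()) {
                continue;
            }
            if (tokens.size() == 1) {
                // One token from one chunk: a plain term clause.
                clauses.add(field + ":" + tokens.get(0));
            } else {
                // Several tokens from one chunk: grouped as a phrase.
                clauses.add(field + ":\"" + String.join(" ", tokens) + "\"");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(parse("title", "john  a"));
        System.out.println(parse("title", "john--a"));
    }
}
```

In this model "john  a" yields two separate term clauses, while
"john--a" reaches the analyzer as one chunk and comes back as a phrase.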

title:(john- -a) body:(john- -a)

I would expect the result to be:
(title:john title:a) (body:john body:a)
since QueryParser will break on the extra spaces now and your analyzer
will strip the remaining '-' afterwards.

I guess the best solution is to convert all characters that you consider
to be white-space to real spaces before passing the query string to
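
A minimal sketch of that preprocessing step, in plain Java with no
Lucene dependency.  The class name, the method name and the
extraWhitespace parameter are all made up for illustration.  Note that
'-' is also the prohibit operator in the query syntax, so a blanket
replacement like this one removes that meaning as well; whether that is
acceptable depends on your application.

```java
// Hypothetical helper; not part of Lucene.
final class QueryWhitespaceNormalizer {

    // Replaces every character listed in extraWhitespace with a real space,
    // so that a parser which only splits on real whitespace sees the same
    // term boundaries the custom tokenizer would produce.
    static String normalize(String query, String extraWhitespace) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            sb.append(extraWhitespace.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: title:(john  a) body:(john  a)
        System.out.println(normalize("title:(john--a) body:(john--a)", "-"));
    }
}
```

The normalized string is then what gets handed to the parser in place of
the raw user input.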


-----Original Message-----
From: Dan Armbrust
Sent: dinsdag 23 augustus 2005 17:32
Subject: WhiteSpace Tokenizer question

I wrote a slightly modified version of the WhiteSpaceTokenizer that 
allows me to treat other characters as whitespace.  My thought was that 
this would be an easy way to make it tokenize on characters such as "-".

My tokenizer looks like this:

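public class CustomWhiteSpaceTokenizer extends CharTokenizer
{
    // A character belongs to a token unless it is real whitespace or one
    // of the extra characters configured to act as whitespace.
    protected boolean isTokenChar(char c)
    {
        if (Character.isWhitespace(c) || whiteSpaceChars_.contains(new Character(c)))
            return false;
        return true;
    }

    <snip other stuff>
}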

When I use my Analyzer, which uses this tokenizer, in the QueryParser
with the character "-" defined as whitespace, the following query gets
parsed like this:

"title:(john  a) body:(john  a) " -> (title:john title:a) (body:john body:a)

which is what I expect.  But then the following query:

"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"

isn't what I want.  I can't figure out why it behaves differently for
these two characters (space vs. hyphen) when I am specifying both of
them as non-token characters.

This is with the svn trunk as of yesterday.
Any help appreciated,



Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
