lucene-java-user mailing list archives

From András Péteri <>
Subject Re: Quiz question: Which Character.isSpaceChar but not isWhitespace?
Date Sun, 01 Nov 2015 23:14:58 GMT
Hi David,

While I agree on the quirkiness, at least it's documented (and the method
is probably kept as-is for backwards compatibility reasons); the first
bullet point of the corresponding Java SE 7 page says: "It is a Unicode
space character [...] but is not also a non-breaking space" [1].
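
To make the documented difference concrete, here is a small sketch that scans every code point and collects those for which Character.isSpaceChar is true but Character.isWhitespace is false. The class and method names are made up for illustration. On a standard JDK this should report the three non-breaking spaces named in the Javadoc (U+00A0, U+2007, U+202F).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene or the JDK.
final class SpaceCharQuiz {

    /** Returns every code point that is a "space char" but not "whitespace" on this JVM. */
    static List<Integer> spaceCharButNotWhitespace() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isSpaceChar(cp) && !Character.isWhitespace(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each differing code point with its Unicode name.
        for (int cp : spaceCharButNotWhitespace()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

The reverse set (isWhitespace but not isSpaceChar) is made up of control characters such as tab and line feed, which is why a check against only one of the two methods misses something either way.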

You can still override the isTokenChar method of WhitespaceTokenizer or
CharTokenizer in a subclass to exclude an extra set of characters from the
allowed range. If you are using Google's Guava library in your project,
they have a character matching predicate class which follows the Unicode
specification more closely [2]; this can also be used in isTokenChar as a
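
A minimal sketch of such a predicate, using only the JDK so it runs without Lucene on the classpath. The class name is hypothetical. The assumption is that a CharTokenizer subclass would delegate to it from its isTokenChar(int) override (WhitespaceTokenizer itself may be declared final in some Lucene versions, in which case CharTokenizer is the class to extend).

```java
// Hypothetical helper; in a real project this logic would live in a
// CharTokenizer subclass, roughly as:
//
//   @Override
//   protected boolean isTokenChar(int c) {
//       return SpaceAwareTokenChars.isTokenChar(c);
//   }
final class SpaceAwareTokenChars {

    /**
     * Returns true if the code point belongs inside a token, i.e. it is
     * neither Java "whitespace" nor a Unicode "space char". This also
     * splits on the non-breaking spaces that isWhitespace skips.
     */
    static boolean isTokenChar(int c) {
        return !(Character.isWhitespace(c) || Character.isSpaceChar(c));
    }

    public static void main(String[] args) {
        int[] samples = { 'a', ' ', '\t', 0x00A0, 0x2007, 0x202F };
        for (int cp : samples) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b isTokenChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp), isTokenChar(cp));
        }
    }
}
```

With Guava available, the body of isTokenChar could instead consult its whitespace CharMatcher; note that CharMatcher.matches takes a char rather than an int code point, so the argument would need a cast or range check first.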


On Fri, Oct 30, 2015 at 9:10 PM, <> wrote:

> One would think that all “space characters” are by definition
> “whitespace”.  Not true!:
> So I’m working on an app where I can no longer use WhitespaceTokenizer
> since I need to check for isSpaceChar *OR* isWhitespace.  Alternatively I
> could use MappingCharFilter, I realize.
> This had trickle-down effects on a search platform I’m working on that was
> triggered by a user’s search.  It’s caused all sorts of head-scratching
> till we discovered what’s going on.
> Craziness.
> ~ David
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker

András Péteri
