lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Date Thu, 24 May 2007 07:45:31 GMT

:     return !Character.isWhitespace(c);

: And my class override that method as this:

:     return !((int)c==32);

in my opinion that's a pretty naive change ... it won't split on tab
characters or newlines ... even for trivial ASCII text that's probably not
what you want.
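
To make that concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class TokenCharDemo, the CharTest interface and the split() helper are hypothetical stand-ins that only imitate "keep consecutive token chars together", so treat it as an illustration of the two predicates rather than of what the real tokenizer does internally.

```java
import java.util.ArrayList;
import java.util.List;

class TokenCharDemo {
    // Hypothetical stand-in for a tokenizer's "is this char part of a token?" hook.
    interface CharTest {
        boolean isTokenChar(char c);
    }

    // The original predicate quoted above.
    static boolean defaultIsTokenChar(char c) {
        return !Character.isWhitespace(c);
    }

    // The overridden predicate quoted above: only U+0020 ends a token.
    static boolean spaceOnlyIsTokenChar(char c) {
        return !((int) c == 32);
    }

    // Collects maximal runs of characters accepted by the predicate.
    static List<String> split(String text, CharTest test) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (test.isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\tbar baz\nqux";
        System.out.println(split(text, TokenCharDemo::defaultIsTokenChar).size());
        System.out.println(split(text, TokenCharDemo::spaceOnlyIsTokenChar).size());
    }
}
```

With the isWhitespace version the sample text comes out as four tokens; with the "== 32" version it comes out as two, because the tab and the newline stay glued inside "foo\tbar" and "baz\nqux".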

: I think the Character.isWhitespace consider the unicodes as space :))
: so everything will mess up.

every character in java is a unicode character, so your comment doesn't
really make sense to me ... the javadocs are very clear about the
definition of "whitespace" in java...

    * It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
      PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
      '\u2007', '\u202F').
    * It is '\u0009', HORIZONTAL TABULATION.
    * It is '\u000A', LINE FEED.
    * It is '\u000B', VERTICAL TABULATION.
    * It is '\u000C', FORM FEED.
    * It is '\u000D', CARRIAGE RETURN.
    * It is '\u001C', FILE SEPARATOR.
    * It is '\u001D', GROUP SEPARATOR.
    * It is '\u001E', RECORD SEPARATOR.
    * It is '\u001F', UNIT SEPARATOR.
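
If you want to see that list in action, a small check along these lines should do it. The class and helper names (WhitespaceListCheck, allWhitespace, noneWhitespace) are made up for the example; the only real API used is Character.isWhitespace(int). The code points are written as hex values rather than char literals because unicode escapes for LINE FEED and CARRIAGE RETURN are not legal inside Java char literals.

```java
class WhitespaceListCheck {
    // The control characters the javadoc lists explicitly, as code points.
    static final int[] LISTED_CONTROLS = {
        0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
        0x001C, 0x001D, 0x001E, 0x001F
    };

    // The non-breaking spaces the javadoc excludes.
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    // True if every code point in the array counts as Java whitespace.
    static boolean allWhitespace(int[] codePoints) {
        for (int cp : codePoints) {
            if (!Character.isWhitespace(cp)) {
                return false;
            }
        }
        return true;
    }

    // True if no code point in the array counts as Java whitespace.
    static boolean noneWhitespace(int[] codePoints) {
        for (int cp : codePoints) {
            if (Character.isWhitespace(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("listed controls are whitespace: " + allWhitespace(LISTED_CONTROLS));
        System.out.println("non-breaking spaces are not:    " + noneWhitespace(NON_BREAKING));
        System.out.println("plain space is whitespace:      " + Character.isWhitespace(0x0020));
    }
}
```

All three lines should report true, which is just the javadoc definition restated as code.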

...are there Persian characters with a category type of SPACE_SEPARATOR,
LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
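
One way to answer that for yourself is to scan a range of code points and ask Character.getType() for the category. The sketch below assumes the Persian text in question lives in the Arabic block, U+0600 to U+06FF; that range is my assumption, the thread doesn't say which characters are involved. SeparatorScan and its methods are hypothetical names for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SeparatorScan {
    // True if the code point has one of the three separator categories
    // that Character.isWhitespace treats as whitespace candidates.
    static boolean isSeparatorCategory(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR;
    }

    // Lists every code point in [first, last] with a separator category.
    static List<Integer> separatorsIn(int first, int last) {
        List<Integer> found = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            if (isSeparatorCategory(cp)) {
                found.add(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed range: the Arabic block, which Persian text commonly uses.
        for (int cp : separatorsIn(0x0600, 0x06FF)) {
            System.out.println("separator: U+" + Integer.toHexString(cp).toUpperCase());
        }
        // U+200C (ZERO WIDTH NON-JOINER) is worth a look as well; it is a
        // FORMAT character, not a separator, so isWhitespace rejects it.
        System.out.println("U+200C whitespace? " + Character.isWhitespace(0x200C));
    }
}
```

If the scan finds nothing in the range you care about, then Character.isWhitespace isn't treating any of those characters as a space, and the problem is somewhere else. Widen the range (or feed it the exact characters from your documents) if your text uses code points outside the block assumed here.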


-Hoss



