lucene-java-user mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Date Tue, 29 May 2007 17:04:16 GMT
Hi Mohammad,

Mohammad Norouzi wrote:
> [Hoss wrote:]
>> ...are there Persian characters with a category type of SPACE_SEPARATOR,
>> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
> 
> How can I know that?

The Unicode standard's codes[1] for these are:

   SPACE SEPARATOR: Zs
   LINE SEPARATOR: Zl
   PARAGRAPH SEPARATOR: Zp

From <http://www.unicode.org/Public/4.0-Update/PropList-4.0.0.txt>, the
only characters with these properties are:

   0020       ; White_Space # Zs    SPACE
   00A0       ; White_Space # Zs    NO-BREAK SPACE
   1680       ; White_Space # Zs    OGHAM SPACE MARK
   180E       ; White_Space # Zs    MONGOLIAN VOWEL SEPARATOR
   2000..200A ; White_Space # Zs    EN QUAD..HAIR SPACE
   200B       ; Other_Default_Ignorable_Code_Point # Zs ZERO WIDTH SPACE
   2028       ; White_Space # Zl    LINE SEPARATOR
   2029       ; White_Space # Zp    PARAGRAPH SEPARATOR
   202F       ; White_Space # Zs    NARROW NO-BREAK SPACE
   205F       ; White_Space # Zs    MEDIUM MATHEMATICAL SPACE
   3000       ; White_Space # Zs    IDEOGRAPHIC SPACE
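To check a particular character from Java, the general category can be read with java.lang.Character.getType(). The following is only a minimal sketch; the class and method names (SeparatorCategory, separatorCode) are made up for illustration. Note that the answer depends on the Unicode version your JVM implements, which may differ from the 4.0.0 data above (for example, later Unicode versions no longer classify U+200B as Zs).

```java
public class SeparatorCategory {

    /**
     * Returns the Unicode general category code ("Zs", "Zl" or "Zp") if the
     * code point is a separator, or null otherwise. Hypothetical helper,
     * named here for illustration only.
     */
    public static String separatorCode(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:     return "Zs";
            case Character.LINE_SEPARATOR:      return "Zl";
            case Character.PARAGRAPH_SEPARATOR: return "Zp";
            default:                            return null;
        }
    }

    public static void main(String[] args) {
        // A few code points from the list above, plus ARABIC LETTER PEH (U+067E).
        int[] samples = { 0x0020, 0x00A0, 0x2028, 0x2029, 0x3000, 0x067E };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, separatorCode(cp));
        }
    }
}
```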

Modern Persian uses Arabic orthography with four additional letters[2]
-- peh, tcheh, jeh, and gaf -- all of which are included in the Unicode
basic Arabic character set.

The Arabic Unicode character ranges are:

   [U+0600 - U+06FF] <http://www.unicode.org/charts/PDF/U0600.pdf>
   [U+0750 - U+077F] <http://www.unicode.org/charts/PDF/U0750.pdf>
   [U+FB50 - U+FDFF] <http://www.unicode.org/charts/PDF/UFB50.pdf>
   [U+FE70 - U+FEFF] <http://www.unicode.org/charts/PDF/UFE70.pdf>

The intersection of the sets { all Arabic characters } and { all Unicode
whitespace characters } is the null set.  Thus, it appears, there are no
Arabic-specific (and hence Persian-specific) whitespace characters in
the Unicode standard.
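The same conclusion can be checked mechanically. This is a rough sketch, not a definitive test: it walks the four Arabic blocks (taking Arabic Presentation Forms-A as U+FB50 through U+FDFF) and counts code points that Character.isSpaceChar() reports as Zs, Zl, or Zp. The class and method names are invented for illustration, and the result reflects the Unicode version of the JVM it runs on rather than Unicode 4.0.0 specifically.

```java
public class ArabicSeparatorScan {

    // Inclusive code point ranges for the Arabic blocks (assumed bounds).
    static final int[][] ARABIC_RANGES = {
        { 0x0600, 0x06FF },  // Arabic
        { 0x0750, 0x077F },  // Arabic Supplement
        { 0xFB50, 0xFDFF },  // Arabic Presentation Forms-A
        { 0xFE70, 0xFEFF },  // Arabic Presentation Forms-B
    };

    /** Counts code points in the given ranges whose category is Zs, Zl or Zp. */
    public static int countSeparators(int[][] ranges) {
        int count = 0;
        for (int[] range : ranges) {
            for (int cp = range[0]; cp <= range[1]; cp++) {
                // isSpaceChar is true exactly for SPACE_SEPARATOR,
                // LINE_SEPARATOR and PARAGRAPH_SEPARATOR.
                if (Character.isSpaceChar(cp)) {
                    System.out.printf("separator found: U+%04X%n", cp);
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("separators in Arabic ranges: "
                + countSeparators(ARABIC_RANGES));
    }
}
```

If this reports zero, then a tokenizer that splits only on separator-category characters will not be splitting on anything inside the Arabic blocks themselves.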

Steve

[1] Unicode 4.0.0 Character Database - Property value codes:
<http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#Property_Values>
[2] http://en.wikipedia.org/wiki/Persian_alphabet

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
