lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Some questions about StandardTokenizer and UNICODE Regular Expressions
Date Thu, 16 Jun 2016 15:01:18 GMT
Hi dr,

Unicode’s character property model is described here: <http://unicode.org/reports/tr23/>.

Wikipedia has a description of Unicode character properties: <https://en.wikipedia.org/wiki/Unicode_character_property>

JFlex allows you to refer to the set of characters that have a given Unicode property using
the \p{PropertyName} syntax.  In the case of the HangulEx macro:

  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*

This matches a Hangul script character (\p{Script:Hangul})[1] that also has either the Word-Break
property “ALetter” or “Hebrew_Letter”, followed by zero or more characters that have
either the “Format” or “Extend” Word-Break properties[2].
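
If you want to see what those macros produce at tokenization time, a rough sketch along these
lines (with the Lucene core analysis classes on the classpath; the class name and sample text
here are just for illustration) prints each token together with its type:

  import java.io.StringReader;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

  public class StandardTokenizerTypeDemo {               // hypothetical class name
    public static void main(String[] args) throws Exception {
      String text = "한국 Lucene 29.3";                   // Hangul word, Latin word, number
      try (StandardTokenizer tokenizer = new StandardTokenizer()) {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          // expect roughly: 한국/<HANGUL>, Lucene/<ALPHANUM>, 29.3/<NUM>
          System.out.println(term.toString() + " / " + type.type());
        }
        tokenizer.end();
      }
    }
  }

Text matched by HangulEx should come out as a single token typed “<HANGUL>”, with any trailing
Format/Extend characters absorbed into it, as the macro above describes.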

Some helpful resources:

* Character code charts organized by Unicode block: <http://www.unicode.org/charts/>
* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - note that
this utility supports a different regex syntax from JFlex - click on the “help” link for
more info.

[1] All characters matching \p{Script:Hangul}: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
[2] Word-Break properties, which in JFlex can be referred to with the abbreviation “WB:”
in \p{WB:property-name}, are described in the table at <http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
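
If you want to check those properties for an individual code point from Java, one option is ICU4J
(the same library Lucene’s ICU analysis module uses). A small sketch, assuming ICU4J is on the
classpath, with an illustrative class name and code points chosen for the example:

  import com.ibm.icu.lang.UCharacter;
  import com.ibm.icu.lang.UProperty;
  import com.ibm.icu.lang.UScript;

  public class WordBreakPropertyCheck {                  // hypothetical class name
    public static void main(String[] args) {
      int ga = 0xAC00;          // HANGUL SYLLABLE GA
      int acuteAccent = 0x0301; // COMBINING ACUTE ACCENT
      int softHyphen = 0x00AD;  // SOFT HYPHEN

      // \p{Script:Hangul}
      System.out.println(UScript.getScript(ga) == UScript.HANGUL);                    // true

      // \p{WB:ALetter}
      System.out.println(UCharacter.getIntPropertyValue(ga, UProperty.WORD_BREAK)
          == UCharacter.WordBreak.ALETTER);                                           // true

      // \p{WB:Extend} - combining marks are absorbed into the preceding token
      System.out.println(UCharacter.getIntPropertyValue(acuteAccent, UProperty.WORD_BREAK)
          == UCharacter.WordBreak.EXTEND);                                            // true

      // \p{WB:Format} - format controls are absorbed as well
      System.out.println(UCharacter.getIntPropertyValue(softHyphen, UProperty.WORD_BREAK)
          == UCharacter.WordBreak.FORMAT);                                            // true
    }
  }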

--
Steve
www.lucidworks.com


> On Jun 16, 2016, at 7:01 AM, dr <bforevdr@126.com> wrote:
> 
> Hi guys
>   Currently, I'm looking into the rules of StandardTokenizer, but have run into some problems.
>   As the docs say, StandardTokenizer implements the Word Break rules from the Unicode
> Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It is also generated
> by JFlex, a lexer/scanner generator.
> 
>   In StandardTokenizerImpl.jflex, the regular expressions are defined as follows:
> "
> HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                       [\p{WB:Format}\p{WB:Extend}]*
> NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]        [\p{WB:Format}\p{WB:Extend}]*
> KatakanaEx          = \p{WB:Katakana}                                           [\p{WB:Format}\p{WB:Extend}]*
> MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]      [\p{WB:Format}\p{WB:Extend}]*
> ......
> "
> What do these mean, like HangulEx or NumericEx?
> In ClassicTokenizerImpl.jflex, for NUM, it is expressed like this:
> "
> P           = ("_"|"-"|"/"|"."|",")
> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>           | {HAS_DIGIT} {P} {ALPHANUM}
>           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
> "
> This is easy to understand: '29', '29.3', '29-3', '29_3' will all be tokenized as NUMBERS.
> 
> 
> 
> I have read Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION, Unicode Standard Annex
> #18 UNICODE REGULAR EXPRESSIONS, and Unicode Standard Annex #44 UNICODE CHARACTER DATABASE,
> but they contain too much information and are hard to understand.
> Does anyone have a reference for these kinds of regular expressions, or can you tell me where
> to find the meanings of these UNICODE regular expressions?
> 
> 
> Thanks.



