lucene-java-user mailing list archives

From dr <bfore...@126.com>
Subject Some questions about StandardTokenizer and UNICODE Regular Expressions
Date Thu, 16 Jun 2016 11:01:42 GMT
Hi guys,
   Currently I'm looking into the rules of StandardTokenizer, but I've run into some problems.
   As the docs say, StandardTokenizer implements the Word Break rules from the Unicode Text
Segmentation algorithm, as specified in Unicode Standard Annex #29. It is also generated by
JFlex, a lexer/scanner generator.

   In StandardTokenizerImpl.jflex, the regular expressions are written as follows:
     "
HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                       [\p{WB:Format}\p{WB:Extend}]*
NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]        [\p{WB:Format}\p{WB:Extend}]*
KatakanaEx          = \p{WB:Katakana}                                           [\p{WB:Format}\p{WB:Extend}]*

MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]      [\p{WB:Format}\p{WB:Extend}]*

......
"
What do these definitions mean, for example HangulEx or NumericEx?
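
As a rough illustration of the shape of these definitions, here is a minimal sketch using java.util.regex. It is not the tokenizer's grammar: java.util.regex has no \p{WB:...} properties, so \p{L}, \p{Nd}, \p{M} and \p{Cf} are used as approximate stand-ins (an assumption), and the class and method names are hypothetical. It only shows two things visible in the quoted lines: "&&" intersects character classes, and each "...Ex" definition is a base character followed by zero or more Extend/Format characters.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: approximates the *shape* of the quoted JFlex macros.
final class WordBreakShapeSketch {
    // Stand-in for [\p{WB:Format}\p{WB:Extend}]* : combining marks and format chars.
    static final String EXTEND_OR_FORMAT = "[\\p{M}\\p{Cf}]*";

    // Stand-in for HangulEx: Hangul-script characters that are also letters.
    // "&&" is class intersection, as in the JFlex source.
    static final Pattern HANGUL_EX =
        Pattern.compile("[\\p{IsHangul}&&\\p{L}]" + EXTEND_OR_FORMAT);

    // Stand-in for HebrewOrALetterEx: any letter plus trailing extenders.
    static final Pattern LETTER_EX =
        Pattern.compile("\\p{L}" + EXTEND_OR_FORMAT);

    // Stand-in for NumericEx: any decimal digit (ASCII or fullwidth) plus extenders.
    static final Pattern NUMERIC_EX =
        Pattern.compile("\\p{Nd}" + EXTEND_OR_FORMAT);

    static boolean isHangulEx(String s)  { return HANGUL_EX.matcher(s).matches(); }
    static boolean isLetterEx(String s)  { return LETTER_EX.matcher(s).matches(); }
    static boolean isNumericEx(String s) { return NUMERIC_EX.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isHangulEx("\uD55C"));   // a Hangul syllable
        System.out.println(isHangulEx("a"));        // a Latin letter
        System.out.println(isLetterEx("e\u0301"));  // "e" + combining acute accent
        System.out.println(isNumericEx("\uFF13"));  // fullwidth digit three
    }
}
```

Read this way, the "Ex" suffix simply marks a character of the named class together with any combining or format characters attached to it.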
In ClassicTokenizerImpl.jflex, the rule for numbers is expressed like this:
"
P           = ("_"|"-"|"/"|"."|",")
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
"
This is easy to understand: '29.3', '29-3', and '29_3' will all be tokenized as NUM. (A bare '29' contains no {P} separator, so it would not match this particular rule.)
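
The NUM rule above can be approximated with java.util.regex to check which strings it accepts. This is only a sketch: the quoted excerpt does not show the definitions of ALPHANUM and HAS_DIGIT, so the ones below (letters and digits, and letters and digits with at least one digit) are assumptions, and the class name is hypothetical. The six alternatives are copied from the quoted rule.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the quoted NUM rule using java.util.regex.
final class ClassicNumSketch {
    // P = ("_"|"-"|"/"|"."|",") from the quoted grammar.
    static final String P = "[_\\-/.,]";
    // Assumed definitions; the real macros are not shown in the excerpt.
    static final String A = "[\\p{L}\\p{Nd}]+";                       // ALPHANUM
    static final String H = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*"; // HAS_DIGIT

    // The six alternatives of NUM, in the same order as the quoted rule.
    static final Pattern NUM = Pattern.compile(
          "(?:" + A + P + H
        + "|" + H + P + A
        + "|" + A + "(?:" + P + H + P + A + ")+"
        + "|" + H + "(?:" + P + A + P + H + ")+"
        + "|" + A + P + H + "(?:" + P + A + P + H + ")+"
        + "|" + H + P + A + "(?:" + P + H + P + A + ")+"
        + ")");

    // True when the whole string matches the approximated NUM rule.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"29.3", "29-3", "29_3", "1.2.3", "29", "a.b"}) {
            System.out.println(s + " -> " + isNum(s));
        }
    }
}
```

Under these assumptions the punctuated forms match, while a bare '29' and 'a.b' do not: every alternative needs at least one {P}, and at least one side of it has to contain a digit.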



I have read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode Standard Annex
#18 (Unicode Regular Expressions), and Unicode Standard Annex #44 (Unicode Character
Database), but they contain too much information and are hard to understand.
Does anyone have a reference for these kinds of regular expressions, or can you tell me where
to find the meanings of these Unicode regular expressions?


Thanks.