lucene-java-user mailing list archives

From dr <bfore...@126.com>
Subject Re:Re: Some questions about StandardTokenizer and UNICODE Regular Expressions
Date Thu, 16 Jun 2016 15:15:46 GMT

Thank you so much, Steve. Your reply is very helpful.

At 2016-06-16 23:01:18, "Steve Rowe" <sarowe@gmail.com> wrote:
>Hi dr,
>
>Unicode’s character property model is described here: <http://unicode.org/reports/tr23/>.
>
>Wikipedia has a description of Unicode character properties: <https://en.wikipedia.org/wiki/Unicode_character_property>
>
>JFlex allows you to refer to the set of characters that have a given Unicode property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:
>
>  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
>
>This matches a Hangul script character (\p{Script:Hangul})[1] that also has the Word-Break property “ALetter” or “Hebrew_Letter”, followed by zero or more characters that have the “Format” or “Extend” Word-Break property[2].
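
The shape of that macro (a set intersection for the base character, then a run of trailing Format/Extend characters) can be sketched with plain java.util.regex. This is only a rough approximation, not the JFlex rule itself: java.util.regex has no Word_Break property, so \p{IsAlphabetic} is assumed here as a stand-in for the ALetter/Hebrew_Letter set, and the general categories Cf, Mn and Me are assumed as stand-ins for Format and Extend. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class HangulExSketch {
    // Approximation only: java.util.regex has no Word_Break property, so
    // \p{IsAlphabetic} stands in for [\p{WB:ALetter}\p{WB:Hebrew_Letter}] and
    // the general categories Cf/Mn/Me stand in for [\p{WB:Format}\p{WB:Extend}].
    static final Pattern HANGUL_EX = Pattern.compile(
        "[\\p{IsHangul}&&\\p{IsAlphabetic}]"  // one Hangul-script letter (set intersection)
        + "[\\p{Cf}\\p{Mn}\\p{Me}]*");        // zero or more trailing format/combining characters

    // True if the whole string is one base character plus optional trailing modifiers.
    static boolean matchesHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesHangulEx("\uD55C"));        // a Hangul syllable
        System.out.println(matchesHangulEx("\uD55C\u0301"));  // the same syllable plus a combining mark
        System.out.println(matchesHangulEx("\u30AB"));        // a Katakana letter, outside the Hangul script
        System.out.println(matchesHangulEx("A"));             // a Latin letter, outside the Hangul script
    }
}
```

The `&&` inside the brackets is intersection in both JFlex and java.util.regex: a character must be in the Hangul script and also in the letter set to match.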
>
>Some helpful resources:
>
>* Character code charts organized by Unicode block: <http://www.unicode.org/charts/>
>* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - note that this utility supports a different regex syntax from JFlex - click on the “help” link for more info.
>
>[1] All characters matching \p{Script:Hangul}: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
>[2] Word-Break properties, which in JFlex can be referred to with the abbreviation “WB:” in \p{WB:property-name}, are described in the table at <http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
>
>--
>Steve
>www.lucidworks.com
>
>
>> On Jun 16, 2016, at 7:01 AM, dr <bforevdr@126.com> wrote:
>> 
>> Hi guys
>>   Currently, I'm looking into the rules of StandardTokenizer, but I have run into a problem.
>>    As the docs say, StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It is also generated by JFlex, a lexer/scanner generator.
>> 
>>   In StandardTokenizerImpl.jflex, the regular expressions are expressed as follows:
>> "
>> HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
>> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                       [\p{WB:Format}\p{WB:Extend}]*
>> NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]        [\p{WB:Format}\p{WB:Extend}]*
>> KatakanaEx          = \p{WB:Katakana}                                           [\p{WB:Format}\p{WB:Extend}]*
>> MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]      [\p{WB:Format}\p{WB:Extend}]*
>> ......
>> "
>> What do macros like HangulEx or NumericEx mean?
>> In ClassicTokenizerImpl.jflex, NUM is expressed like this:
>> "
>> P           = ("_"|"-"|"/"|"."|",")
>> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>>           | {HAS_DIGIT} {P} {ALPHANUM}
>>           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>>           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>> "
>> This is easy to understand. '29' , '29.3', '29-3', '29_3' will all be tokenized as NUMBERS.
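
To try the NUM grammar quoted above on a few inputs, the six alternatives can be transcribed into java.util.regex. This is a sketch under assumptions: the ALPHANUM and HAS_DIGIT macros are not quoted in this thread, so simplified definitions are assumed here (a run of letters and digits, and such a run containing at least one digit), and the class name is made up. It tests whole strings against NUM only; it does not reproduce the JFlex longest-match behaviour across all token rules. Note that every alternative contains a {P}, so under this transcription a bare '29' does not match NUM, while '29.3', '29-3' and '29_3' do.

```java
import java.util.regex.Pattern;

class ClassicNumSketch {
    // Assumed, simplified stand-ins for macros that are not quoted in this thread.
    private static final String A = "[\\p{L}\\p{Nd}]+";                          // ALPHANUM (assumption)
    private static final String H = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";   // HAS_DIGIT (assumption)
    // P is taken directly from the quoted grammar: "_" | "-" | "/" | "." | ","
    private static final String P = "[_\\-/.,]";

    // The six NUM alternatives, in the same order as in the quoted grammar.
    private static final Pattern NUM = Pattern.compile("(?:"
        + A + P + H
        + "|" + H + P + A
        + "|" + A + "(?:" + P + H + P + A + ")+"
        + "|" + H + "(?:" + P + A + P + H + ")+"
        + "|" + A + P + H + "(?:" + P + A + P + H + ")+"
        + "|" + H + P + A + "(?:" + P + H + P + A + ")+"
        + ")");

    // True if the whole string matches one of the NUM alternatives.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"29", "29.3", "29-3", "29_3", "abc.def"}) {
            System.out.println(s + " -> " + isNum(s));
        }
    }
}
```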
>> 
>> 
>> 
>> I have read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode Technical Standard #18 (Unicode Regular Expressions), and Unicode Standard Annex #44 (Unicode Character Database), but they contain too much information and are hard to understand.
>> Does anyone have a reference for these kinds of regular expressions, or can anyone tell me where to find the meanings of these Unicode regular expressions?
>> 
>> 
>> Thanks.
>
>