lucene-solr-user mailing list archives

From Shawn Heisey <s...@elyograg.org>
Subject Re: ICUTokenizer acting very strangely with oriental characters
Date Wed, 13 Aug 2014 17:53:51 GMT
On 8/12/2014 9:13 PM, Steve Rowe wrote:
> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
> break = not IsSameS) decisions are based on what I mentioned in my previous
> message, and the "WBreak" (word break) decision is based on UAX#29 word
> break rules:
>
> Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
> ----  ----------  ------  --------  -------  -------
> 治    U+6CBB      Han     Yes       No       Yes
> ]     U+005D      Common  Yes       No       Yes
> ,     U+002C      Common  Yes       No       Yes
> 1     U+0031      Common  --        --       --
>
> First, script boundaries are found and used as token boundaries - in the
> above case, no script boundary is found between "治" and "1" - and then
> UAX#29 word break rules are used to find token boundaries in between script
> boundaries - in the above case, there are word boundaries between each
> character, but ICUTokenizer throws away punctuation-only sequences between
> token boundaries.
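
To make the first step described above concrete, here is a rough sketch of how script boundaries might be found when Common characters take on the script of the run they sit in. This is not ICUTokenizer's code: toy_script and script_runs are hypothetical helpers, and the script lookup is a hard-coded approximation covering only Han, ASCII letters, and "everything else is Common". The second step (UAX#29 word breaks and dropping punctuation-only sequences) is not modeled.

```python
def toy_script(ch):
    """Hypothetical, very rough script lookup (not ICU's real script data)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs block
        return "Han"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    # Assumption: digits, punctuation and anything else count as Common here.
    return "Common"


def script_runs(text):
    """Split text wherever the script changes.

    Common characters never cause a break: they join whatever run
    they appear in, mirroring the "IsSameS? = Yes" rows in the table.
    """
    runs = []
    current = ""
    run_script = None  # first non-Common script seen in the current run
    for ch in text:
        script = toy_script(ch)
        if script != "Common" and run_script is not None and script != run_script:
            # A different concrete script starts here: close the current run.
            runs.append(current)
            current = ""
            run_script = None
        current += ch
        if script != "Common" and run_script is None:
            run_script = script
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    print(script_runs("治],1"))    # no script boundary in this string
    print(script_runs("治],abc"))  # boundary where the Latin letters begin
```

Under these assumptions the digit "1" stays in the same run as "治", because nothing between them belongs to a different concrete script, whereas Latin letters in the same position would start a new run.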

What should we use as a dividing character for situations like this? 
Should we tell our customer that they can't start keywords like this
(for searching/filtering) with a number?

Thanks,
Shawn

