lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Best way to create own version of StandardTokenizer ?
Date Mon, 07 Sep 2009 15:03:47 GMT
On Mon, Sep 7, 2009 at 10:47 AM, Paul Taylor <paul_t100@fastmail.fm> wrote:
> Robert Muir wrote:
>>>
>>> I think we would like to implement the complete unicode rules, so if you
>>> could provide us with some code that would be great.
>>>
>>
>> ok, I will followup... what version of lucene are you using, 2.9?
>>
>> ...
>>
>
> Yes

I will update LUCENE-1488 with the latest code so you can steal the
ICUTransformFilter from there.

> I'm obviously misunderstanding I thought that Halfwidth  was an encoding to
> allow storing the most common Chinese characters in a single byte, therefore
> the charcters would be read as different characters if you assumed they were
> using the HalfWidth Encoding rather than Latin Encoding. But are you saying
> Halfwidth characters are actually valid Unicode characters with their own
> distinct unicode value   so can just  use a CharFilter again to map these,
> please confirm.

yes, fullwidth latin forms are distinct characters that have a different width:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Fullwidth:]

so yes, you can use charfilter to map these to their standard latin forms.

beware though, there is a similar issue with halfwidth characters:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Halfwidth:]
example, ナ is halfwidth for the standard ナ
so you might want to include mappings for those as well.

the reason i brought up normalization is because this issue (width) is
a subset of things normalization can help with.
if you click on some of the characters in the two sets i provided you
will notice properties like 'toNFKC' containing the 'standardized'
form.

if in the future, you run into trouble with things in other languages
that aren't matching as expected,
because they aren't being considered the "same" when perhaps they
should, then a more general approach would be applying Unicode
normalization form NFKC in a TokenFilter.

-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message