lucene-dev mailing list archives

From "Christian Moen (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
Date Thu, 11 Oct 2012 15:11:03 GMT


Christian Moen commented on LUCENE-3922:

Thanks, Kazu.

I'm aware of the issue. The current thinking is to rework this as a {{TokenFilter}} and use
anchoring options based on the surrounding tokens to decide whether normalisation should take
place, e.g. when normalising prices, only if the preceding token is ¥ or the following token is 円.
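
A minimal sketch of the anchoring idea, written in Python for brevity rather than as an actual Lucene {{TokenFilter}}. The names {{PREFIX_ANCHORS}}, {{SUFFIX_ANCHORS}} and {{should_normalise}} are hypothetical, and tokens are assumed to be plain surface strings:

```python
# Hypothetical anchor sets for the price case; a real filter would make these configurable.
PREFIX_ANCHORS = {"¥", "￥"}   # half-width and full-width yen sign
SUFFIX_ANCHORS = {"円"}


def should_normalise(tokens, i):
    """Return True if the token at index i is anchored by a neighbouring price marker."""
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    return prev_tok in PREFIX_ANCHORS or next_tok in SUFFIX_ANCHORS


# Usage: only the anchored numeral is a candidate for normalisation.
print(should_normalise(["¥", "三千"], 1))
print(should_normalise(["三千", "円"], 0))
print(should_normalise(["三千", "人"], 0))
```

In a streaming filter, checking the following token would need some form of lookahead or buffering, which this sketch sidesteps by working on a complete list.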

It might also be helpful to use POS info here so that we benefit from what we actually know
about the token, e.g. by not applying normalisation if the POS tag indicates a person name.
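
A rough sketch of the POS check, again in Python rather than against the Lucene API. It assumes IPADIC-style hierarchical tags in which person names start with 名詞-固有名詞-人名; the function name and the (surface, POS) pair representation are hypothetical:

```python
# Assumption: IPADIC-style POS tags, where person names share this prefix.
PERSON_NAME_PREFIX = "名詞-固有名詞-人名"


def candidates_for_normalisation(tagged_tokens):
    """Keep only tokens whose POS tag does not mark them as a person name.

    tagged_tokens is a list of (surface, pos_tag) pairs.
    """
    return [
        surface
        for surface, pos_tag in tagged_tokens
        if not pos_tag.startswith(PERSON_NAME_PREFIX)
    ]


# Usage: the given name 三郎 is skipped, the numeral 三 is kept.
tokens = [("三郎", "名詞-固有名詞-人名-名"), ("三", "名詞-数")]
print(candidates_for_normalisation(tokens))
```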

Other suggestions and ideas are of course most welcome.

> Add Japanese Kanji number normalization to Kuromoji
> ---------------------------------------------------
>                 Key: LUCENE-3922
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>            Reporter: Kazuaki Hiraga
>              Labels: features
>         Attachments: LUCENE-3922.patch
> Japanese people use Kanji numerals instead of Arabic numerals when writing prices, addresses
> and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 十二月 (December).
> So we would like to normalize those Kanji numerals to Arabic numerals (I don't think we
> need the capability to normalize to Kanji numerals).
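
To make the requested conversion concrete, here is a minimal sketch of Kanji-to-Arabic numeral parsing in Python. It is not the attached patch; the function name and tables are hypothetical, and it only handles non-negative integers written with the common digits and the units 十, 百, 千, 万, 億 and 兆, optionally mixed with Arabic digits:

```python
KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def kanji_to_int(text):
    """Convert a numeral such as 12万4800 or 二千五百 to an int."""
    total = 0    # groups already multiplied by 万 / 億 / 兆
    section = 0  # value of the current group below the next large unit
    number = 0   # pending run of digits
    for ch in text:
        if ch.isdecimal():            # ASCII or full-width Arabic digit
            number = number * 10 + int(ch)
        elif ch in KANJI_DIGITS:
            number = number * 10 + KANJI_DIGITS[ch]
        elif ch in SMALL_UNITS:       # a bare 十 means 1 x 10
            section += (number or 1) * SMALL_UNITS[ch]
            number = 0
        elif ch in LARGE_UNITS:       # closes the current group
            total += ((section + number) or 1) * LARGE_UNITS[ch]
            section = 0
            number = 0
        else:
            raise ValueError(f"not a numeral character: {ch!r}")
    return total + section + number


print(kanji_to_int("12万4800"))  # 124800
print(kanji_to_int("十二"))       # 12
```

Address forms such as 三ノ二 contain a separator (ノ) and would have to be split into separate numerals before a function like this is applied.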
