lucene-dev mailing list archives

From "Christian Moen (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji
Date Tue, 31 Jul 2012 03:37:34 GMT


Christian Moen commented on LUCENE-3922:

I've attached a work-in-progress patch for {{trunk}} that implements a {{CharFilter}} that
normalizes Japanese numbers.

These are some TODOs and implementation considerations I have that I'd be thankful to get
feedback on:

* Buffering the entire input on the first read should be avoided.  The primary reason this
is done is that I was thinking of adding regexps before and after kanji numeric strings
to qualify their normalization, e.g. to only normalize strings that start with ¥ or JPY, or
end with 円, so that only monetary amounts in Japanese yen are normalized.  However, this
probably isn't necessary, as we can probably use {{Matcher.requireEnd()}} and {{Matcher.hitEnd()}}
to decide if we need to read more input. (Thanks, Robert!)

* Is qualifying the numbers to be normalized with prefix and suffix regexps useful, e.g. to
only normalize monetary amounts?

* How do we deal with leading zeros?  Currently, "007" and "◯◯七" both become "7".
 Do we want an option to preserve leading zeros?

* How large do the numbers we support need to be?  Some of the characters for the larger
numbers are surrogate pairs, which complicates implementation, but they're certainly possible.
If we don't care about really large numbers, we can probably be fine working with {{long}}
instead of {{BigInteger}}.

* Polite numbers and some other variants aren't supported, e.g. 壱, 弐, 参, etc., but they
can easily be added.  We can also add the obsolete variants if that's useful somehow.  Are
these useful?  Do we want them available via an option?

* Number formats such as "1億2,345万6,789" aren't supported, since we don't
deal with the comma today, but this can be added.  The same applies to "12 345",
where a space separates thousands as in French.  Numbers like "2・2兆" aren't
supported either, but can be added.

* Only integers are supported today, so we can't parse "〇・一二三四", which becomes
"0" and "1234" as separate tokens instead of "0.1234".

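To make the integer-only case above concrete, here is a rough sketch of how a mixed kanji/Arabic numeral string could be folded into a {{long}}.  This is not the attached patch; the class and method names ({{KanjiNumberSketch}}, {{parse}}, etc.) are made up for illustration, and it assumes we stop at 兆 so that the result always fits in a {{long}}.  Commas, spaces, decimals and the polite variants are deliberately left out, and characters are written as Unicode escapes so the source stays ASCII.

```java
/**
 * Illustrative sketch only: folds a string of kanji and/or Arabic digits
 * into a long. All names here are hypothetical, not taken from the patch.
 */
final class KanjiNumberSketch {

    // Kanji digits zero through nine: U+3007, U+4E00, U+4E8C, U+4E09, U+56DB,
    // U+4E94, U+516D, U+4E03, U+516B, U+4E5D. The index is the digit value.
    private static final String KANJI_DIGITS =
        "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    /** Returns 0-9 for a kanji, ASCII or full-width digit, or -1 otherwise. */
    static int digitValue(char c) {
        if (c == '\u25EF') {            // large circle, sometimes used as zero
            return 0;
        }
        int k = KANJI_DIGITS.indexOf(c);
        return k >= 0 ? k : Character.digit(c, 10);
    }

    /** Returns the multiplier for a unit character, or -1 if c is not a unit. */
    static long unitValue(char c) {
        switch (c) {
            case '\u5341': return 10L;                  // juu
            case '\u767E': return 100L;                 // hyaku
            case '\u5343': return 1_000L;               // sen
            case '\u4E07': return 10_000L;              // man
            case '\u5104': return 100_000_000L;         // oku
            case '\u5146': return 1_000_000_000_000L;   // chou
            default:       return -1L;
        }
    }

    /** Parses the whole string; throws on unknown characters or long overflow. */
    static long parse(String s) {
        long total = 0;      // completed groups (man, oku, chou)
        long section = 0;    // value below the next large unit
        long current = 0;    // pending run of plain digits
        boolean digitSeen = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int d = digitValue(c);
            long unit = unitValue(c);
            if (d >= 0) {
                current = Math.addExact(Math.multiplyExact(current, 10L), d);
                digitSeen = true;
            } else if (unit > 0 && unit < 10_000L) {
                // A small unit with no preceding digit counts once (juu == 10).
                long m = digitSeen ? current : 1L;
                section = Math.addExact(section, Math.multiplyExact(m, unit));
                current = 0;
                digitSeen = false;
            } else if (unit >= 10_000L) {
                section = Math.addExact(section, current);
                long m = (section == 0 && !digitSeen) ? 1L : section;
                total = Math.addExact(total, Math.multiplyExact(m, unit));
                section = 0;
                current = 0;
                digitSeen = false;
            } else {
                throw new IllegalArgumentException("Unsupported character at index " + i);
            }
        }
        return Math.addExact(Math.addExact(total, section), current);
    }

    public static void main(String[] args) {
        System.out.println(parse("12\u4E074800"));       // 12 man 4800
        System.out.println(parse("\u25EF\u25EF\u4E03")); // circle, circle, seven
    }
}
```

With this approach leading zeros simply vanish ("007" and "◯◯七" both come out as 7), so preserving them would need to be handled outside the numeric value.  Going beyond 兆 would mean switching the accumulators to {{BigInteger}} and iterating by code point rather than {{char}}.
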
There are probably other considerations, too, that don't immediately come to mind.
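
On the buffering point in the first item, the idea of using {{Matcher.hitEnd()}} and {{Matcher.requireEnd()}} could look roughly like the sketch below.  The pattern and the names ({{HitEndSketch}}, {{YEN_AMOUNT}}, {{needsMoreInput}}) are assumptions for illustration only and are not what the patch does; the point is just that the matcher can tell us whether more input might change the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and pattern are hypothetical. */
final class HitEndSketch {

    // Hypothetical qualifier: a run of ASCII digits followed by the yen sign U+5186.
    static final Pattern YEN_AMOUNT = Pattern.compile("[0-9]+\u5186");

    /**
     * Returns true if the text buffered so far should be extended with more
     * input before the result of the search can be trusted.
     */
    static boolean needsMoreInput(CharSequence buffered) {
        Matcher m = YEN_AMOUNT.matcher(buffered);
        boolean found = m.find();
        // hitEnd(): the search ran into the end of the buffer, so more input
        // could produce a match or change the one that was found.
        // requireEnd(): a found match depended on an end anchor and could be
        // lost with more input (never true for this anchor-free pattern).
        return m.hitEnd() || (found && m.requireEnd());
    }
}
```

For example, {{needsMoreInput("price 123")}} returns true because the digits run up to the end of the buffer, while a buffer that continues past "123円" returns false and can be processed right away.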

Numbers are fairly complicated and feedback on direction for further implementation is most
appreciated.  Thanks.
> Add Japanese Kanji number normalization to Kuromoji
> ---------------------------------------------------
>                 Key: LUCENE-3922
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>            Reporter: Kazuaki Hiraga
>              Labels: features
>         Attachments: LUCENE-3922.patch
> Japanese people use Kanji numerals instead of Arabic numerals for writing prices, addresses
> and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 十二月 (December).
> So, we would like to normalize those Kanji numerals to Arabic numerals (I don't think we
> need to have a capability to normalize to Kanji numerals).
