lucene-dev mailing list archives

From "Christian Moen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
Date Sat, 06 Oct 2012 10:37:03 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470967#comment-13470967 ]

Christian Moen commented on LUCENE-3921:
----------------------------------------

Lance,

The idea I had in mind for Japanese uses language-specific characteristics of katakana terms
and perhaps dictionary-specific weights as well.  However, we are hacking our statistical
model here, and there are limits to how far we can go with this.

I don't know a whole lot about the Smart Chinese toolkit, but I believe the same approach
to compound segmentation could work for Chinese as well.  However, the weights and the
implementation would likely be separate.  Note that the above is really about one specific
kind of compound segmentation that applies to Japanese, so the thinking was to add heuristics
for this specific type, which is particularly tricky.

It might also be a good idea to approach this problem using the {{DictionaryCompoundWordTokenFilter}}
and collectively build some lexical assets for compound splitting for the relevant languages,
rather than hacking our models.
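
To make the dictionary-driven idea concrete, here is a minimal sketch in Python of splitting a
katakana compound against a word list.  It is only an illustration: the function name
{{split_compound}}, the {{min_len}} parameter and the tiny word list are all hypothetical, and it
does not reproduce the matching rules or options of {{DictionaryCompoundWordTokenFilter}} or
anything in Kuromoji.

```python
def split_compound(token, dictionary, min_len=2):
    """Split `token` into dictionary words, preferring longer pieces first.

    Returns the list of pieces if the whole token can be covered by two or
    more dictionary entries; otherwise returns [token] unchanged.
    """
    def search(start):
        # Reaching the end of the token means everything before it matched.
        if start == len(token):
            return []
        # Try the longest candidate first, then shorter ones (backtracking).
        for end in range(len(token), start + min_len - 1, -1):
            piece = token[start:end]
            if piece in dictionary:
                rest = search(end)
                if rest is not None:
                    return [piece] + rest
        return None  # no full cover from this position

    parts = search(0)
    return parts if parts and len(parts) > 1 else [token]


# Hypothetical word list; a real one would come from shared lexical assets.
words = {"トート", "バッグ", "ショルダー"}
print(split_compound("トートバッグ", words))
print(split_compound("ショルダーバッグ", words))
```

A token that cannot be fully covered by dictionary entries is returned as is, so the sketch never
emits partial splits.  Whether that is the right policy for search is a separate question.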
                
> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: Cent OS 5, IPA Dictionary, Run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> The Japanese morphological analyzer Kuromoji cannot decompose every Japanese Katakana
> compound token into sub-tokens. Some Katakana tokens can be decomposed, but this does not
> apply to every Katakana compound token. For instance, "トートバッグ" (tote bag) and
> "ショルダーバッグ" are not decomposed into "トート バッグ" and "ショルダー バッグ",
> although the IPA dictionary has an entry for "バッグ".  I would like to apply the decompose
> feature to every Katakana token whose sub-tokens are in the dictionary, or add the
> capability to force-apply the decompose feature to every Katakana token.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

