lucene-dev mailing list archives

From "Christian Moen (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
Date Mon, 26 Mar 2012 10:58:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253
] 

Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:57 AM:
------------------------------------------------------------------

Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this might be possible by changing how we emit
unknown words, i.e. by emitting them less greedily and giving the lattice more segmentation
options.  For example, if we find an unknown word トートバッグ (by regular greedy matching),
we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ we'll find
a known word.  When the Viterbi runs, it's likely to choose トート and バッグ as its
best path.
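
To make the idea concrete, here is a minimal toy sketch in Python. It is not Kuromoji code: the names (`KNOWN_WORDS`, `unknown_cost`, `segment`) and all cost values are made up for illustration, connection costs between tokens are ignored, and unknown candidates are emitted at every reachable position for simplicity. It only shows that once every prefix of the unknown word is in the lattice, a lowest-cost path search can reach the known word バッグ.

```python
# Toy lattice + Viterbi sketch. All names and costs are hypothetical;
# the real tokenizer also uses connection costs, which are omitted here.
KNOWN_WORDS = {"バッグ": 2000}  # assumed dictionary entry with an assumed word cost


def unknown_cost(surface):
    # Assumed cost model: a fixed penalty plus a per-character cost.
    return 5000 + 1000 * len(surface)


def prefixes(word):
    # Every non-empty prefix of the unknown word, shortest first.
    return [word[:n] for n in range(1, len(word) + 1)]


def segment(text, greedy=False):
    # Lowest-cost segmentation of a single Katakana run.
    n = len(text)
    inf = float("inf")
    best = [(inf, None)] * (n + 1)  # (cost so far, start of the last token)
    best[0] = (0, None)
    for start in range(n):
        cost_so_far = best[start][0]
        if cost_so_far == inf:
            continue  # no token ends here, so nothing can start here
        run = text[start:]
        # Greedy: only the whole run. Non-greedy: every prefix of the run.
        unknowns = [run] if greedy else prefixes(run)
        candidates = {s: unknown_cost(s) for s in unknowns}
        for word, cost in KNOWN_WORDS.items():
            if text.startswith(word, start):
                candidates[word] = cost  # a dictionary entry replaces the unknown guess
        for surface, cost in candidates.items():
            end = start + len(surface)
            total = cost_so_far + cost
            if total < best[end][0]:
                best[end] = (total, start)
    # Follow the back-pointers from the end of the text.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]
```

With these assumed costs, `segment("トートバッグ", greedy=True)` keeps the compound as one token, because no token ends where バッグ starts, while `segment("トートバッグ")` returns トート and バッグ. Different costs could just as easily favour another path, which is the kind of side effect that needs checking.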

Let me experiment with the lattice details to see whether something like this is feasible.
We are sort of hacking the model here, so we also need to consider side effects.
                
      was (Author: cm):
    Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we emit unknown
words, i.e. by emitting them less greedily and giving the lattice more segmentation options.
For example, if we find an unknown word トートバッグ (by regular greedy matching),
we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, we'll find
a known word, and when the Viterbi runs, it's likely to choose トート and バッグ as
the best path.

Let me have a play by looking into the lattice details and see if something like this is feasible.
                  
> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>         Environment: CentOS 5, IPA Dictionary, Run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> Kuromoji, the Japanese morphological analyzer, cannot decompose every Japanese Katakana
> compound token into sub-tokens. Some Katakana tokens are decomposed, but not all of them.
> For instance, "トートバッグ" (tote bag) and "ショルダーバッグ" (shoulder bag) are not
> decomposed into "トート バッグ" and "ショルダー バッグ", although the IPA dictionary has an
> entry for "バッグ".  I would like the decompose feature to apply to every Katakana token
> whose sub-tokens are in the dictionary, or a capability to force decomposition of every
> Katakana token.
