lucene-dev mailing list archives

From "Christian Moen (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
Date Tue, 27 Mar 2012 05:33:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239195#comment-13239195 ]

Christian Moen edited comment on LUCENE-3921 at 3/27/12 5:32 AM:
-----------------------------------------------------------------

I've been experimenting with the idea outlined above and I thought I should share some very
early results.

The improvement here is basically to give the compound splitting heuristic an improved ability
to split unknown words that are part of compounds.  Experiments I've run using our compound
splitting test cases suggest that the effect is indeed positive.  The improved heuristic is
able to handle some of the test cases that we couldn't handle earlier, but all of this requires
further experimentation and validation.
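
For context on what these weight tweaks amount to: search mode decompounds by adding a
length-based penalty to long token candidates during the Viterbi search, so that paths through
shorter sub-tokens can win.  A minimal sketch of that shape, with assumed thresholds and
penalty values rather than Kuromoji's actual constants:

    // Illustrative only: thresholds and penalties here are assumptions, not
    // Kuromoji's actual constants.  A penalty is added to the cost of long
    // candidates so that Viterbi paths through shorter sub-tokens can win.
    public class SearchModePenaltySketch {
      static final int KANJI_LENGTH_THRESHOLD = 2;  // assumed
      static final int OTHER_LENGTH_THRESHOLD = 7;  // assumed
      static final int KANJI_PENALTY = 3000;        // assumed
      static final int OTHER_PENALTY = 1700;        // assumed

      static int penalizedCost(int wordCost, int length, boolean allKanji) {
        if (allKanji && length > KANJI_LENGTH_THRESHOLD) {
          return wordCost + KANJI_PENALTY;
        }
        if (!allKanji && length > OTHER_LENGTH_THRESHOLD) {
          return wordCost + OTHER_PENALTY;
        }
        return wordCost;
      }
    }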

I've been able to segment トートバッグ (tote bag, with トート being unknown) and also
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then
it also segmented エンジニアリング (engineering) into エンジニア (engineer) and リング
(ring).
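
If anyone wants to try reproducing these segmentations, something along these lines prints the
tokens search mode emits (constructor signature as on the 4.0 branch; it may differ in other
versions):

    import java.io.StringReader;

    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SegmentationDemo {
      public static void main(String[] args) throws Exception {
        // no user dictionary, discard punctuation, search (decompounding) mode
        JapaneseTokenizer tokenizer = new JapaneseTokenizer(
            new StringReader("ショルダーバッグ"), null, true, Mode.SEARCH);
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }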

It might be possible to tune this up or develop a more advanced heuristic that remedies
this, but I haven't had a chance to look further into it.  Also, any change here would require
extensive testing and validation.  See the evaluation attached to LUCENE-3726 that was done
on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, but we can follow
up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting katakana words
that start with ン、ッ、ー, since we don't want tokens that start with these characters,
and consider adding this as an option to the tokenizer if it works well.
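
The check itself would be cheap; something like the following (the class and method names are
hypothetical, this isn't existing Kuromoji code):

    // Hypothetical helper, not existing Kuromoji code: true if a katakana
    // token starts with a character that cannot begin a Japanese word, in
    // which case we would skip emitting it.
    public class KatakanaStartCheck {
      static boolean startsWithNonInitialKatakana(CharSequence token) {
        if (token.length() == 0) {
          return false;
        }
        char c = token.charAt(0);
        return c == 'ン' || c == 'ッ' || c == 'ー';
      }
    }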

Having said this, there are real limits to what we can achieve by hacking the statistical
model (and it also affects our karma, you know...).  The approach above also has performance
and memory impact.  We'd need to introduce a fairly short limit on how long unknown words
can be, and this could perhaps apply only to unknown katakana words.  The length restriction
would be big enough not to have any practical impact on segmentation, though.
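
To make the cap concrete, the enumeration of unknown-word candidates would simply stop at a
fixed maximum length; a sketch, where the constant value and the enumeration shape are
assumptions, not Kuromoji's actual code:

    import java.util.ArrayList;
    import java.util.List;

    public class UnknownWordCapSketch {
      static final int MAX_UNKNOWN_KATAKANA_LEN = 16;  // assumed value

      // Enumerate unknown-word candidates starting at pos, stopping at the
      // cap so the lattice stays bounded.
      static List<String> unknownCandidatesAt(String text, int pos) {
        List<String> candidates = new ArrayList<String>();
        int maxLen = Math.min(text.length() - pos, MAX_UNKNOWN_KATAKANA_LEN);
        for (int len = 1; len <= maxLen; len++) {
          candidates.add(text.substring(pos, pos + len));
        }
        return candidates;
      }
    }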

An alternative approach to all of this is to build some lexical assets.  I think we'd get
pretty far for katakana if we apply some of the corpus-based compound-splitting algorithms
European NLP researchers have developed.  Some of these algorithms are pretty simple and quite
effective.
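
To give one concrete example, the frequency-based splitter Koehn and Knight described for
German compounds tries every split point and keeps the split whose parts have the highest
geometric mean of corpus frequencies.  A sketch, with the frequency map standing in for real
corpus counts:

    import java.util.Map;

    public class FrequencySplitterSketch {
      // Pick the binary split whose parts maximize the geometric mean of
      // their corpus frequencies; keep the whole word if no split scores
      // higher.  'freq' is a hypothetical map from word to corpus count.
      static String[] split(String word, Map<String, Long> freq, int minPartLen) {
        String[] best = { word };
        double bestScore = freq.containsKey(word) ? freq.get(word) : 0;
        for (int i = minPartLen; i <= word.length() - minPartLen; i++) {
          String left = word.substring(0, i);
          String right = word.substring(i);
          long leftFreq = freq.containsKey(left) ? freq.get(left) : 0;
          long rightFreq = freq.containsKey(right) ? freq.get(right) : 0;
          double score = Math.sqrt((double) leftFreq * rightFreq);
          if (score > bestScore) {
            bestScore = score;
            best = new String[] { left, right };
          }
        }
        return best;
      }
    }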

Thoughts?

> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>         Environment: CentOS 5, IPA Dictionary, Run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> Kuromoji, the Japanese morphological analyzer, doesn't have the capability to decompose every
> Japanese Katakana compound token into sub-tokens.  It seems that some Katakana tokens can be
> decomposed, but the decomposition cannot be applied to every Katakana compound token.  For
> instance, "トートバッグ (tote bag)" and "ショルダーバッグ (shoulder bag)" don't decompose
> into "トート バッグ" and "ショルダー バッグ" although the IPA dictionary has "バッグ" in
> its entries.  I would like to apply the decompose feature to every Katakana token if the
> sub-tokens are in the dictionary, or add the capability to force applying the decompose
> feature to every Katakana token.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
