lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
Date Sun, 07 Oct 2012 00:35:03 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121
] 

Lance Norskog edited comment on LUCENE-3921 at 10/7/12 12:33 AM:
-----------------------------------------------------------------

Statistical models and rule-based models always have a failure rate. When you use them you
have to decide what to do about the failures. Attacking the failures with yet another model
drives toward Zeno's Paradox. For Chinese-language search, breaking the failures into bigrams
makes a lot of sense. The CJK bigram generator, by contrast, creates a massive number of bogus
bigrams, and bogus bigrams cause bogus results from sloppy phrase searches.

Smart Chinese and Kuromoji are not systems for doing natural-language processing; they are
systems for minimizing bogus bigrams, which lets sloppy phrase queries find fewer bogus
results. In my use case, Smart Chinese created only 2% (40k/1.8m) of the possible bigrams.
[SOLR-3653] is the result of my experience supporting search over Chinese legal documents.
I have some useful numbers at the end of that page.
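To make the contrast concrete, here is a small sketch of blind bigram shingling versus
dictionary segmentation. This is plain illustrative Python, not Lucene's actual filter API,
and the sample text and dictionary are made up:

```python
# Conceptual sketch: why character shingling emits far more units
# than dictionary-based segmentation. Not Lucene code; the text and
# dictionary below are toy examples.

def cjk_bigrams(text):
    """Emit every adjacent character pair (what blind CJK shingling does)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def dict_segment(text, dictionary):
    """Greedy longest-match segmentation against a word dictionary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Take the longest dictionary word, or fall back to one char.
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

text = "中华人民共和国"
dictionary = {"中华", "人民", "共和国"}

print(cjk_bigrams(text))              # 6 bigrams, some spanning word boundaries
print(dict_segment(text, dictionary)) # 3 dictionary words
```

On the seven-character sample, shingling emits six bigrams, some of which straddle word
boundaries and can never be real words; the segmenter emits three, so a sloppy phrase query
has far fewer bogus units to accidentally match.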


                
      was (Author: lancenorskog):
    Statistical models and rule-based models always have a failure rate. When you use them
you have to decide what to do about the failures. Attacking the failures with another model
drives toward Zeno's Paradox. For Chinese-language search, breaking the failures into bigrams
makes a lot of sense.

Another way to look at this is that Smart Chinese and Kuromoji are systems for minimizing
bogus bigrams. This allows phrase queries to function without finding bogus results. The CJK
bigram creator generates bogus bigrams, which cause phrase queries to find bogus results.
[SOLR-3653] is the result of my experience in supporting searching Chinese legal documents.
I have some useful numbers at the end of the page.

> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: CentOS 5, IPA dictionary, run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> Japanese morphological analyzer Kuromoji doesn't have the capability to decompose every
Japanese Katakana compound token into sub-tokens. Some Katakana tokens can be decomposed,
but the feature is not applied to every Katakana compound token. For instance, "トートバッグ (tote
bag)" and "ショルダーバッグ (shoulder bag)" don't decompose into "トート バッグ" and "ショルダー
バッグ" although the IPA dictionary has "バッグ" in its entries.  I would like the decompose
feature applied to every Katakana token whose sub-tokens are in the dictionary, or
the capability to force the decompose feature for every Katakana token.
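The requested behavior can be sketched as dictionary-driven compound splitting: decompose a
Katakana compound only when it splits cleanly into dictionary entries, otherwise keep it
whole. This is illustrative Python, not Kuromoji's API, and the dictionary below is a toy:

```python
# Hypothetical sketch of dictionary-based Katakana decomposition.
# Not Kuromoji code; the dictionary is a toy stand-in for IPA entries.

def decompose_katakana(token, dictionary):
    """Return dictionary sub-tokens if the whole token splits cleanly,
    else return the unsplit token."""
    n = len(token)
    # dp[i] holds one segmentation of token[:i] into dictionary words.
    dp = {0: []}
    for i in range(1, n + 1):
        for j in range(i):
            if j in dp and token[j:i] in dictionary:
                dp[i] = dp[j] + [token[j:i]]
                break
    return dp.get(n, [token])  # fall back to the compound as-is

dictionary = {"トート", "ショルダー", "バッグ"}
print(decompose_katakana("ショルダーバッグ", dictionary))  # splits in two
print(decompose_katakana("トートバッグ", dictionary))      # splits in two
print(decompose_katakana("ランドセル", dictionary))        # stays whole
```

The fallback on the last line is what keeps this safe as a default: a compound with no full
dictionary cover is indexed unchanged rather than shredded into fragments.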

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

