Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Sun, 7 Oct 2012 00:35:03 +0000 (UTC)
From: "Lance Norskog (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <630559801.5782.1349570103161.JavaMail.jiratomcat@arcas>
In-Reply-To: 
 <1713136568.16596.1332750265836.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Comment Edited] (LUCENE-3921) Add decompose compound
 Japanese Katakana token capability to Kuromoji
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121 ]

Lance Norskog edited comment on LUCENE-3921 at 10/7/12 12:33 AM:
-----------------------------------------------------------------

Statistical and rule-based models always have a failure rate. When you use
them, you have to decide what to do about the failures. Attacking the
failures with another model drives toward Zeno's paradox. For
Chinese-language search, breaking the failures into bigrams makes a lot of
sense. The CJK bigram generator creates a massive number of bogus bigrams,
and bogus bigrams cause bogus results from sloppy phrase searches.

Smart Chinese and Kuromoji are not systems for doing natural-language
processing. They are systems for minimizing bogus bigrams. This allows
sloppy phrase queries to find fewer bogus results. In my use case, Smart
Chinese created only about 2% (40k/1.8m) of the possible bigrams.
[SOLR-3653] is the result of my experience supporting search over Chinese
legal documents. I have some useful numbers at the end of the page.
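
To make the "bogus bigram" point concrete, here is a minimal Python sketch.
It is not Lucene code. The word segmentation is written by hand for
illustration and is not actual Smart Chinese output, and the function names
are hypothetical. It compares the overlapping two-character windows that a
pure bigram tokenizer would emit against the bigrams that straddle a word
boundary, which are the ones a segmenter lets you avoid.

```python
def char_bigrams(text):
    """Return every overlapping two-character window of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def straddling_bigrams(words):
    """Return the bigrams that span a boundary between adjacent words."""
    return [left[-1] + right[0] for left, right in zip(words, words[1:])]


# Hand-written segmentation, assumed for illustration only.
words = ["中华", "人民", "共和国"]
text = "".join(words)

all_bigrams = char_bigrams(text)
bogus = straddling_bigrams(words)
kept = [b for b in all_bigrams if b not in bogus]

print("all:  ", all_bigrams)
print("bogus:", bogus)
print("kept: ", kept)
```

Two of the six windows in this toy string cross a word boundary. A sloppy
phrase query can match on those just as easily as on the real words, which
is where the bogus results come from.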


      was (Author: lancenorskog):
    Statistical models and rule-based models always have a failure rate.
When you use them you have to decide what to do about the failures.
Attacking the failures with another model drives toward Zeno's paradox. For
Chinese language search, breaking the failures into bigrams makes a lot of
sense.

Another way to look at this is that Smart Chinese and Kuromoji are systems
for minimizing bogus bigrams. This allows phrase queries to function
without finding bogus results. The CJK bigram creator generates bogus
bigrams, which cause phrase queries to find bogus results. [SOLR-3653] is
the result of my experience in supporting searching Chinese legal
documents. I have some useful numbers at the end of the page.


> Add decompose compound Japanese Katakana token capability to Kuromoji
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: CentOS 5, IPA Dictionary, run with "Search mode"
>            Reporter: Kazuaki Hiraga
>              Labels: features
>
> Kuromoji, the Japanese morphological analyzer, cannot decompose every
> compound Katakana token into sub-tokens. Some Katakana tokens can be
> decomposed, but not all of them. For instance, "トートバッグ" (tote bag)
> and "ショルダーバッグ" (shoulder bag) are not decomposed into
> "トート バッグ" and "ショルダー バッグ", although the IPA dictionary has an
> entry for "バッグ". I would like the decompose feature to apply to every
> Katakana token whose sub-tokens are in the dictionary, or a capability to
> force decomposition of every Katakana token.
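
The requested behavior (split a Katakana compound whenever all of its parts
are dictionary entries) can be sketched roughly as below. This is a toy
illustration in Python, not how Kuromoji works internally. The three-word
dictionary is an assumption (the report only states that "バッグ" is in the
IPA dictionary), and `is_katakana` and `decompose` are hypothetical names.
The sketch uses dynamic programming to pick the split with the fewest parts
and leaves the token unchanged when no full split exists.

```python
def is_katakana(text):
    """True if every character is in the Katakana block (U+30A0..U+30FF)."""
    return bool(text) and all("\u30a0" <= ch <= "\u30ff" for ch in text)


def decompose(token, dictionary):
    """Split a Katakana token into dictionary words, or return it unchanged."""
    if not is_katakana(token):
        return [token]
    # best[i] holds the shortest list of dictionary words covering token[:i],
    # or None if token[:i] cannot be covered at all.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            part = token[start:end]
            if best[start] is not None and part in dictionary:
                candidate = best[start] + [part]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] else [token]


# Toy dictionary, assumed for illustration only.
dictionary = {"トート", "バッグ", "ショルダー"}

tote = decompose("トートバッグ", dictionary)
shoulder = decompose("ショルダーバッグ", dictionary)
unknown = decompose("カタカナ", dictionary)

print(tote, shoulder, unknown)
```

Preferring the fewest parts means a compound that is itself a dictionary
entry stays whole, which is one possible policy. A "force decompose" option
as requested in the report would need a different tie-break.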

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira