lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3767) Explore streaming Viterbi search in Kuromoji
Date Tue, 21 Feb 2012 14:42:37 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212623#comment-13212623
] 

Christian Moen commented on LUCENE-3767:
----------------------------------------

Mike,

Thanks a lot for this.  I'd meant to comment on this earlier and I'd like to look further
into the details, but I really like your idea of running the Viterbi in a streaming fashion.

Kuromoji originally split input using two punctuation characters as this would be an articulation
point in the lattice/graph in practice, but your idea is much more elegant and also faithful
to the statistical model.
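
For illustration only, a minimal sketch of that kind of punctuation-based pre-splitting.
It assumes the two characters are the Japanese full stop 。 and comma 、 (the message does
not name them), and split_at_punctuation is a hypothetical helper, not Kuromoji code.

```python
# Assumption: the two split characters are 。 and 、; the message does not name them.
SPLIT_CHARS = ("。", "、")


def split_at_punctuation(text):
    """Cut text into chunks that each end at one of SPLIT_CHARS.

    Each chunk could then be searched independently, which is the
    non-streaming approach the message contrasts with streaming Viterbi.
    """
    chunks = []
    start = 0
    for i, ch in enumerate(text):
        if ch in SPLIT_CHARS:
            # Keep the punctuation character at the end of its chunk.
            chunks.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # Trailing text with no closing punctuation.
        chunks.append(text[start:])
    return chunks
```

Input without any of these characters comes back as a single chunk, so the lattice for
it would grow with the full input length.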

As for dealing with compounds, the penalization is a crude hack as you know, but it turns
out to work quite well in practice as many of the "decompounds" are known to the statistical model.
However, in cases where not all of them are known, we sometimes get wrong decompounds.
I've done some analysis of these cases and it's possible to add more heuristics to deal with
some that are obviously wrong, such as a word starting with a long vowel sound in katakana.
This is a slippery slope that I'm reluctant to pursue...
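
As a rough sketch of what one such heuristic could look like, assuming "long vowel sound"
refers to the prolonged sound mark ー (U+30FC): a proposed split is rejected if any piece
starts with that mark. The function name and the example split are made up for illustration.

```python
# The katakana-hiragana prolonged sound mark, ー (U+30FC).
PROLONGED_SOUND_MARK = "\u30fc"


def is_plausible_decompound(parts):
    """Return False if any piece is empty or starts with the prolonged sound mark.

    A word cannot begin by lengthening a vowel that precedes it, so such a
    piece signals that the compound was cut in the wrong place.
    """
    return all(
        part and not part.startswith(PROLONGED_SOUND_MARK)
        for part in parts
    )
```

For example, splitting コンピュータ into コンピュ and ータ would be rejected, while
コンピュータ plus サイエンス would pass. Each rule like this only covers one failure mode.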

Robert mentioned earlier that he believes IPADIC could have been annotated with compounds
as the documentation mentions them, but they're not part of the IPADIC model we are using.
 If it is possible to get the decompounds from the training data (Kyoto Corpus), a better
overall approach is then to do regular segmentation (normal mode) and then provide the decompounds
directly from the token info for the compounds.  We might need to retrain the model and preserve
the decompounds in order for this to work, but I think it is worth investigating.
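
A minimal sketch of that idea, under the assumption that each dictionary entry could carry
its decompounds. Token, its decompounds field, and expand_compounds are hypothetical names
for illustration; nothing here reflects the actual IPADIC format or Kuromoji API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """A token from normal-mode segmentation (hypothetical structure)."""
    surface: str
    # Assumption: decompounds were preserved in the dictionary entry;
    # the list is empty for tokens that are not compounds.
    decompounds: List[str] = field(default_factory=list)


def expand_compounds(tokens):
    """Replace each compound token by its stored decompounds.

    No penalty or second search is involved: the split comes straight
    from the token info, so it is only as good as the annotation.
    """
    surfaces = []
    for token in tokens:
        if token.decompounds:
            surfaces.extend(token.decompounds)
        else:
            surfaces.append(token.surface)
    return surfaces
```

With a compound such as 関西国際空港 annotated as 関西, 国際, 空港, normal-mode output
would expand to those three pieces while other tokens pass through unchanged.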
                
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
>                 Key: LUCENE-3767
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3767
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3767.patch, LUCENE-3767.patch, LUCENE-3767.patch, compound_diffs.txt
>
>
> I've been playing with the idea of changing the Kuromoji viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict nor
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


