lucene-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3767) Explore streaming Viterbi search in Kuromoji
Date Thu, 16 Feb 2012 17:53:03 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209540#comment-13209540 ]

Michael McCandless commented on LUCENE-3767:
--------------------------------------------

I think the branch is ready to land... I'll soon post a patch that
can be applied.

In Mode.SEARCH the tokenizer produces the same tokens as current
trunk.

The only real end-user visible change is the addition of
Mode.SEARCH_WITH_COMPOUNDS, which can produce two paths (compound
token + its segmentation).  This mode uses the new
PositionLengthAttribute to record how "long" the compound token is.
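
To illustrate how a position length lets the compound and its
segmentation share one token graph, here is a minimal self-contained
sketch.  It does not use Lucene: the Tok class, the arcs helper, the
romanized terms, and the assumption that the compound is emitted
before its parts are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CompoundTokenGraph {
    /** Hypothetical stand-in for a token with a position increment and a position length. */
    static final class Tok {
        final String term;
        final int posInc;   // how far this token's start advances from the previous token's start
        final int posLen;   // how many positions this token spans
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Renders each token as term[startNode->endNode] in the token graph. */
    static List<String> arcs(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;  // start before the first position; the first increment moves to 0
        for (Tok t : tokens) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "->" + (pos + t.posLen) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed ordering: the compound first (spanning 3 positions), then its
        // parts; the first part has increment 0 so it starts at the same node.
        List<Tok> tokens = List.of(
            new Tok("kansaikokusaikuukou", 1, 3),
            new Tok("kansai", 0, 1),
            new Tok("kokusai", 1, 1),
            new Tok("kuukou", 1, 1));
        // prints [kansaikokusaikuukou[0->3], kansai[0->1], kokusai[1->2], kuukou[2->3]]
        System.out.println(arcs(tokens));
    }
}
```

The point is that both paths start at node 0 and end at node 3, so a
consumer that understands position lengths can follow either one.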

In this mode, the Viterbi search first runs without penalties.  Then,
if the best path contains a too-long token (one whose penalty would
have been > 0), we effectively re-run the Viterbi search over just
that compound token's span, this time with penalties included.  If
this yields a different backtrace, we add it to the output tokens as
well.
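
The two-pass idea can be sketched roughly as below.  This is a toy,
not the code on the branch: the dictionary, the word costs, the
LENGTH_THRESHOLD and PENALTY_PER_CHAR constants, and all method names
are made up for illustration.  It also has no connection costs, so it
does not model the neighboring wordIDs constraining the 2nd
segmentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class TwoPassViterbiSketch {
    static final int LENGTH_THRESHOLD = 4;   // hypothetical: longer words count as "too long"
    static final int PENALTY_PER_CHAR = 10;  // hypothetical penalty per extra character

    static int penalty(String word) {
        return Math.max(0, word.length() - LENGTH_THRESHOLD) * PENALTY_PER_CHAR;
    }

    /** Lowest-cost segmentation of text over dict; empty list if none exists. */
    static List<String> bestPath(String text, Map<String, Integer> dict, boolean withPenalties) {
        int n = text.length();
        long[] cost = new long[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(cost, Long.MAX_VALUE);
        cost[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (cost[start] == Long.MAX_VALUE) continue;   // start is unreachable
                String word = text.substring(start, end);
                Integer wordCost = dict.get(word);
                if (wordCost == null) continue;                // not a dictionary word
                long c = cost[start] + wordCost + (withPenalties ? penalty(word) : 0);
                if (c < cost[end]) { cost[end] = c; back[end] = start; }
            }
        }
        List<String> path = new ArrayList<>();
        if (cost[n] == Long.MAX_VALUE) return path;
        for (int end = n; end > 0; end = back[end]) {          // backtrace
            path.add(text.substring(back[end], end));
        }
        Collections.reverse(path);
        return path;
    }

    /** Each entry is a best-path token, followed by its 2nd segmentation if one was found. */
    static List<List<String>> segment(String text, Map<String, Integer> dict) {
        List<List<String>> out = new ArrayList<>();
        for (String token : bestPath(text, dict, false)) {     // pass 1: no penalties
            List<String> entry = new ArrayList<>();
            entry.add(token);
            if (penalty(token) > 0) {                          // too-long token in the best path
                List<String> second = bestPath(token, dict, true);  // pass 2: with penalties
                if (!second.isEmpty() && !second.equals(List.of(token))) {
                    entry.addAll(second);                      // different backtrace: keep both
                }
            }
            out.add(entry);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = Map.of(
            "in", 2, "new", 4, "york", 4, "city", 4, "newyork", 8, "newyorkcity", 10);
        // prints [[in], [newyorkcity, new, york, city]]
        System.out.println(segment("innewyorkcity", dict));
    }
}
```

In the toy run, the unpenalized pass picks "newyorkcity" as one token;
the penalized re-run over that span prefers "new", "york", "city", so
both the compound and its segmentation are emitted.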

Note that this will not produce the same results as Mode.SEARCH,
because the 2nd segmentation runs "in context" of the best path,
meaning the best wordIDs chosen before and after the compound token
are "enforced" in the 2nd segmentation.  Sometimes this results in
still picking only the compound token where trunk today would have
split it up.  From TestQuality, the total number of edits was 4418 vs
trunk's 4828.

I didn't explore this, but we may want to use harsher penalties in
SEARCH_WITH_COMPOUNDS mode, i.e., since we're going to output the
compound anyway, we may as well "try harder" to produce the 2nd best
segmentation.

I left the default mode as Mode.SEARCH... maybe if we can somehow
run some relevance tests we can make the default SEARCH_WITH_COMPOUNDS.
But it'd also be tricky at query time...

It looks like the rolling Viterbi is a bit faster (~16%: 1700
bytes/msec vs trunk's 1460 bytes/msec on TestQuality.testSingleText).

                
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
>                 Key: LUCENE-3767
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3767
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3767.patch, LUCENE-3767.patch
>
>
> I've been playing with the idea of changing the Kuromoji viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict nor
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...


