lucene-dev mailing list archives

From "Michael McCandless (Updated) (JIRA)" <>
Subject [jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
Date Sun, 01 Apr 2012 14:02:26 GMT


Michael McCandless updated LUCENE-3940:

    Attachment: LUCENE-3940.patch

New test-only patch, breaking out the non-controversial (I think!)
part of the earlier patch.

With this new patch, Kuromoji still silently discards punctuation
(just like StandardAnalyzer), but at least we get better test coverage
in BaseTokenStreamTestCase (BTSTC) to verify that graph tokens are not
messing up their offsets.

I had to turn the new offset check off when testing Kuromoji with
punctuation removal, but it still runs without punctuation removal, so
I think it would likely catch any bugs in how Kuromoji sets the offsets
of compound tokens. At least it's better than not checking at all
(i.e., today).
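
As a rough illustration of the kind of check described above, here is a minimal, self-contained sketch. It is not the actual BaseTokenStreamTestCase code and uses no Lucene classes: `Tok` and `offsetsConsistent` are hypothetical names, and `Tok` only carries the four values a token would expose (position increment, position length, start offset, end offset). The idea is that all tokens leaving the same position must share a start offset, and all tokens arriving at the same position must share an end offset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphOffsetCheck {

    /** Hypothetical stand-in for a token's position and offset attributes. */
    static final class Tok {
        final int posInc;  // position increment relative to the previous token
        final int posLen;  // number of positions the token spans (> 1 for compounds)
        final int start;   // start character offset
        final int end;     // end character offset

        Tok(int posInc, int posLen, int start, int end) {
            this.posInc = posInc;
            this.posLen = posLen;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns true if every position has a single start offset for tokens
     * leaving it and a single end offset for tokens arriving at it.
     */
    static boolean offsetsConsistent(List<Tok> tokens) {
        Map<Integer, Integer> startByPos = new HashMap<>();
        Map<Integer, Integer> endByPos = new HashMap<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;

            // Tokens that start at the same position must agree on start offset.
            Integer seenStart = startByPos.putIfAbsent(pos, t.start);
            if (seenStart != null && seenStart != t.start) {
                return false;
            }

            // Tokens that end at the same position must agree on end offset.
            int endPos = pos + t.posLen;
            Integer seenEnd = endByPos.putIfAbsent(endPos, t.end);
            if (seenEnd != null && seenEnd != t.end) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, a compound spanning three positions whose middle part was dropped without a hole would make a later token arrive at the compound's end position with a different end offset, so the check would return false.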

The only non-test change is that I uncommented an assert in Kuromoji;
I think it's a valid assert.

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>                 Key: LUCENE-3940
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
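
To make "leaving a hole" concrete, here is a small sketch, again without any Lucene dependency. `HoleSketch`, `isPunctuation`, and `positionIncrements` are hypothetical names used only for illustration, and the punctuation test is a simplification (any term with no letter or digit). When a token is dropped with holes, the next surviving token gets a position increment greater than 1, which is what a StopFilter that preserves position increments would produce; without holes, every surviving token gets an increment of 1 and the gap is lost.

```java
import java.util.ArrayList;
import java.util.List;

public class HoleSketch {

    /** Simplified assumption: a term is punctuation if it has no letter or digit. */
    static boolean isPunctuation(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.isLetterOrDigit(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the position increment of each surviving (non-punctuation) term.
     * With leaveHoles == true, each dropped term adds 1 to the increment of
     * the next surviving term; with false, the dropped positions vanish.
     */
    static List<Integer> positionIncrements(List<String> terms, boolean leaveHoles) {
        List<Integer> increments = new ArrayList<>();
        int skipped = 0;
        for (String term : terms) {
            if (isPunctuation(term)) {
                skipped++;  // remember the position this token occupied
                continue;
            }
            increments.add(leaveHoles ? 1 + skipped : 1);
            skipped = 0;
        }
        return increments;
    }
}
```

For the terms 東京, 、, 大阪 this gives increments [1, 2] with holes and [1, 1] without.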
