lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
Date Mon, 02 Apr 2012 11:01:22 GMT


Robert Muir commented on LUCENE-3940:

Perhaps it's useful to distinguish between analysis for information retrieval and analysis
for information extraction here?

Yes. Since we are an information retrieval library, there is no sense in adding *traps*
and *complexity* to our analysis API.

I like Michael's and Steven's idea of doing tokenization that doesn't discard any information.

For IR, this is definitely not information... calling it data is a stretch.

If there's an established convention that Tokenizer variants discard punctuation and produce
the terms that are meant to be directly searchable, it sounds like a good idea that we stick
to the convention here as well.

That's what the tokenizers do today: they find tokens (in the IR sense). So yes, there is
an established convention already. Changing this would be a *monster trap*, because suddenly
tons of people would be indexing tons of useless punctuation. I would strongly oppose such
a change.

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>                 Key: LUCENE-3940
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
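
The "hole" described in the quoted issue can be illustrated with a small, self-contained sketch. It is not Lucene's or Kuromoji's code: the class PositionHoleSketch, the Tok type, and the isPunctuation rule are hypothetical names chosen for illustration. The only idea taken from the issue is that a dropped punctuation token should still consume a position, so the next emitted token carries a larger position increment, just as it would if a StopFilter had removed the token afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's or Kuromoji's implementation.
class PositionHoleSketch {

    // A term plus its position increment: the distance from the previously emitted token.
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Simplifying assumption: a token counts as punctuation when it contains
    // no letter or digit code point (an empty string also counts).
    static boolean isPunctuation(String term) {
        return term.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    // Drops punctuation tokens but leaves a hole for each one by adding the
    // number of skipped tokens to the next emitted token's position increment.
    static List<Tok> dropPunctuation(List<String> terms) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String term : terms) {
            if (isPunctuation(term)) {
                skipped++;          // remember the hole instead of forgetting it
                continue;
            }
            out.add(new Tok(term, 1 + skipped));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dropPunctuation(Arrays.asList("東京", "、", "大阪")));
    }
}
```

In this sketch the token after the dropped "、" gets a position increment of 2 rather than 1. Without the hole, both surviving tokens would have an increment of 1 and would look adjacent to phrase queries. Trailing holes (punctuation at the very end of the input) are ignored here for brevity.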
