lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
Date Sun, 01 Apr 2012 18:02:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243799#comment-13243799 ]

Robert Muir commented on LUCENE-3940:
-------------------------------------

{quote}
I disagree with you, Robert. (If punctuation has no information content, why does it exist?)
IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to
trigger position increment gaps.
{quote}

Punctuation simply doesn't tell you anything about the document: this is a fact. If we start
indexing punctuation, we just create useless terms that occur in every document.

Because of this, nobody wastes their time trying to figure out how to index "punctuation tokens".
Mike's problem is basically the fact that he is creating a compound token of '??'.

Furthermore, the idea that 'if we don't leave a hole for anything removed, we are losing information'
is totally arbitrary, confusing, and inconsistent anyway. How come we leave holes for definiteness
in English but not for plurals in English, but in Arabic or Bulgarian we don't leave holes
for definiteness, because it happens to be attached to the word and stemmed?

                
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
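
To illustrate what "leaving a hole" means in the issue description above, here is a
minimal sketch in plain Java. It does not use Lucene's API: the class HoleSketch and its
positions method are hypothetical names invented for this example. The model is an
assumption that follows the description: each token advances the position by one, and a
dropped token either adds one to the position increment of the next kept token (a hole)
or is ignored entirely (no hole).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; this is not Lucene code and the names are hypothetical.
class HoleSketch {

    // Returns the positions of the kept tokens after removing every term in `dropped`.
    // With leaveHoles = true, each removed token still consumes a position, so the
    // next kept token gets a larger position increment.
    static List<Integer> positions(List<String> terms, Set<String> dropped, boolean leaveHoles) {
        List<Integer> result = new ArrayList<>();
        int position = -1;   // position of the last emitted token
        int skipped = 0;     // removed tokens since the last emitted token
        for (String term : terms) {
            if (dropped.contains(term)) {
                if (leaveHoles) {
                    skipped++;
                }
                continue;
            }
            int positionIncrement = 1 + skipped;
            position += positionIncrement;
            result.add(position);
            skipped = 0;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("foo", ",", "bar");
        Set<String> punctuation = new HashSet<>(Arrays.asList(","));
        System.out.println(positions(terms, punctuation, true));   // prints [0, 2]
        System.out.println(positions(terms, punctuation, false));  // prints [0, 1]
    }
}
```

With holes, "foo" and "bar" are no longer adjacent, so an exact phrase query for the two
terms would not match across the removed punctuation. Without holes they look adjacent,
which is the behavior the issue reports for the tokenizer.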


