lucene-dev mailing list archives

From "Steven Rowe (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
Date Sun, 01 Apr 2012 16:58:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243784#comment-13243784 ]

Steven Rowe commented on LUCENE-3940:
-------------------------------------

bq. I think it's well accepted that words carry the information content of a doc, punctuation
has no information content really here, it doesn't tell me what the doc is about, and I don't
think this is controversial, I just think your view on this is extreme...

I disagree with you, Robert.  (If punctuation has no information content, why does it exist?)
 IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to
trigger position increment gaps.
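
To make the "position increment gap" idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class name, the method name and the gap size of 10 are all assumptions made for illustration. Sentence-ending punctuation is not emitted as a token, but it pushes the next word further away, so an exact phrase match could not span the sentence boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PunctuationGapSketch {
    // Hypothetical gap size, chosen only for illustration.
    static final int SENTENCE_GAP = 10;

    // Returns the position assigned to each word; sentence-ending
    // punctuation is dropped but enlarges the next word's increment.
    static List<Integer> positions(List<String> segments) {
        List<Integer> out = new ArrayList<>();
        int pos = -1; // so that the first word lands on position 0
        int inc = 1;  // increment to apply to the next emitted word
        for (String s : segments) {
            if (s.equals(".") || s.equals("!") || s.equals("?")) {
                inc += SENTENCE_GAP; // punctuation triggers the gap
                continue;
            }
            pos += inc;
            out.add(pos);
            inc = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positions(Arrays.asList("see", "spot", ".", "run", "now")));
    }
}
```

With these assumed values, "spot" and "run" end up 11 positions apart instead of adjacent, which is the kind of signal that is lost if punctuation is discarded without a trace.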

bq. StandardTokenizer doesn't leave holes when it drops punctuation, I think holes should only
be real 'words' for the most part

"Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word *boundaries*.
 Lucene has grafted onto these boundary rules an assumption that only alphanumeric "words"
should be tokens - this assumption does not exist in the standard itself.

My opinion is that we should have both types of things: a tokenizer that discards non-alphanumeric
characters between word boundaries; and a different type of analysis component that discards
nothing.  I think of the discard-nothing process as *segmentation* rather than tokenization,
and I've [argued for it previously|https://issues.apache.org/jira/browse/LUCENE-2498?focusedCommentId=12878963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12878963].
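
The segmentation/tokenization distinction can be sketched with the JDK's java.text.BreakIterator, used here only as a stand-in for a word-boundary algorithm. Its rules are not claimed to match UAX#29 or StandardTokenizer exactly, and the class and method names below are hypothetical. segment() keeps every span between boundaries; tokenize() layers the "only alphanumeric words are tokens" assumption on top.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SegmentationSketch {
    // Discard nothing: every span between two word boundaries is returned,
    // including punctuation and whitespace.
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // Tokenization as a filter over segmentation: keep only the segments
    // that contain at least one letter or digit.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String s : segment(text)) {
            if (s.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("Hello, world!"));
        System.out.println(tokenize("Hello, world!"));
    }
}
```

Concatenating the output of segment() reproduces the input exactly, so a downstream component can still decide what to do with each punctuation segment; after tokenize() that choice is gone.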
                
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
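
The "leave a hole" behaviour in the description above can be sketched in plain Java as follows. This is not the Kuromoji or StopFilter code: the class, the method and the punctuation test are assumptions for illustration, and only the simple non-graph case is covered. Each dropped punctuation token adds one to the position increment of the next surviving token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HoleSketch {
    // Assumed definition for this sketch: a token with no letter or digit.
    static boolean isPunctuation(String s) {
        return s.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    // Returns the position increment of each surviving token.
    // With leaveHoles, a removed token still consumes one position.
    static List<Integer> positionIncrements(List<String> tokens, boolean leaveHoles) {
        List<Integer> out = new ArrayList<>();
        int skipped = 0; // removed tokens since the last emitted one
        for (String t : tokens) {
            if (isPunctuation(t)) {
                if (leaveHoles) {
                    skipped++;
                }
                continue;
            }
            out.add(1 + skipped);
            skipped = 0;
        }
        // Holes after the last surviving token are not represented here.
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("foo", ",", "bar");
        System.out.println(positionIncrements(tokens, false));
        System.out.println(positionIncrements(tokens, true));
    }
}
```

Without holes the two words look adjacent (increments 1, 1); with holes the second word carries an increment of 2, which is the same shape a stop filter would produce if it removed the punctuation token afterwards.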

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

