lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
Date Mon, 09 Nov 2015 23:30:11 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997645#comment-14997645 ]

Adrien Grand commented on LUCENE-6874:
--------------------------------------

I tend to like Uwe's idea. I have often wondered what the actual use cases of WhitespaceTokenizer
are, but I did not suggest removing it because its simplicity kept the maintenance cost very low.
However, now that some controversy is arising, and given how simple it is to create character-based
tokenizers in trunk, e.g. {{Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);}},
maybe we should just remove this tokenizer and let users define it themselves with the more
flexible {{CharTokenizer.fromSeparatorCharPredicate}}?
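
To make the "define it themselves" option concrete, here is a rough JDK-only sketch of a separator predicate that also covers the three non-breaking spaces. The class name {{NbspAwareSplit}}, the {{SEPARATOR}} constant and the {{split}} helper are made up for illustration and are not Lucene code; the helper only stands in for a tokenizer so the effect of the predicate is visible without a Lucene dependency. The same predicate should be usable with {{CharTokenizer.fromSeparatorCharPredicate}}, assuming it accepts an {{IntPredicate}} as it does in trunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class NbspAwareSplit {
    // Hypothetical predicate: everything Character.isWhitespace accepts,
    // plus the three non-breaking spaces that it explicitly excludes.
    static final IntPredicate SEPARATOR = c ->
        Character.isWhitespace(c) || c == 0x00A0 || c == 0x2007 || c == 0x202F;

    // Illustrative stand-in for a character-based tokenizer: splits the
    // text on every code point for which the predicate returns true.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        // Character::isWhitespace keeps "foo<NBSP>bar" together as one token.
        System.out.println(split(text, Character::isWhitespace).size()); // 2
        // The extended predicate also splits on the NBSP.
        System.out.println(split(text, SEPARATOR)); // [foo, bar, baz]
    }
}
```

With this approach the choice of separators stays in user code, so the NBSP question would not have to be settled inside the tokenizer itself.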

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR)
> but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to work around
> but why leave this trap in by default?
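
The behaviour described in the quoted excerpt can be checked with a few lines of plain Java. This is only a small sketch: the class name {{NbspWhitespaceCheck}} is invented for illustration, and {{Character.isSpaceChar}} is not mentioned in the issue; it is included here just as a contrast, since it goes by the Unicode space-separator category alone.

```java
class NbspWhitespaceCheck {
    // The three non-breaking spaces named in the Character.isWhitespace javadoc.
    static final int[] NON_BREAKING_SPACES = {0x00A0, 0x2007, 0x202F};

    public static void main(String[] args) {
        for (int cp : NON_BREAKING_SPACES) {
            // isWhitespace excludes these code points; isSpaceChar only
            // looks at the Unicode category (SPACE_SEPARATOR) and accepts them.
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
        // An ordinary space is accepted by both methods.
        System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
            (int) ' ', Character.isWhitespace(' '), Character.isSpaceChar(' '));
    }
}
```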



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
