lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
Date Mon, 02 Nov 2015 17:43:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985598#comment-14985598 ]

David Smiley commented on LUCENE-6874:
--------------------------------------

bq. So maybe we should solve this problem by adding some documentation?

If the vast majority (like 90%+) of users that currently use WhitespaceTokenizer would want
to tokenize on NBSP, then I don't think documentation is sufficient at all.  Documentation of
something most people would want to change is very easy to overlook.  That's what I call a _trap_,
even though there might be some uses for the current behavior.  Lucene should do what most users
want it to do by default.  As Jack said, the users of the search platform don't care what
Java's definition of Character.isWhitespace is.

I propose that WhitespaceTokenizerFactory have a flag for this, and that the flag default to
treating NBSP as whitespace, depending on Lucene's Version.

I get Uwe's point that there are other Tokenizers.  But I disagree that WhitespaceTokenizer
shouldn't be used for "classical full text".  For example, StandardTokenizer tokenizes on hyphens
and thus foils some of the benefit of WordDelimiterFilter.  Maybe ICUTokenizer is an answer;
I haven't checked its interaction with WDF.  But why can't we have a tokenizer that
simply tokenizes on all whitespace?
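
To make the proposal concrete, here is a minimal sketch in plain Java of splitting on all whitespace with an opt-in flag for the non-breaking spaces. It does not use Lucene's API; the class name AllWhitespaceSplit, the isSeparator helper and the splitOnNbsp parameter are hypothetical names chosen for illustration, not anything that exists in WhitespaceTokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;

public class AllWhitespaceSplit {

    // Hypothetical separator test: Java's whitespace definition, optionally
    // extended with the three non-breaking spaces that it excludes.
    static boolean isSeparator(int cp, boolean splitOnNbsp) {
        if (Character.isWhitespace(cp)) {
            return true;
        }
        return splitOnNbsp && (cp == 0x00A0 || cp == 0x2007 || cp == 0x202F);
    }

    // Splits text into tokens at separator code points and leaves everything
    // else (hyphens included) inside the tokens.
    static List<String> tokenize(String text, boolean splitOnNbsp) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp, splitOnNbsp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "wi-fi\u00A0router setup";
        System.out.println(tokenize(text, true));
        System.out.println(tokenize(text, false));
    }
}
```

With the flag on, the NBSP between "wi-fi" and "router" separates two tokens; with it off, they stay glued together as one token, which is the behavior the issue calls a trap. Hyphenated terms stay intact either way, so a downstream filter such as WordDelimiterFilter would still see them whole.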

I'll have to see the links Rob just posted; I haven't read them yet.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to work around but why leave this trap in by default?
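
The excerpt quoted in the issue description can be checked directly against the JDK. The sketch below compares Character.isWhitespace with Character.isSpaceChar for the three non-breaking spaces it names; the class name NbspCheck and its helper methods are hypothetical, only the two Character methods come from the JDK.

```java
public class NbspCheck {

    // The three non-breaking spaces listed in the Character.isWhitespace Javadoc.
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    // Java's whitespace test, the one the issue says WhitespaceTokenizer relies on.
    static boolean javaWhitespace(int cp) {
        return Character.isWhitespace(cp);
    }

    // Unicode space test: SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR.
    static boolean unicodeSpace(int cp) {
        return Character.isSpaceChar(cp);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, javaWhitespace(cp), unicodeSpace(cp));
        }
    }
}
```

For each of the three code points, isWhitespace returns false while isSpaceChar returns true. Note that isSpaceChar is not a drop-in replacement: it returns false for control characters such as tab, which isWhitespace accepts, so a tokenizer that wants "all whitespace" would need to combine the two tests.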



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

