lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
Date Mon, 15 Nov 2010 18:26:13 GMT
Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
---------------------------------------------------------------

                 Key: LUCENE-2763
                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
             Fix For: 3.1, 4.0


Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes
email addresses and URLs, but it provides no way to turn this behavior off, nor a way to emit
overlapping tokens for the components (username from email address, hostname from URL, etc.).
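
To make the difference concrete, here is a toy sketch in plain Java (no Lucene dependency). Everything in it is an assumption for illustration: the class name TokenizerSketch, the simplified regexes, and the sample text are hypothetical, and none of it is Lucene's actual grammar. The word-only pattern is a crude stand-in and does not follow the UAX#29 rules (for example, real UAX#29 does not break at a period between letters). It only shows the general idea of emitting an email address or URL as one token versus emitting the pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: these regexes are simplifications, not Lucene's
// grammar and not an implementation of the UAX#29 word boundary rules.
class TokenizerSketch {
    // Hypothetical, simplified patterns for email addresses and http(s) URLs.
    private static final String EMAIL =
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";
    private static final String URL = "https?://[^\\s]+";
    private static final String WORD = "[A-Za-z0-9]+";

    // Behavior (a): URLs and email addresses are emitted as single tokens.
    static final Pattern WITH_URL_EMAIL =
        Pattern.compile(URL + "|" + EMAIL + "|" + WORD);

    // Behavior (b): only plain alphanumeric runs are emitted.
    static final Pattern WORDS_ONLY = Pattern.compile(WORD);

    // Collect every match of the pattern, left to right, as one token each.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text =
            "Mail dev@lucene.apache.org or see https://issues.apache.org/jira";
        System.out.println(tokenize(WITH_URL_EMAIL, text));
        System.out.println(tokenize(WORDS_ONLY, text));
    }
}
```

With the first pattern the address and the URL each come out as a single token; with the second they are broken into their alphanumeric pieces. Neither mode in this sketch produces the overlapping tokens mentioned above.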

UAX29Tokenizer should become StandardTokenizer, and the current StandardTokenizer should be renamed
to something like UAX29TokenizerPlusPlus.

For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



