lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (LUCENE-1556) some valid email address characters not correctly recognized
Date Wed, 29 Sep 2010 05:50:33 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-1556.
---------------------------------

    Fix Version/s: 3.1
                   4.0
       Resolution: Fixed

fixed in LUCENE-2167

> some valid email address characters not correctly recognized
> ------------------------------------------------------------
>
>                 Key: LUCENE-1556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1556
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>            Reporter: Paul Nilsson
>            Priority: Trivial
>             Fix For: 3.1, 4.0
>
>
> The EMAIL expression in StandardTokenizerImpl.jflex misses some unusual but valid characters on the left-hand side of the email address. This causes an address to be broken into several tokens, for example:
> somename+site@gmail.com gets broken into "somename" and "site@gmail.com"
> husband&wife@talktalk.net gets broken into "husband" and "wife@talktalk.net"
> These seem to be occurring more often. The first seems to stem from an anti-spam trick you can use with Google (see: http://labnol.blogspot.com/2007/08/gmail-plus-smart-trick-to-find-block.html). I see the second in several domains, but a disproportionate number are from talktalk.net, so I expect it's a signup suggestion from the service.
> Perhaps a fix would be to change line 102 of StandardTokenizerImpl.jflex from:
> EMAIL      =  {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
> to 
> EMAIL      =  {ALPHANUM} (("."|"-"|"_"|"+"|"&") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
> I'm aware that the StandardTokenizer is meant to be more of a basic implementation than an implementation of the full standard, but it is quite useful in places, and hopefully this would improve it slightly.
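
The effect of the proposed change can be illustrated outside JFlex with a rough java.util.regex translation of the two EMAIL rules quoted above. This is only a sketch under stated assumptions: ALPHANUM is approximated here as ASCII letters and digits (the real macro in the grammar is broader), the class and method names are hypothetical, and first.last@example.org is an illustrative address that does not appear in the report. It checks whether a whole address is accepted by a rule, which stands in for "kept as a single token"; it does not reproduce the tokenizer's actual longest-match behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not part of Lucene.
final class EmailPatternSketch {
    // Assumption: ALPHANUM approximated as one or more ASCII letters or digits.
    private static final String ALPHANUM = "[A-Za-z0-9]+";

    // Domain part shared by both rules: {ALPHANUM} (("."|"-") {ALPHANUM})+
    private static final String DOMAIN = ALPHANUM + "(?:[.-]" + ALPHANUM + ")+";

    // Original rule: the local part allows ".", "-" and "_" as separators.
    static final Pattern ORIGINAL =
        Pattern.compile(ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@" + DOMAIN);

    // Proposed rule: the local part additionally allows "+" and "&".
    static final Pattern PROPOSED =
        Pattern.compile(ALPHANUM + "(?:[._+&-]" + ALPHANUM + ")*@" + DOMAIN);

    // True when the rule accepts the entire address as one match.
    static boolean isSingleToken(Pattern rule, String address) {
        return rule.matcher(address).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "somename+site@gmail.com",
            "husband&wife@talktalk.net",
            "first.last@example.org" // illustrative address, not from the report
        };
        for (String address : samples) {
            System.out.println(address
                + " original=" + isSingleToken(ORIGINAL, address)
                + " proposed=" + isSingleToken(PROPOSED, address));
        }
    }
}
```

Under these assumptions, the two addresses from the report are rejected by the original rule and accepted by the proposed one, while an address using only "." in the local part is accepted by both.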

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

