lucene-dev mailing list archives

From Erik Hatcher
Subject StandardAnalyzer & e-mail addresses
Date Mon, 06 Oct 2003 16:46:44 GMT
I'm not JavaCC-savvy enough (yet), but there seems to be a flaw in how
StandardTokenizer recognizes e-mail addresses.

If I analyze "", it splits into two tokens: 
"xyz@example" and "com".  Shouldn't this rule:

   // email addresses
| <EMAIL: <ALPHANUM> ("."|"-"|"_" <ALPHANUM>)+ "@" <ALPHANUM> ("."|"-"


Be clever enough to keep the .com with it?  Perhaps some other parsing 
is taking precedence?
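
One thing that can be checked outside JavaCC is how the alternation in the quoted fragment groups. In JavaCC regular expressions, concatenation binds tighter than "|", so ("."|"-"|"_" <ALPHANUM>) reads as "." or "-" or ("_" followed by <ALPHANUM>). The sketch below mimics only the local-part fragment shown above with java.util.regex, and it is not a diagnosis of StandardTokenizer. The class name, the ALPHANUM character class, and the helper are hypothetical stand-ins, and the truncated part of the rule is deliberately left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailRuleSketch {
    // Hypothetical stand-in for the grammar's <ALPHANUM> token.
    static final String ALPHANUM = "[A-Za-z0-9]+";

    // Fragment as quoted: only "_" is followed by ALPHANUM,
    // because "|" binds more loosely than concatenation.
    static final Pattern AS_WRITTEN =
        Pattern.compile(ALPHANUM + "(\\.|-|_" + ALPHANUM + ")+");

    // Same fragment with the separators grouped explicitly,
    // so every separator must be followed by ALPHANUM.
    static final Pattern GROUPED =
        Pattern.compile(ALPHANUM + "((\\.|-|_)" + ALPHANUM + ")+");

    // Returns the prefix of s matched by p, or "" if p does not match at the start.
    static String matchedPrefix(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(matchedPrefix(AS_WRITTEN, "first.last")); // prints "first."
        System.out.println(matchedPrefix(GROUPED, "first.last"));    // prints "first.last"
        System.out.println(matchedPrefix(AS_WRITTEN, "a_b"));        // prints "a_b"
    }
}
```

If the real grammar groups the same way as the quoted fragment, text after a "." or "-" would fall outside the match, which might be related to the split I am seeing, though I have not confirmed that against the generated tokenizer.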

