lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject StandardAnalyzer & e-mail addresses
Date Mon, 06 Oct 2003 16:46:44 GMT
I'm not JavaCC-savvy enough (yet), but it seems there is a flaw in the 
StandardTokenizer and its determination of e-mail addresses.

If I analyze "xyz@example.com", it splits into two tokens: 
"xyz@example" and "com".  Shouldn't this rule:

   // email addresses
| <EMAIL: <ALPHANUM> ("."|"-"|"_" <ALPHANUM>)+ "@" <ALPHANUM> ("."|"-" <ALPHANUM>)+ >

be clever enough to keep the ".com" attached?  Perhaps some other rule 
is taking precedence?
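
One rough way to probe the question is to approximate the EMAIL rule with a plain java.util.regex pattern and see whether it can match the whole address at all. The sketch below is not the StandardTokenizer itself. It assumes <ALPHANUM> is a run of ASCII letters and digits, and it assumes the grouping (("."|"-"|"_") <ALPHANUM>)+ was intended in the rule. The class name, method names, and the "relaxed" variant are hypothetical and exist only for comparison.

```java
import java.util.regex.Pattern;

public class EmailRuleSketch {
    // Assumption: <ALPHANUM> is one or more ASCII letters or digits.
    private static final String ALPHANUM = "[A-Za-z0-9]+";

    // Approximation of the quoted rule, assuming the intended grouping
    // (("."|"-"|"_") <ALPHANUM>)+ before the "@". The "+" means the local
    // part must contain at least one separator followed by ALPHANUM.
    private static final Pattern QUOTED = Pattern.compile(
        ALPHANUM + "(?:[._-]" + ALPHANUM + ")+@"
        + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+");

    // Hypothetical variant for comparison only: "*" instead of "+" in the
    // local part, so a local part without separators is also accepted.
    private static final Pattern RELAXED = Pattern.compile(
        ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@"
        + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+");

    // True if the whole input matches the approximation of the quoted rule.
    static boolean matchesQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    // True if the whole input matches the hypothetical relaxed variant.
    static boolean matchesRelaxed(String s) {
        return RELAXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesQuoted("xyz@example.com"));   // false
        System.out.println(matchesQuoted("x.yz@example.com"));  // true
        System.out.println(matchesRelaxed("xyz@example.com"));  // true
    }
}
```

Under these assumptions the quoted rule cannot match "xyz@example.com" as a whole, because the local part has no ".", "-" or "_" in it. If that holds for the real grammar too, some other token rule would be the one producing "xyz@example", which would fit the output above.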

Thanks,
	Erik



