Hey Otis,
Sure, I would love to! Can you ping me at michael.prichard@mac.com and
let me know what I need to do? Do I just post it to JIRA?
Thanks,
Michael
Otis Gospodnetic wrote:
>A good place for that is JIRA. Could you put it there? We have a bunch of analyzers
>in Lucene's contrib, so if you are okay with putting the Apache license on top of the source code,
>we can include it there. Same for EmailAnalyzer.
>
>Otis
>
>
>----- Original Message ----
>From: Michael J. Prichard <michael_prichard@mac.com>
>To: java-user@lucene.apache.org
>Sent: Sunday, July 30, 2006 1:37:57 PM
>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
>Kewl :)
>
>I updated the Filter....(for anyone interested). Actually..if anyone
>wants I can zip it up and send it to them...let me know.
>
>-------- EmailFilter
>
>import org.apache.lucene.analysis.TokenStream;
>import org.apache.lucene.analysis.TokenFilter;
>import org.apache.lucene.analysis.Token;
>import java.io.IOException;
>import java.util.ArrayList;
>import java.util.Stack;
>
>public class EmailFilter extends TokenFilter {
>    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";
>
>    // parts of the last email token, waiting to be returned by next()
>    private Stack emailTokenStack;
>
>    public EmailFilter(TokenStream in) {
>        super(in);
>        emailTokenStack = new Stack();
>    }
>
>    public Token next() throws IOException {
>        // return any waiting email parts before reading more input
>        if (emailTokenStack.size() > 0) {
>            return (Token) emailTokenStack.pop();
>        }
>
>        Token token = input.next();
>        if (token == null) {
>            return null;
>        }
>
>        addEmailPartsToStack(token);
>
>        return token;
>    }
>
>    private void addEmailPartsToStack(Token token) throws IOException {
>        String[] parts = getEmailParts(token.termText());
>
>        if (parts == null) return;
>
>        for (int i = 0; i < parts.length; i++) {
>            Token synToken = new Token(parts[i],
>                                       token.startOffset(),
>                                       token.endOffset(),
>                                       TOKEN_TYPE_EMAIL);
>            // put the part at the same position as the original token
>            synToken.setPositionIncrement(0);
>
>            emailTokenStack.push(synToken);
>        }
>    }
>
>    /*
>     * Parses an email address into its parts for tokenization.
>     * For example john@foo.com ends up as these tokens (the full
>     * address is the original token, the rest are returned here):
>     *
>     * [john@foo.com]
>     * [john]
>     * [foo.com]
>     * [foo]
>     * [com]
>     *
>     * Returns null if the text has no @ in it, so plain words
>     * do not get a duplicate token.
>     */
>    private String[] getEmailParts(String email) {
>
>        // not an email address, nothing to add
>        if (email.indexOf('@') < 0) {
>            return null;
>        }
>
>        // so i can add them before calling toArray
>        ArrayList partsList = new ArrayList();
>
>        /* let's do it */
>        // split on the @
>        String[] splitOnAt = email.split("@");
>
>        // add the username
>        if (splitOnAt.length > 0 && splitOnAt[0].length() > 0) {
>            partsList.add(splitOnAt[0]);
>        }
>
>        if (splitOnAt.length > 1) {
>            // add the full host name
>            partsList.add(splitOnAt[1]);
>
>            // split the host name into pieces
>            String[] splitOnDot = splitOnAt[1].split("\\.");
>            // add all pieces from splitOnDot (a host with no dot is
>            // already in the list as the full host name)
>            if (splitOnDot.length > 1) {
>                for (int i = 0; i < splitOnDot.length; i++) {
>                    partsList.add(splitOnDot[i]);
>                }
>            }
>
>            /*
>             * if this is greater than 2 then we need to add the
>             * domain name, which should be the last two pieces
>             */
>            if (splitOnDot.length > 2) {
>                String domain = splitOnDot[splitOnDot.length - 2] + "."
>                        + splitOnDot[splitOnDot.length - 1];
>                // add domain
>                partsList.add(domain);
>            }
>        }
>
>        return (String[]) partsList.toArray(new String[0]);
>    }
>
>}
>
>---- end EmailFilter
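
For anyone who wants to try the splitting rule without Lucene on the classpath, here is a minimal standalone sketch of the idea in the comment above. The class name EmailPartsDemo and the method emailParts are made up for illustration and are not part of the filter. It assumes that text without an @ should produce no extra parts, and it only looks at the first @.

```java
import java.util.ArrayList;
import java.util.List;

public class EmailPartsDemo {

    /**
     * Hypothetical helper: returns the extra index terms for an address.
     * The full address itself is not included, since the filter passes the
     * original token through unchanged. Returns an empty list when there
     * is no '@' (an assumption about the desired behavior).
     */
    static List<String> emailParts(String email) {
        List<String> parts = new ArrayList<>();
        int at = email.indexOf('@');
        if (at < 0) {
            return parts; // not an email address
        }

        String user = email.substring(0, at);
        String host = email.substring(at + 1);
        if (!user.isEmpty()) {
            parts.add(user);
        }
        if (host.isEmpty()) {
            return parts;
        }
        parts.add(host);

        // split the host name into its dot-separated pieces
        String[] labels = host.split("\\.");
        if (labels.length > 1) {
            for (String label : labels) {
                parts.add(label);
            }
        }
        // more than two pieces: also add the last two as the domain name
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(emailParts("john@foo.com"));
        System.out.println(emailParts("john@mail.foo.com"));
        System.out.println(emailParts("hello"));
    }
}
```

Note that the filter pushes the parts onto a Stack, so next() hands them back in reverse order. Since they all share the original token's position, that should not matter for searching.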
>
>Otis Gospodnetic wrote:
>
>>No, you're not missing anything. :)
>>That JavaMail API is good for getting the whole email, but you then need to chop it
>>up with your EmailAnalyzer, so you're doing the right thing.
>>
>>Otis
>>
>>----- Original Message ----
>>From: Michael J. Prichard <michael_prichard@mac.com>
>>To: java-user@lucene.apache.org
>>Sent: Saturday, July 29, 2006 2:51:59 PM
>>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>>
>>Hasan Diwan wrote:
>>
>>>Michael:
>>>
>>>On 7/28/06, Michael J. Prichard <michael_prichard@mac.com> wrote:
>>>
>>>>Howdy....not sure if anyone else wants this but here is my first attempt
>>>>at writing an analyzer for an email address...modifications, updates,
>>>>fixes welcome.
>>>>
>>>Why reinvent the wheel? See
>>>http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>>
>>>and use as:
>>>
>>>// far simpler than rewriting it
>>>InternetAddress valid = InternetAddress.parse(string)[0];
>>>
>>I don't see where I can break an email address into simpler pieces for
>>tokens. I use JavaMail when parsing the message and then pull out the
>>email address using InternetAddress. I don't see how I can break an email
>>address like john@foo.com into "john@foo.com", "john", "foo.com", "foo"
>>and "com" without splitting it myself. Am I missing something?
>>
>>Thanks!
>>Michael
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org