lucene-java-user mailing list archives

From "Michael J. Prichard" <michael_prich...@mac.com>
Subject Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
Date Sun, 30 Jul 2006 17:37:57 GMT
Kewl :)

I updated the filter (for anyone interested). If anyone wants, I can
zip it up and send it to them. Let me know.

-------- EmailFilter

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Stack;

public class EmailFilter extends TokenFilter {
    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";

    private Stack emailTokenStack;
   
    public EmailFilter(TokenStream in) {
        super(in);
        emailTokenStack = new Stack();
    }

    public Token next() throws IOException {

        if (emailTokenStack.size() > 0) {
            return (Token) emailTokenStack.pop();
        }   

        Token token = input.next();
        if (token == null) {
            return null;
        }

        addEmailPartsToStack(token);

        return token;
    }
   
    private void addEmailPartsToStack(Token token) throws IOException {
        String[] parts = getEmailParts(token.termText());

        if (parts == null) return;

        for (int i = 0; i < parts.length; i++) {
            Token synToken = new Token(parts[i],
                                 token.startOffset(),
                                 token.endOffset(),
                                 TOKEN_TYPE_EMAIL);
            synToken.setPositionIncrement(0);

            emailTokenStack.push(synToken);
        }
    }

    /*
     * Parses an email address into its parts for tokenization.
     * For example john@foo.com ends up in the index as
     *
     *    [john@foo.com]   (the original token, returned by next())
     *    [john]
     *    [foo.com]
     *    [foo]
     *    [com]
     *
     */
    private String[] getEmailParts(String email) {

        // no @ means this is not an email address, so there are no extra
        // tokens to add (the caller checks for null)
        if (email.indexOf('@') < 0) {
            return null;
        }

        // so i can add them before calling toArray
        ArrayList partsList = new ArrayList();

        /* let's do it */
        // split on the @
        String[] splitOnAt = email.split("@");
        // add the username
        if (splitOnAt.length > 0 && splitOnAt[0].length() > 0) {
            partsList.add(splitOnAt[0]);
        }

        if (splitOnAt.length > 1) {
            // add the full host name
            partsList.add(splitOnAt[1]);

            // split the host name into pieces
            String[] splitOnDot = splitOnAt[1].split("\\.");
            // add all pieces from splitOnDot; a host without a dot is a
            // single piece that was already added above
            if (splitOnDot.length > 1) {
                for (int i = 0; i < splitOnDot.length; i++) {
                    partsList.add(splitOnDot[i]);
                }
            }

            // if there are more than 2 pieces, also add the domain name,
            // which should be the last two
            if (splitOnDot.length > 2) {
                String domain = splitOnDot[splitOnDot.length - 2] + "."
                        + splitOnDot[splitOnDot.length - 1];
                partsList.add(domain);
            }
        }

        return (String[]) partsList.toArray(new String[0]);
    }

}

---- end EmailFilter
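
For anyone who wants to try the splitting rules without a Lucene index,
here is a minimal standalone sketch of the same idea. The class name
EmailPartsDemo and the method emailParts are made up for illustration and
are not part of Lucene or of the filter above. It uses generics, so it
assumes Java 5 or later, and it returns an empty list (rather than null)
when the text contains no @.

```java
import java.util.ArrayList;
import java.util.List;

public class EmailPartsDemo {

    // Hypothetical helper: returns the extra terms for an address, or an
    // empty list when the text does not contain an '@'.
    public static List<String> emailParts(String text) {
        List<String> parts = new ArrayList<String>();
        int at = text.indexOf('@');
        if (at < 0) {
            return parts; // not an address: nothing extra to index
        }

        String user = text.substring(0, at);
        String host = text.substring(at + 1);
        if (user.length() > 0) {
            parts.add(user);
        }
        if (host.length() == 0) {
            return parts;
        }
        parts.add(host);

        // individual host labels; skipped when the host has no dot,
        // because the single label would duplicate the host itself
        String[] labels = host.split("\\.");
        if (labels.length > 1) {
            for (String label : labels) {
                if (label.length() > 0) {
                    parts.add(label);
                }
            }
        }

        // with a sub-domain present, also add the last two labels
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(emailParts("john@foo.com"));
        System.out.println(emailParts("john@mail.foo.com"));
        System.out.println(emailParts("hello"));
    }
}
```

In the filter, each of these parts becomes a token with a position
increment of 0, so it sits at the same position as the full address.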




Otis Gospodnetic wrote:

>No, you're not missing anything. :)
>That JavaMail API is good for getting the whole email, but you then need to
>chop it up with your EmailAnalyzer, so you're doing the right thing.
>
>Otis
>
>----- Original Message ----
>From: Michael J. Prichard <michael_prichard@mac.com>
>To: java-user@lucene.apache.org
>Sent: Saturday, July 29, 2006 2:51:59 PM
>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
>Hasan Diwan wrote:
>
>  
>
>>Michael:
>>
>>On 7/28/06, Michael J. Prichard <michael_prichard@mac.com> wrote:
>>
>>    
>>
>>>Howdy....not sure if anyone else wants this but here is my first attempt
>>>at writing an analyzer for an email address...modifications, updates,
>>>fixes welcome.
>>>      
>>>
>>Why reinvent the wheel? See
>>http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>
>>and use as:
>>
>>// far simpler than rewriting it
>>InternetAddress valid = InternetAddress.parse(string)[0];
>>
>>    
>>
>I don't see how InternetAddress lets me break an email address into
>simpler pieces for tokens. I use JavaMail when parsing the message and
>then pull out the email address using InternetAddress, but I don't see
>how to break an address like john@foo.com into "john@foo.com", "john",
>"foo.com", "foo" and "com" without splitting it myself. Am I missing
>something?
>
>Thanks!
>Michael
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org

