lucene-java-user mailing list archives

From Suba Suresh <su...@wolfram.com>
Subject Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
Date Mon, 31 Jul 2006 13:39:46 GMT
I would like to use the email analyzer code. I am thinking of using it 
along with the JavaMail API. I have two different projects. In one, I have 
to parse sent emails and extract the subject and the email address. 
In the other, I have to parse the emails and index them in Lucene for later 
search and retrieval.

thanks,
suba suresh.
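
For the first project (pulling out address parts without Lucene), the splitting done by the quoted EmailFilter can be tried on its own. What follows is only a minimal sketch in plain Java: the class name EmailParts and its split method are hypothetical and not part of Lucene or JavaMail. It assumes exactly one "@", does no real address validation, and unlike the quoted filter it returns the full address as the first entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene or JavaMail. */
final class EmailParts {

    private EmailParts() {}

    /**
     * Splits an address such as john@foo.com into the full address,
     * the user name, the host name, and the pieces of the host name.
     * Returns an empty list when the input does not look like an address.
     */
    static List<String> split(String email) {
        List<String> parts = new ArrayList<>();
        int at = email.indexOf('@');
        // Assumption: exactly one '@' with text on both sides; no further validation.
        if (at <= 0 || at != email.lastIndexOf('@') || at == email.length() - 1) {
            return parts;
        }
        String user = email.substring(0, at);
        String host = email.substring(at + 1);
        parts.add(email);
        parts.add(user);
        parts.add(host);

        String[] labels = host.split("\\.");
        // Add the individual pieces only when the host has more than one piece,
        // so a bare host such as "localhost" is not listed twice.
        if (labels.length > 1) {
            for (String label : labels) {
                parts.add(label);
            }
        }
        // With more than two pieces, also add the last two as the domain name.
        if (labels.length > 2) {
            parts.add(labels[labels.length - 2] + "." + labels[labels.length - 1]);
        }
        return parts;
    }
}
```

Calling EmailParts.split("john@foo.com") gives the five parts listed in the comment of the quoted filter. A longer host such as mail.foo.com additionally yields foo.com as the domain.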

Michael J. Prichard wrote:
> Kewl :)
> 
> I updated the Filter....(for anyone interested).  Actually..if anyone 
> wants I can zip it up and send it to them...let me know.
> 
> -------- EmailFilter
> 
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.Token;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Stack;
> 
> public class EmailFilter extends TokenFilter {
>    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";
> 
>    private Stack emailTokenStack;
> 
>    public EmailFilter(TokenStream in) {
>        super(in);
>        emailTokenStack = new Stack();
>    }
> 
>    public Token next() throws IOException {
> 
>        if (emailTokenStack.size() > 0) {
>            return (Token) emailTokenStack.pop();
>        }  
>        Token token = input.next();
>        if (token == null) {
>            return null;
>        }
> 
>        addEmailPartsToStack(token);
> 
>        return token;
>    }
> 
>    private void addEmailPartsToStack(Token token) throws IOException {
>        String[] parts = getEmailParts(token.termText());
> 
>        if (parts == null) return;
> 
>        for (int i = 0; i < parts.length; i++) {
>            Token synToken = new Token(parts[i],
>                                 token.startOffset(),
>                                 token.endOffset(),
>                                 TOKEN_TYPE_EMAIL);
>            synToken.setPositionIncrement(0);
> 
>            emailTokenStack.push(synToken);
>        }
>    }
> 
>    /*
>     * Splits an email address into its parts for tokenization.
>     * For example, john@foo.com is indexed as
>     *
>     *    [john@foo.com]
>     *    [john]
>     *    [foo.com]
>     *    [foo]
>     *    [com]
>     */
>    private String[] getEmailParts(String email) {
> 
>        // split on the @ (at sign)
>        String[] splitOnAt = email.split("@");
> 
>        // not an email address: return null so plain words are not
>        // stacked a second time as duplicate tokens
>        if (splitOnAt.length < 2) {
>            return null;
>        }
> 
>        // so i can add them before calling toArray
>        ArrayList partsList = new ArrayList();
> 
>        // add the username
>        partsList.add(splitOnAt[0]);
>        // add the full host name
>        partsList.add(splitOnAt[1]);
> 
>        // split the host name into pieces
>        String[] splitOnDot = splitOnAt[1].split("\\.");
>        // add all pieces from splitOnDot
>        for (int i = 0; i < splitOnDot.length; i++) {
>            partsList.add(splitOnDot[i]);
>        }
> 
>        // if there are more than 2 pieces, also add the domain name,
>        // which should be the last two pieces
>        if (splitOnDot.length > 2) {
>            String domain = splitOnDot[splitOnDot.length - 2] + "."
>                    + splitOnDot[splitOnDot.length - 1];
>            partsList.add(domain);
>        }
> 
>        return (String[]) partsList.toArray(new String[0]);
>    }
> 
> }
> 
> ---- end EmailFilter
> 
> 
> 
> 
> Otis Gospodnetic wrote:
> 
>> No, you're not missing anything. :)
>> That JavaMail API is good for getting the whole email, but you then 
>> need to chop it up with your EmailAnalyzer, so you're doing the right 
>> thing.
>>
>> Otis
>>
>> ----- Original Message ----
>> From: Michael J. Prichard <michael_prichard@mac.com>
>> To: java-user@lucene.apache.org
>> Sent: Saturday, July 29, 2006 2:51:59 PM
>> Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>>
>> Hasan Diwan wrote:
>>
>>  
>>
>>> Michael:
>>>
>>> On 7/28/06, Michael J. Prichard <michael_prichard@mac.com> wrote:
>>>
>>>   
>>>
>>>> Howdy....not sure if anyone else wants this but here is my first 
>>>> attempt
>>>> at writing an analyzer for an email address...modifications, updates,
>>>> fixes welcome.
>>>>     
>>>
>>> Why reinvent the wheel? See
>>> http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>>
>>> and use as:
>>>
>>> // far simpler than rewriting it
>>> InternetAddress valid = InternetAddress.parse(string)[0];
>>>
>>>   
>>
>> I don't see where I can break an email address into simpler pieces for 
>> tokens.  I use JavaMail when parsing the message and then pull the 
>> email address using InternetAddress.  I don't see where I can break an 
>> email address like john@foo.com into "john@foo.com", "john", "foo.com", 
>> "foo" and "com" without splitting it.  Am I missing something?
>>
>> Thanks!
>> Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org

