lucene-java-user mailing list archives

From "Michael J. Prichard" <michael_prich...@mac.com>
Subject Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
Date Mon, 31 Jul 2006 13:27:31 GMT
Hey Otis,

Sure I would love to!  Can you ping me at michael.prichard@mac.com and 
let me know what I need to do?   Do I just post it to JIRA?

Thanks,
Michael

Otis Gospodnetic wrote:

>A good place for that is JIRA.  Could you put it there?  We have a bunch of analyzers
>in Lucene's contrib, so if you are okay with putting the Apache license on top of the source code,
>we can include it there.  Same for EmailAnalyzer.
>
>Otis
>
>
>----- Original Message ----
>From: Michael J. Prichard <michael_prichard@mac.com>
>To: java-user@lucene.apache.org
>Sent: Sunday, July 30, 2006 1:37:57 PM
>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>
>Kewl :)
>
>I updated the Filter....(for anyone interested).  Actually..if anyone 
>wants I can zip it up and send it to them...let me know.
>
>-------- EmailFilter
>
>import org.apache.lucene.analysis.TokenStream;
>import org.apache.lucene.analysis.TokenFilter;
>import org.apache.lucene.analysis.Token;
>import java.io.IOException;
>import java.util.ArrayList;
>import java.util.Stack;
>
>public class EmailFilter extends TokenFilter {
>    public static final String TOKEN_TYPE_EMAIL = "EMAILPART";
>
>    private Stack emailTokenStack;
>   
>    public EmailFilter(TokenStream in) {
>        super(in);
>        emailTokenStack = new Stack();
>    }
>
>    public Token next() throws IOException {
>
>        if (emailTokenStack.size() > 0) {
>            return (Token) emailTokenStack.pop();
>        }   
>
>        Token token = input.next();
>        if (token == null) {
>            return null;
>        }
>
>        addEmailPartsToStack(token);
>
>        return token;
>    }
>   
>    private void addEmailPartsToStack(Token token) throws IOException {
>        String[] parts = getEmailParts(token.termText());
>
>        if (parts == null) return;
>
>        for (int i = 0; i < parts.length; i++) {
>            Token synToken = new Token(parts[i],
>                                 token.startOffset(),
>                                 token.endOffset(),
>                                 TOKEN_TYPE_EMAIL);
>            synToken.setPositionIncrement(0);
>
>            emailTokenStack.push(synToken);
>        }
>    }
>
>    /*
>     * Parses an email address into its parts for tokenization.
>     * For example john@foo.com would be broken into
>     *
>     *    [john@foo.com]
>     *    [john]
>     *    [foo.com]
>     *    [foo]
>     *    [com]
>     *      
>     */
>    private String[] getEmailParts(String email) {
>
>        // split on the @ sign
>        String[] splitOnAt = email.split("@");
>
>        // not an email address: return null so the caller adds no extra tokens
>        if (splitOnAt.length != 2 || splitOnAt[0].length() == 0) {
>            return null;
>        }
>
>        // so I can add them before calling toArray
>        ArrayList partsList = new ArrayList();
>
>        // add the username
>        partsList.add(splitOnAt[0]);
>        // add the full host name
>        partsList.add(splitOnAt[1]);
>
>        // split the host name into pieces
>        String[] splitOnDot = splitOnAt[1].split("\\.");
>        // add the individual pieces, unless the host has no dots
>        // (it was already added above as the full host name)
>        if (splitOnDot.length > 1) {
>            for (int i = 0; i < splitOnDot.length; i++) {
>                partsList.add(splitOnDot[i]);
>            }
>        }
>
>        /*
>         * If there are more than two pieces, also add the domain name,
>         * which should be the last two pieces.
>         */
>        if (splitOnDot.length > 2) {
>            String domain = splitOnDot[splitOnDot.length - 2] + "."
>                    + splitOnDot[splitOnDot.length - 1];
>            // add domain
>            partsList.add(domain);
>        }
>
>        return (String[]) partsList.toArray(new String[0]);
>    }
>
>}
>
>---- end EmailFilter
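
The buffering in next() above (return the original token first, then replay the extra parts from a stack on later calls) can be sketched without Lucene. This is only an illustration under stated assumptions: ReplaySketch and drain are hypothetical names, plain strings stand in for Token objects, and only the user and host parts are buffered to keep it short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token filter; plain strings replace Lucene Token objects. */
final class ReplaySketch {
    private final Iterator<String> input;
    private final Deque<String> pending = new ArrayDeque<>();

    ReplaySketch(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next term, or null at end of input (same contract as next() in the filter). */
    String next() {
        if (!pending.isEmpty()) {
            // Replay buffered parts before reading more input; last pushed comes out first.
            return pending.pop();
        }
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        int at = term.indexOf('@');
        if (at > 0 && at < term.length() - 1) {
            pending.push(term.substring(0, at));   // user part
            pending.push(term.substring(at + 1));  // host part
        }
        // The original term is always returned before its buffered parts.
        return term;
    }

    /** Runs the sketch over a list of terms and collects everything it emits. */
    static List<String> drain(List<String> terms) {
        ReplaySketch filter = new ReplaySketch(terms.iterator());
        List<String> out = new ArrayList<>();
        for (String t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [mail, john@foo.com, foo.com, john, today]
        System.out.println(drain(Arrays.asList("mail", "john@foo.com", "today")));
    }
}
```

Because a stack is last-in, first-out, the parts come back in the reverse of the order they were pushed. In the real filter every extra token has a position increment of 0, so they all sit at the same position as the original token and their relative order should not matter for matching.
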
>
>
>
>
>Otis Gospodnetic wrote:
>
>  
>
>>No, you're not missing anything. :)
>>That JavaMail API is good for getting the whole email, but you then need to chop it
>>up with your EmailAnalyzer, so you're doing the right thing.
>>
>>Otis
>>
>>----- Original Message ----
>>From: Michael J. Prichard <michael_prichard@mac.com>
>>To: java-user@lucene.apache.org
>>Sent: Saturday, July 29, 2006 2:51:59 PM
>>Subject: Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)
>>
>>Hasan Diwan wrote:
>>
>> 
>>
>>    
>>
>>>Michael:
>>>
>>>On 7/28/06, Michael J. Prichard <michael_prichard@mac.com> wrote:
>>>
>>>   
>>>
>>>      
>>>
>>>>Howdy....not sure if anyone else wants this but here is my first attempt
>>>>at writing an analyzer for an email address...modifications, updates,
>>>>fixes welcome.
>>>>     
>>>>
>>>>        
>>>>
>>>Why reinvent the wheel? See
>>>http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#parse(java.lang.String)
>>>
>>>and use as:
>>>
>>>// far simpler than rewriting it
>>>InternetAddress valid = InternetAddress.parse(string)[0];
>>>
>>>   
>>>
>>>      
>>>
>>I don't see where I can break an email address into simpler pieces for
>>tokens.  I use JavaMail when parsing the message and then pull the
>>email address out using InternetAddress.  I don't see where I can break an email
>>address like john@foo.com into "john@foo.com", "john", "foo.com", "foo"
>>and "com" without splitting it.  Am I missing something?
>>
>>Thanks!
>>Michael
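
The pieces listed above for john@foo.com can be produced with plain string operations. A minimal sketch, assuming the input has exactly one '@' with text on both sides (anything else is treated as a non-address and yields an empty list); EmailParts and split are hypothetical names, not part of JavaMail or Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of JavaMail or Lucene. */
final class EmailParts {

    /** Returns the address followed by its pieces, or an empty list for a non-address. */
    static List<String> split(String email) {
        List<String> parts = new ArrayList<>();
        int at = email.indexOf('@');
        // Assumption: exactly one '@', with text before and after it.
        if (at <= 0 || at != email.lastIndexOf('@') || at == email.length() - 1) {
            return parts;
        }
        String user = email.substring(0, at);
        String host = email.substring(at + 1);
        parts.add(email);  // the full address
        parts.add(user);   // text before the '@'
        parts.add(host);   // full host name
        String[] labels = host.split("\\.");
        // Add the dot-separated labels only when the host actually contains a dot,
        // so a bare host such as "localhost" is not listed twice.
        if (labels.length > 1) {
            for (String label : labels) {
                if (!label.isEmpty()) {
                    parts.add(label);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [john@foo.com, john, foo.com, foo, com]
        System.out.println(split("john@foo.com"));
    }
}
```
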
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>>
>>
>> 
>>
>>    
>>
>
>
>
>
>
>
>
>
>  
>



