lucene-java-user mailing list archives

From Charlie Hubbard <charlie.hubb...@gmail.com>
Subject Re: StandardAnalyzer and Email Addresses
Date Sun, 26 Feb 2012 00:51:23 GMT
I am using StandardAnalyzer in 3.1.  I had previously been using 2.4, and
its documentation states that email addresses are recognized:

http://javasourcecode.org/html/open-source/lucene/lucene-2.4.0/org/apache/lucene/analysis/standard/StandardTokenizer.html

According to this doc, it looks like that changed in 3.x:

http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/analysis/standard/ClassicTokenizer.html

I think I've found a workaround: if I search for an email address like:

to:"charlie.hubbard@gmail.com"

Then it will look for the full email address.  What is the drawback of
using the quoted version?  Is the performance worse doing this?  How much
worse?  I'm not sure how quoted searches are implemented, so it's hard for
me to gauge what the drawback is.
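
To make the question concrete, here is a minimal sketch of the general idea
behind an unquoted versus a quoted (phrase) search over a positional index.
It is a toy model and not Lucene's actual implementation: the class
PhraseSketch, its methods, and the tokenizer (which simply splits on '@' and
whitespace) are all hypothetical, chosen only to mirror the behavior
described in this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseSketch {

    // Hypothetical tokenizer (an assumption, not Lucene's rules):
    // lowercases, then splits on '@' and whitespace only.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[@\\s]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Builds a tiny positional index for one field value: term -> positions.
    static Map<String, List<Integer>> index(String fieldValue) {
        Map<String, List<Integer>> postings = new HashMap<>();
        List<String> tokens = tokenize(fieldValue);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // Unquoted search modeled as "any term matches" (an OR of the tokens).
    static boolean matchesAnyTerm(Map<String, List<Integer>> postings, String query) {
        for (String term : tokenize(query)) {
            if (postings.containsKey(term)) {
                return true;
            }
        }
        return false;
    }

    // Quoted search modeled as a phrase: every term must be present AND
    // must appear at consecutive positions, in query order.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String query) {
        List<String> terms = tokenize(query);
        if (terms.isEmpty()) {
            return false;
        }
        List<Integer> starts = postings.get(terms.get(0));
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean allAdjacent = true;
            for (int i = 1; i < terms.size(); i++) {
                List<Integer> positions = postings.get(terms.get(i));
                if (positions == null || !positions.contains(start + i)) {
                    allAdjacent = false;
                    break;
                }
            }
            if (allAdjacent) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> mine = index("charlie.hubbard@gmail.com");
        Map<String, List<Integer>> other = index("someone.else@gmail.com");
        String query = "charlie.hubbard@gmail.com";

        System.out.println(matchesAnyTerm(mine, query));  // true
        System.out.println(matchesAnyTerm(other, query)); // true, via "gmail.com"
        System.out.println(matchesPhrase(mine, query));   // true
        System.out.println(matchesPhrase(other, query));  // false
    }
}
```

In this model the extra work for the quoted form is the position comparison
on top of the plain term lookups; how that translates into real cost would
have to be measured against an actual index.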

Thanks
Charlie

On Mon, Feb 20, 2012 at 12:23 PM, Ian Lea <ian.lea@gmail.com> wrote:

> Are you using StandardAnalyzer in 3.1+?  You may want to use
> ClassicAnalyzer instead.  I can't see where in the 3.5 javadocs it
> says that email addresses are recognized, but it does sound vaguely
> familiar.
>
>
> --
> Ian.
>
>
> On Thu, Feb 16, 2012 at 5:18 PM, Charlie Hubbard
> <charlie.hubbard@gmail.com> wrote:
> > This is a pretty simple question to answer, but I have customers asking me
> > how this is supposed to work and I'm having trouble explaining it.  I have
> > an app that indexes emails, so there are plenty of email addresses in there.
> > Reading the StandardAnalyzer javadoc, it says it "recognizes" email
> > addresses when it is creating the token list.  What tokens will it produce
> > exactly?  What I'm seeing when I perform searches is that the email address
> > looks like it's being tokenized into its parts.  Searching by an email
> > address like:
> >
> > to:charlie.hubbard@gmail.com
> >
> > pulls back extra hits that weren't addressed to
> > charlie.hubbard@gmail.com.  Other messages with gmail.com in them are
> > returned.  If I use the following:
> >
> > to:charlie.hubbard
> >
> > it finds messages with that in them.  It also finds gmail.com, and other
> > domains.  And if I search for strings like
> >
> > to:"charlie.hubbard@gmail.com"
> >
> > it will pull back only emails addressed to that address.  Further proof that
> > it seems to tokenize the parts of an email address is that if I search for
> > a very specific email address like:
> >
> > to:"charlie.hubbard+sometag"
> >
> > That will pull back only emails addressed to that email, but it's not a
> > full email address.  Which leads me to think it will parse parts of the
> > email addresses.  Can someone explain this a little more?
> >
> > I'm having trouble with some emails that can't be pulled back using the
> > username, like searching for to:chubbard where the email was addressed to
> > chubbard@somedomain.com, but it fails to show up in the search results.  I
> > can't explain why that's happening.  In all of my tests I can't reproduce
> > it, and I think I might have to reindex everything because this was an index
> > built with 2.4 and I upgraded to 3.1, so I'm worried it might be corrupted.
> >
> > Thoughts?
>
