lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: StandardAnalyzer and Email Addresses
Date Sun, 26 Feb 2012 08:50:56 GMT
Hi,

If you want a Tokenizer for your Analyzer that detects email addresses, use
UAX29URLEmailTokenizer (see http://goo.gl/evH97). No Analyzer that uses this
Tokenizer is available, but you can define your own: model it on
StandardAnalyzer and use this class as the Tokenizer instead of
StandardTokenizer. I am not sure why no such Analyzer implementation is
available already; maybe Steven Rowe knows more.
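
To illustrate why the choice of Tokenizer matters here, a toy model in plain
Java follows. It is not Lucene code: splitAtSign and keepWhole are hypothetical
helpers that only mimic, on whitespace-separated input, the two behaviours
discussed in this thread (an address broken at the '@' versus an address kept
as a single token).

```java
import java.util.ArrayList;
import java.util.List;

public class EmailTokenDemo {
    // Shared helper: lowercase, split on the given regex, drop empty pieces.
    private static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase().split(regex)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a tokenizer that breaks an address at '@'.
    public static List<String> splitAtSign(String text) {
        return split(text, "[\\s@]+");
    }

    // Hypothetical stand-in for an email-aware tokenizer: whitespace only.
    public static List<String> keepWhole(String text) {
        return split(text, "\\s+");
    }

    public static void main(String[] args) {
        String to = "charlie.hubbard@gmail.com";
        System.out.println(splitAtSign(to)); // [charlie.hubbard, gmail.com]
        System.out.println(keepWhole(to));   // [charlie.hubbard@gmail.com]
    }
}
```

With the first behaviour, a search for the bare address becomes a search for
two independent terms, which would explain hits on other gmail.com messages.
With the second, the address is a single term.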

The quoted-phrase trick performs worse because it uses a PhraseQuery
internally, which is more expensive than a single-term query.
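
A rough sketch of where the extra cost comes from, again as a toy model in
plain Java rather than Lucene's implementation: the names index, termQuery and
phraseQuery are hypothetical. A term lookup only reads the list of documents
for one term, while a phrase lookup must also compare token positions across
terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PhraseVsTermDemo {
    // Builds term -> (docId -> positions of the term in that document).
    public static Map<String, Map<Integer, List<Integer>>> index(List<List<String>> docs) {
        Map<String, Map<Integer, List<Integer>>> idx = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            List<String> tokens = docs.get(d);
            for (int p = 0; p < tokens.size(); p++) {
                idx.computeIfAbsent(tokens.get(p), k -> new TreeMap<>())
                   .computeIfAbsent(d, k -> new ArrayList<>())
                   .add(p);
            }
        }
        return idx;
    }

    // Term lookup: just the documents that contain the term.
    public static Set<Integer> termQuery(Map<String, Map<Integer, List<Integer>>> idx, String term) {
        return new TreeSet<>(idx.getOrDefault(term, Collections.emptyMap()).keySet());
    }

    // Phrase lookup: every term must appear at consecutive positions.
    public static Set<Integer> phraseQuery(Map<String, Map<Integer, List<Integer>>> idx, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> first = idx.getOrDefault(phrase.get(0), Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
            int doc = e.getKey();
            for (int start : e.getValue()) {
                boolean match = true;
                for (int i = 1; i < phrase.size() && match; i++) {
                    List<Integer> pos = idx.getOrDefault(phrase.get(i), Collections.emptyMap()).get(doc);
                    match = pos != null && pos.contains(start + i);
                }
                if (match) {
                    hits.add(doc);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed tokenization: each address already split at the '@'.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("charlie.hubbard", "gmail.com"),
            Arrays.asList("someone.else", "gmail.com"));
        Map<String, Map<Integer, List<Integer>>> idx = index(docs);
        System.out.println(termQuery(idx, "gmail.com"));                                    // [0, 1]
        System.out.println(phraseQuery(idx, Arrays.asList("charlie.hubbard", "gmail.com"))); // [0]
    }
}
```

In this model the phrase lookup returns the right document, but only after
walking the position lists. If the whole address were indexed as one token, a
single term lookup would be enough.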

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Charlie Hubbard [mailto:charlie.hubbard@gmail.com]
> Sent: Sunday, February 26, 2012 1:51 AM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer and Email Addresses
> 
> I am using StandardAnalyzer in 3.1.  I'd been previously using 2.4, and the
> documentation for that version states that email addresses are recognized:
> 
> http://javasourcecode.org/html/open-source/lucene/lucene-2.4.0/org/apache/lucene/analysis/standard/StandardTokenizer.html
> 
> It looks like this was changed in 3.x according to this doc now:
> 
> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/analysis/standard/ClassicTokenizer.html
> 
> I think I've found a workaround: if I search for an email address like:
> 
> to:"charlie.hubbard@gmail.com"
> 
> Then it will look for the full email address.  What is the drawback of using
> the quoted version?  Is the performance worse doing this?  How much worse?
> I'm not sure how quoted searches are implemented, so it's hard for me to
> gauge what the drawback is.
> 
> Thanks
> Charlie
> 
> On Mon, Feb 20, 2012 at 12:23 PM, Ian Lea <ian.lea@gmail.com> wrote:
> 
> > Are you using StandardAnalyzer in 3.1+?  You may want to use
> > ClassicAnalyzer instead.  I can't see where in the 3.5 javadocs it
> > says that email addresses are recognized, but it does sound vaguely
> > familiar.
> >
> >
> > --
> > Ian.
> >
> >
> > On Thu, Feb 16, 2012 at 5:18 PM, Charlie Hubbard
> > <charlie.hubbard@gmail.com> wrote:
> > > This is a pretty simple question to answer, but I have customers asking
> > > me how this is supposed to work and I'm having trouble explaining it.  I
> > > have an app that indexes emails, so there are plenty of email addresses
> > > in there.  Reading the StandardAnalyzer javadoc, it says it "recognizes"
> > > email addresses when it is creating the token list.  What tokens will it
> > > produce exactly?  What I'm seeing when I perform searches is that the
> > > email address looks like it's being tokenized into its parts.  Searching
> > > by an email address like:
> > >
> > > to:charlie.hubbard@gmail.com
> > >
> > > pulls back more hits that haven't been addressed to
> > > charlie.hubbard@gmail.com.  Other messages with gmail.com in them
> > > are returned.  If I use the following:
> > >
> > > to:charlie.hubbard
> > >
> > > it also finds gmail.com and other domains.  And if I search for
> > > strings like
> > >
> > > to:"charlie.hubbard@gmail.com"
> > >
> > > it will pull back only emails addressed to that address.  Further proof
> > > that it seems to tokenize the parts of an email is if I search for a
> > > very specific email address like:
> > >
> > > to:"charlie.hubbard+sometag"
> > >
> > > That will pull back only emails addressed to that email, but it's
> > > not a full email address.  Which leads me to think it will parse
> > > parts of the email addresses.  Can someone explain this a little more?
> > >
> > > I'm having trouble with some emails that can't be pulled back using
> > > the username, like searching for to:chubbard where the email was
> > > addressed to chubbard@somedomain.com, but it fails to show up in the
> > > search results.  I can't explain why that's happening.  In all of my
> > > tests I can't reproduce it, and I think I might have to reindex
> > > everything because this was an index built with 2.4 and I upgraded to
> > > 3.1, so I'm worried it might be corrupted.
> > >
> > > Thoughts?
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >



