lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: StandardAnalyzer and Email Addresses
Date Sun, 26 Feb 2012 14:12:29 GMT
There is no Analyzer implementation because no one ever made one :).  Copy-pasting StandardAnalyzer
and substituting UAX29URLEmailTokenizer wherever StandardTokenizer appears should do the trick.

Because people often want to be able to search against *both* whole email addresses and URLs
*and* their components, a UAX29URLEmailAnalyzer would ideally have filter(s) to emit email/URL
components at the same position as the full term.  Or rather, the reverse: each component
would have its own position, and the full term would be positioned at the head component's
position.
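
A rough way to picture that layout, as a plain-Java sketch rather than real Lucene code. The class, the Token type, and the helper names are all hypothetical, and splitting only on '@' is a simplifying assumption. In Lucene itself this would presumably be a TokenFilter that sets position increments via PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of token positions; NOT Lucene code. All names are hypothetical.
class EmailPositionSketch {

    // A term plus its position increment relative to the previous token.
    // An increment of 0 stacks the token on the previous token's position.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits each component at its own position and the full address at the
    // head component's position. Assumes '@' is the only split point.
    static List<Token> tokenize(String email) {
        List<Token> tokens = new ArrayList<>();
        String[] parts = email.split("@");
        tokens.add(new Token(parts[0], 1));          // head component
        if (parts.length > 1) {
            tokens.add(new Token(email, 0));         // full term, same position as head
        }
        for (int i = 1; i < parts.length; i++) {
            tokens.add(new Token(parts[i], 1));      // remaining components
        }
        return tokens;
    }

    // Converts increments to absolute positions, formatted as "position:term".
    static List<String> describe(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(position + ":" + t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe(tokenize("charlie.hubbard@gmail.com"))) {
            System.out.println(line);
        }
        // prints:
        // 0:charlie.hubbard
        // 0:charlie.hubbard@gmail.com
        // 1:gmail.com
    }
}
```

With a layout like this, a single-term search on the whole address can match without a phrase query, while searches on the individual components still match at their own positions.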

Steve

> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Sunday, February 26, 2012 3:51 AM
> To: java-user@lucene.apache.org
> Subject: RE: StandardAnalyzer and Email Addresses
> 
> Hi,
> 
> If you want a Tokenizer for your Analyzer that supports email detection,
> use UAX29URLEmailTokenizer (see http://goo.gl/evH97). There is no Analyzer
> available that uses this Tokenizer, but you can define your own modeled on
> StandardAnalyzer, with this class as the Tokenizer instead of
> StandardTokenizer.
> I am not sure why no such Analyzer implementation is available yet;
> maybe Steven Rowe knows more.
> 
> The quoted-phrase trick performs worse because it uses a PhraseQuery
> internally, which is more expensive than a single-term query.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Charlie Hubbard [mailto:charlie.hubbard@gmail.com]
> > Sent: Sunday, February 26, 2012 1:51 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer and Email Addresses
> >
> > I am using StandardAnalyzer in 3.1.  I had previously been using 2.4, and
> > its documentation states that email addresses are recognized:
> >
> > http://javasourcecode.org/html/open-source/lucene/lucene-2.4.0/org/apache/lucene/analysis/standard/StandardTokenizer.html
> >
> > According to this doc, it looks like this was changed in 3.x:
> >
> >
> > http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/analysis/standard/ClassicTokenizer.html
> >
> > I think I've found a workaround: if I search for an email address like:
> >
> > to:"charlie.hubbard@gmail.com"
> >
> > Then it will look for the full email address.  What is the drawback of
> > using the quoted version?  Is the performance worse?  How much worse?
> > I'm not sure how quoted searches are implemented, so it's hard for me to
> > gauge what the drawback is.
> >
> > Thanks
> > Charlie
> >
> > On Mon, Feb 20, 2012 at 12:23 PM, Ian Lea <ian.lea@gmail.com> wrote:
> >
> > > Are you using StandardAnalyzer in 3.1+?  You may want to use
> > > ClassicAnalyzer instead.  I can't see where in the 3.5 javadocs it
> > > says that email addresses are recognized, but it does sound vaguely
> > > familiar.
> > >
> > >
> > > --
> > > Ian.
> > >
> > >
> > > On Thu, Feb 16, 2012 at 5:18 PM, Charlie Hubbard
> > > <charlie.hubbard@gmail.com> wrote:
> > > > > This is a pretty simple question to answer, but I have customers
> > > > > asking me how this is supposed to work and I'm having trouble
> > > > > explaining it.  I have an app that indexes emails, so there are
> > > > > plenty of email addresses in there.  Reading the StandardAnalyzer
> > > > > javadoc, it says it "recognizes" email addresses when it is creating
> > > > > the token list.  What tokens will it produce exactly?  What I'm
> > > > > seeing when I perform searches is that the email address looks like
> > > > > it's being tokenized into its parts.  Searching by an email address
> > > > > like:
> > > >
> > > > to:charlie.hubbard@gmail.com
> > > >
> > > > > pulls back extra hits that weren't addressed to
> > > > > charlie.hubbard@gmail.com.  Other messages with gmail.com in them
> > > > > are returned.  If I use the following:
> > > > >
> > > > > to:charlie.hubbard
> > > > >
> > > > > it also finds gmail.com and other domains.  And if I search for
> > > > > strings like
> > > >
> > > > to:"charlie.hubbard@gmail.com"
> > > >
> > > > > it will pull back only emails addressed to that address.  Further
> > > > > proof that it seems to tokenize the parts of an email is if I search
> > > > > for a very specific email address like:
> > > >
> > > > to:"charlie.hubbard+sometag"
> > > >
> > > > > That will pull back only emails addressed to that email, but it's
> > > > > not a full email address.  Which leads me to think it will parse
> > > > > parts of the email addresses.  Can someone explain this a little
> > > > > more?
> > > >
> > > > > I'm having trouble with some emails that can't be pulled back using
> > > > > the username, like searching for to:chubbard where the email was
> > > > > addressed to chubbard@somedomain.com, but it fails to show up in the
> > > > > search results.  I can't explain why that's happening.  In all of my
> > > > > tests I can't reproduce it, and I think I might have to reindex
> > > > > everything because this was an index built with 2.4 and I upgraded
> > > > > to 3.1, so I'm worried it might be corrupted.
> > > >
> > > > Thoughts?
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> 
> 
