lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: StandardAnalyzer functionality change
Date Wed, 24 Oct 2012 21:26:30 GMT
Small correction: UAX29URLEmailAnalyzer = StandardAnalyzer + URL + Email. (Full support for
URLs with the file:, ftp:, and http/s: protocols; full email support.)

ClassicAnalyzer is a different beast altogether.  First of all, it doesn't implement Unicode
text segmentation (UAX#29) - it has a non-standard tokenizer that works okay for some English
text.  It recognizes some (maybe most?) email addresses, but not all of them (e.g. the '+'
character, which is valid in the username part of an email address, is not supported).  It does
not recognize URLs, only domain names, aka hostnames.

Steve
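
To make the split reported in the original question (quoted below) concrete, here is a toy
sketch in plain Java. It is not Lucene's tokenizer and does not use any Lucene API; the class
name WordBreakSketch and the method tokenize are hypothetical, made up for illustration. It
assumes a heavily simplified reading of two UAX#29 word-break ideas: a '.' between two letters
(or between two digits) does not end a word, while characters such as '@' and '+' do, and
tokens with no letters or digits are dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer: a simplified illustration only, not Lucene's StandardTokenizer grammar. */
final class WordBreakSketch {

    /**
     * Collects runs of letters and digits. A '.' is kept inside a token only when it
     * sits between two letters or between two digits; every other character
     * (including '@' and '+') ends the current token and is discarded.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean joiningDot = false;
            if (c == '.' && current.length() > 0 && i + 1 < text.length()) {
                // current is non-empty, so the previous character was a letter or digit.
                char prev = text.charAt(i - 1);
                char next = text.charAt(i + 1);
                joiningDot = (Character.isLetter(prev) && Character.isLetter(next))
                        || (Character.isDigit(prev) && Character.isDigit(next));
            }
            if (Character.isLetterOrDigit(c) || joiningDot) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character closes the token in progress.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("user@domain.com"));
        System.out.println(tokenize("first.last+tag@example.org"));
    }
}
```

Under these assumptions 'user@domain.com' comes out as two tokens, 'user' and 'domain.com',
which matches what was reported for StandardAnalyzer(Version.LUCENE_36). Keeping a whole
address such as 'first.last+tag@example.org' as one token needs a grammar with explicit email
rules, which is the part UAX29URLEmailAnalyzer adds.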

On Oct 24, 2012, at 3:52 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> I didn't explicitly say it, but ClassicAnalyzer does do exactly what you want it to do
> - word break plus email and URL, or StandardAnalyzer plus email and URL.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: kiwi clive
> Sent: Wednesday, October 24, 2012 1:27 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer functionality change
> 
> Thanks for the responses chaps, very informative, and most appreciated :-)
> 
> ________________________________
> From: Ian Lea <ian.lea@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, October 24, 2012 4:19 PM
> Subject: Re: StandardAnalyzer functionality change
> 
> If you want email addresses, UAX29URLEmailAnalyzer is another alternative.
> 
> 
> --
> Ian.
> 
> 
> On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky <jack@basetechnology.com> wrote:
>> Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
>> technical term is "Unicode text segmentation"), period. As the javadoc says,
>> "As of Lucene version 3.1, this class implements the Word Break rules from
>> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
>> Annex #29." That is a "standard".
>> 
>> See:
>> http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
>> http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: kiwi clive
>> Sent: Wednesday, October 24, 2012 6:42 AM
>> To: java-user@lucene.apache.org
>> Subject: StandardAnalyzer functionality change
>> 
>> 
>> Hi all,
>> 
>> Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0
>> and I see StandardAnalyzer has changed its behaviour, particularly when
>> tokenizing email addresses. From reading the forums, I understand
>> StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?
>> 
>> 
>> If I pass the string 'user@domain.com' through these analyzers, I get the
>> following tokens:
>> 
>> Using StandardAnalyzer(Version.LUCENE_23):  -->  user@domain.com   (one token)
>> Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com   (two tokens)
>> Using ClassicAnalyzer(Version.LUCENE_36):   -->  user@domain.com   (one token)
>> 
>> StandardAnalyzer is normally a good compromise as a default analyzer, but the
>> failure to keep an email address intact makes it less fit for purpose than
>> it used to be. Is this a bug or is it by design? If by design, what is the
>> reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?
>> 
>> Thanks,
>> Clive
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 

