lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: StandardAnalyzer functionality change
Date Wed, 24 Oct 2012 20:06:03 GMT
s/work break/word break/

-- Jack Krupansky

-----Original Message----- 
From: Jack Krupansky
Sent: Wednesday, October 24, 2012 3:52 PM
To: java-user@lucene.apache.org ; kiwi clive
Subject: Re: StandardAnalyzer functionality change

I didn't explicitly say it, but ClassicAnalyzer does do exactly what you
want it to do - work break plus email and URL, or StandardAnalyzer plus
email and URL.

-- Jack Krupansky

-----Original Message----- 
From: kiwi clive
Sent: Wednesday, October 24, 2012 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: StandardAnalyzer functionality change

Thanks for the responses chaps, very informative, and most appreciated :-)





________________________________
From: Ian Lea <ian.lea@gmail.com>
To: java-user@lucene.apache.org
Sent: Wednesday, October 24, 2012 4:19 PM
Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative.


--
Ian.


On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky <jack@basetechnology.com>
wrote:
> Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
> technical term is "Unicode text segmentation"), period. As the javadoc says,
> "As of Lucene version 3.1, this class implements the Word Break rules from
> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
> Annex #29." That is a "standard".
>
> See:
> http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
> http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: kiwi clive
> Sent: Wednesday, October 24, 2012 6:42 AM
> To: java-user@lucene.apache.org
> Subject: StandardAnalyzer functionality change
>
>
> Hi all,
>
> Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0
> and I see StandardAnalyzer has changed its behaviour, particularly when
> tokenizing email addresses. From reading the forums, I understand
> StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?
>
>
> If I pass the string 'user@domain.com' through these analyzers, I get the
> following tokens:
>
> Using StandardAnalyzer(Version.LUCENE_23):  -->  user@domain.com    (one token)
> Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com    (two tokens)
> Using ClassicAnalyzer(Version.LUCENE_36):   -->  user@domain.com    (one token)
>
> StandardAnalyzer is normally a good compromise as a default analyzer, but the
> failure to keep an email address intact makes it less fit for purpose than
> it used to be. Is this a bug or is it by design? If by design, what is the
> reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?
>
> Thanks,
> Clive
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
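
The difference between the token lists in the original question can be mimicked with a toy model that uses only the JDK. This is not Lucene code and not how either analyzer is implemented: the class name EmailTokenDemo, its methods, and both regular expressions are made up for illustration. The first pattern is a rough stand-in for the word-break behaviour described above ('@' separates tokens, a '.' between letters does not), and the second tries a simplified e-mail pattern before falling back to plain words, which is roughly what the e-mail-aware analyzers mentioned in the replies do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: a toy model only, not Lucene's tokenizer logic.
class EmailTokenDemo {

    // Toy "word break" pattern: runs of letters/digits, optionally joined
    // by a single '.' or apostrophe. '@' is never part of a token.
    private static final Pattern WORD =
        Pattern.compile("[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*");

    // Toy e-mail-aware pattern: try a simplified e-mail address first,
    // then fall back to the plain word pattern.
    private static final Pattern EMAIL_OR_WORD = Pattern.compile(
        "[\\p{L}\\p{N}._+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+"
        + "|" + WORD.pattern());

    // Collect every match of the pattern, in order, as a token.
    private static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    static List<String> wordBreakTokens(String text) {
        return tokens(WORD, text);
    }

    static List<String> emailAwareTokens(String text) {
        return tokens(EMAIL_OR_WORD, text);
    }

    public static void main(String[] args) {
        String text = "user@domain.com";
        System.out.println(wordBreakTokens(text));   // prints [user, domain.com]
        System.out.println(emailAwareTokens(text));  // prints [user@domain.com]
    }
}
```

In a real index the choice is made by picking the analyzer (StandardAnalyzer, ClassicAnalyzer or UAX29URLEmailAnalyzer), and the same analyzer should be used at index time and at query time so that the tokens match.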
