lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: StandardAnalyzer functionality change
Date Wed, 24 Oct 2012 14:56:31 GMT
Yes, by design. StandardAnalyzer implements "simple word boundaries" (the 
technical term is "Unicode text segmentation"), period. As the javadoc says, 
"As of Lucene version 3.1, this class implements the Word Break rules from 
the Unicode Text Segmentation algorithm, as specified in Unicode Standard 
Annex #29." That is a "standard".


-- Jack Krupansky
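
To make the "simple word boundaries" point concrete, here is a minimal sketch that uses only the JDK's java.text.BreakIterator. This is an assumption made for illustration: the JDK's word rules are similar in spirit to UAX #29 but are not Lucene's StandardTokenizer, so the exact tokens may differ from what Lucene produces. The class name, the segment helper, and the address user@example.com are all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {

    // Splits text at the word boundaries reported by the JDK's BreakIterator
    // and drops segments that consist only of whitespace.
    // Note: this is the JDK's segmenter, not Lucene's StandardTokenizer.
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical address, used only for illustration.
        System.out.println(segment("user@example.com"));
    }
}
```

A segmenter that works purely from word-boundary rules has no notion of an email address as a unit, which is why the address is not guaranteed to survive as a single token. As the original poster observed, ClassicAnalyzer keeps the address whole. Lucene 3.x also ships a UAX29URLEmailTokenizer in the same package, which may be worth checking for this use case.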

-----Original Message----- 
From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
Subject: StandardAnalyzer functionality change

Hi all,

Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 
and I see StandardAnalyzer has changed its behaviour, particularly when 
tokenizing email addresses. From reading the forums, I understand 
StandardAnalyzer was renamed to ClassicAnalyzer. Is this the case?

If I pass the string '' through these analyzers, I get the 
following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  --> (one token)

Using StandardAnalyzer(Version.LUCENE_36):  -->  user    (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36):   -->  (one token)

StandardAnalyzer is normally a good compromise as a default analyzer, but the 
failure to keep an email address intact makes it less fit for purpose than 
it used to be. Is this a bug or is it by design? If by design, what is the 
reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?

