lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Date Wed, 01 Oct 2014 05:01:54 GMT
Hi Paul,

StandardTokenizer implements the Word Boundaries rules from Unicode Standard Annex #29
(UAX#29), Unicode Text Segmentation. Here’s the relevant section for Unicode 6.1.0, which is
the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.

Only those sequences between boundaries that contain letters and/or digits are returned as
tokens; all other sequences between boundaries are skipped over and not returned as tokens.
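
To illustrate the rule above, here is a minimal, self-contained sketch. It does not use Lucene:
it assumes the JDK's java.text.BreakIterator as a stand-in for the word-boundary detection, and
the class name WordBoundarySketch and method tokens() are hypothetical names chosen for this
example. BreakIterator's boundaries can differ in detail from StandardTokenizer's JFlex grammar,
so treat this as an approximation of the filtering step only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
final class WordBoundarySketch {

    // Returns the segments between word boundaries that contain at least
    // one letter or digit. Every other segment (punctuation, whitespace)
    // is skipped, mirroring the rule described in the message.
    public static List<String> tokens(String text) {
        // Assumption: the JDK word BreakIterator stands in for UAX#29 boundaries.
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);

        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
                end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("!!!"));                // prints []
        System.out.println(tokens("Hello, world!!! 42")); // prints [Hello, world, 42]
    }
}
```

Under this rule the input "!!!" yields no tokens at all, which is consistent with the first
incrementToken() call returning false in the test quoted below.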

Steve

On Sep 30, 2014, at 3:54 PM, Paul Taylor <paul_t100@fastmail.fm> wrote:

> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> 
> I'm just trying to move back to StandardTokenizer from my own old custom implementation
> because the newer version seems to have much better support for Asian languages.
> 
> However, this code excerpt fails on incrementToken(), implying that the !!! are removed
> from the output, yet looking at the JFlex classes I can't see anything to indicate that
> punctuation is removed. Is it removed, and if so, can I prevent that?
> 
> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
> assertNotNull(tokenizer);
> tokenizer.reset();
> assertTrue(tokenizer.incrementToken());
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

