lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ahmet Arslan <>
Subject Re: Strange behaviour of StandardTokenizer
Date Thu, 17 Jun 2010 13:50:13 GMT

> I ran into a strange behaviour of the StandardTokenizer.
> Terms containing a '-' are tokenized differently depending
> on the context. 
> For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> The term 'nl-lt0' is tokenized into 'nl-lt0'.
> Is this a bug or a feature? 

It is designed that way. TypeAttribute of those tokens are different.

> Can I avoid it somehow?

Do you want to split at '-' char no matter what? If yes, you can replace all '-' characters
with whitespace using MappingCharFilter before StandardTokenizer. 


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message