lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anna Hunecke <annahune...@yahoo.de>
Subject Re: Strange behaviour of StandardTokenizer
Date Fri, 18 Jun 2010 07:03:35 GMT
Hi Ahmet,
thanks for the explanation. :)
okay, so it is recognized as a number? I didn't expect that really. I expect that all words
are either split at the minus or not.
Maybe I'll have to use another tokenizer.
Best,
Anna

--- Ahmet Arslan <iorixxx@yahoo.com> schrieb am Do, 17.6.2010:

> Von: Ahmet Arslan <iorixxx@yahoo.com>
> Betreff: Re: Strange behaviour of StandardTokenizer
> An: java-user@lucene.apache.org
> Datum: Donnerstag, 17. Juni, 2010 15:50 Uhr
> 
> > I ran into a strange behaviour of the
> StandardTokenizer.
> > Terms containing a '-' are tokenized differently
> depending
> > on the context. 
> > For example, the term 'nl-lt' is split into 'nl' and
> 'lt'.
> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
> > Is this a bug or a feature? 
> 
> It is designed that way. TypeAttribute of those tokens are
> different.
> 
> > Can I avoid it somehow?
> 
> Do you want to split at '-' char no matter what? If yes,
> you can replace all '-' characters with whitespace using
> MappingCharFilter before StandardTokenizer. 
> 
> 
>       
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message