lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Strange behaviour of StandardTokenizer
Date Fri, 18 Jun 2010 07:52:27 GMT
Hi Anna,

what are you using your tokenizer for? There are a lot of different
options in Lucene, and StandardTokenizer is not necessarily the best
one. The behaviour you are seeing is that the tokenizer detects your token
as a number. When you look at the grammar, that is kind of obvious.

<snip>
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

// punctuation
P          = ("_"|"-"|"/"|"."|",")

</snip>
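
To see why 'nl-lt0' comes out as one token while 'nl-lt' is split, here is a rough sketch of the NUM rule as a plain Java regex. It is not the JFlex scanner itself. The definitions of ALPHANUM (letters and digits) and HAS_DIGIT (letters and digits with at least one digit) are my assumptions, since they are not part of the snippet above, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class NumRuleSketch {
    // Assumed definitions, not part of the quoted grammar snippet:
    // ALPHANUM  = one or more letters or digits
    // HAS_DIGIT = letters/digits containing at least one digit
    static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    static final String HAS_DIGIT = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";
    // Punctuation, taken from the P rule in the snippet
    static final String P = "[_\\-/.,]";

    // The six NUM alternatives, in the same order as in the grammar
    static final Pattern NUM = Pattern.compile(
        "(?:" + ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + ")");

    // True if the whole term matches the NUM sketch
    static boolean looksLikeNum(String term) {
        return NUM.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"nl-lt", "nl-lt0"}) {
            System.out.println(term + " -> " + looksLikeNum(term));
        }
    }
}
```

With these assumptions, 'nl-lt0' matches the first alternative (ALPHANUM, P, HAS_DIGIT) because 'lt0' contains a digit. 'nl-lt' matches none of them, since neither segment has a digit, so the '-' acts as a separator there.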

You can either build your own custom filter that fixes only the
problem with numbers containing a '-', use the MappingCharFilter, or
switch to a different tokenizer.
If you could tell us more about your use case, you might get better suggestions.
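
To illustrate the char-filter idea without depending on a particular Lucene version, here is a minimal stand-in using only java.io. It is not MappingCharFilter, just a hypothetical Reader wrapper (all names are mine) that turns every '-' into a space before the text would reach a tokenizer.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class DashToSpaceSketch {
    // Hypothetical stand-in for a char filter: maps '-' to ' ' on the fly
    static class DashToSpaceReader extends FilterReader {
        DashToSpaceReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return c == '-' ? ' ' : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, in which case the loop does not run
            for (int i = off; i < off + n; i++) {
                if (buf[i] == '-') {
                    buf[i] = ' ';
                }
            }
            return n;
        }
    }

    // Reads the filtered text back into a String, for demonstration only
    static String filter(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new DashToSpaceReader(new StringReader(text))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("nl-lt nl-lt0"));
    }
}
```

Because the replacement is one character for one character, the positions in the filtered text line up with the original input. In a real setup the filtered reader would be handed to the tokenizer instead of being read back into a String.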

Simon

On Fri, Jun 18, 2010 at 9:03 AM, Anna Hunecke <annahunecke@yahoo.de> wrote:
> Hi Ahmet,
> thanks for the explanation. :)
> okay, so it is recognized as a number? I didn't expect that really. I expect that all
> words are either split at the minus or not.
> Maybe I'll have to use another tokenizer.
> Best,
> Anna
>
> --- Ahmet Arslan <iorixxx@yahoo.com> wrote on Thu, 17 Jun 2010:
>
>> From: Ahmet Arslan <iorixxx@yahoo.com>
>> Subject: Re: Strange behaviour of StandardTokenizer
>> To: java-user@lucene.apache.org
>> Date: Thursday, 17 June 2010, 15:50
>>
>> > I ran into a strange behaviour of the StandardTokenizer.
>> > Terms containing a '-' are tokenized differently depending
>> > on the context.
>> > For example, the term 'nl-lt' is split into 'nl' and 'lt'.
>> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
>> > Is this a bug or a feature?
>>
>> It is designed that way. The TypeAttribute values of those tokens are
>> different.
>>
>> > Can I avoid it somehow?
>>
>> Do you want to split at '-' char no matter what? If yes,
>> you can replace all '-' characters with whitespace using
>> MappingCharFilter before StandardTokenizer.
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
