lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Morus Walter <morus.wal...@tanto.de>
Subject Re: jaspq: dashed numerical values tokenized differently
Date Tue, 02 Nov 2004 08:21:15 GMT
Daniel Taurat writes:
> Hi,
> I have just another stupid parser question:
> There seems to be a special handling of the dash sign "-" different from
> Lucene 1.2 at least in Lucene 1.4.RC3
> StandardAnalyzer.
> 
> Examples (1.4RC3):
> 
> A document containing the string "dash-test" is matched by the following
> search expressions:
> dash
> test
> dash*
> dash-test
> It is _not_ matched by the following search expressions:
> dash-*
> dash-t*
> 
> If the string after the dash consists of digits, the behavior is
> different.
> E.g., a document containing the string "dash-123" is matched by:
> dash*
> dash-*
> dash-123
> It is not matched by:
> dash
> 123
> 
> Question:
> Is this, esp. the different behavior when parsing digits and characters,
> intentional and how can it be explained?
> Regards,
> 
Query parser was changed to treat '-' within words as part of the word.
Before that change a query 'dash-test' was parsed as 'dash AND NOT test'.
Now QP reads one word 'dash-test' which is analyzed. If the analyzer
splits that to more than one token (standard analyzer does) a phrase
query is created.
The difference you see comes from standard analyzer which tokenizes
dash-test dash-123 to tokens dash, test and dash-123.
Prefix queries aren't analyzed.

Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message