lucene-java-user mailing list archives

From "Daniel Taurat" <daniel.tau...@gaussvip.com>
Subject RE: jaspq: dashed numerical values tokenized differently
Date Wed, 03 Nov 2004 15:21:01 GMT


> Give me an example of a string and how you'd like it to be tokenized.
> But first, give the AnalyzerUtils (from my java.net article) a try and
> get a feel for what different analyzers do.
> 
> Keep in mind that it can be tricky (see the AnalysisParalysis page on
> the wiki and my java.net article on QueryParser) to make sense out of
> a combination of QueryParser and an Analyzer - so it's best to work
> with them independently to get what you want and then put things
> together.

I already used Luke. This is what I found (and it even makes sense to me :)))

The string dash-123-01 was tokenized by the 1.2 StandardAnalyzer as
dash
123
01

and in 1.4RC4 it is tokenized by every analyzer other than
RussianAnalyzer, SimpleAnalyzer and StopAnalyzer (which just produced
dash and dropped all the numbers) as

dash-123-01

On the other hand

dash-my-string

is tokenized as

dash
my
string

by all of them except WhitespaceAnalyzer, of course.

I guess this is what happens: a numerical component turns the preceding
dash into a minus sign. The dash then becomes part of the token that
contains the digits and is no longer a separator. This holds even for
mixed terms like 123a-01. So -1andAnyOtherCharacters-evenWithDashes is a
non-separable numerical expression for Lucene.
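
To make the guess above concrete, here is a minimal sketch in plain Java.
It does not use Lucene and is not taken from its tokenizer grammar; it is
only a toy model of the behaviour observed in Luke. The class name
DashTokenSketch and the rule "a dash stays inside the token when the
segment on either side of it contains a digit" are my own assumptions,
and lowercasing and stop words are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the observed behaviour; NOT Lucene code.
public class DashTokenSketch {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Moves the pending token (if any) into the result list.
    static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Anything that is not a letter, digit or dash ends a token.
        // This includes the backslash of an escaped dash.
        for (String chunk : text.split("[^\\p{L}\\p{N}-]+")) {
            StringBuilder current = new StringBuilder();
            String previous = null;
            for (String segment : chunk.split("-")) {
                if (segment.isEmpty()) {
                    // Leading or doubled dash: treat it as a plain break.
                    flush(current, tokens);
                    previous = null;
                    continue;
                }
                if (previous != null && (hasDigit(previous) || hasDigit(segment))) {
                    // Assumed rule: a digit next to the dash keeps the dash in the token.
                    current.append('-').append(segment);
                } else {
                    // No digit on either side: the dash is a separator.
                    flush(current, tokens);
                    current.append(segment);
                }
                previous = segment;
            }
            flush(current, tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("dash-123-01"));       // [dash-123-01]
        System.out.println(tokenize("dash-my-string"));    // [dash, my, string]
        System.out.println(tokenize("dash\\-123\\-01"));   // [dash, 123, 01]
    }
}
```

Under these assumptions the sketch reproduces the 1.4RC4 results reported
in this message for StandardAnalyzer, including the escaped case: the
backslash ends the token before the dash is seen, so each dash stands
alone and is dropped.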

I then checked with Luke on the escaped string

dash\-123\-01

and got

dash
123
01

with GermanAnalyzer and StandardAnalyzer,
and

dash

with all the others, except for WhitespaceAnalyzer, of course.


This makes me think that an escaped dash is never a minus, somehow.

Daniel




