lucene-java-user mailing list archives

From "Pierrick Brihaye" <pierrick.brih...@wanadoo.fr>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 16:40:47 GMT
Hi,

>There'll be at least two indexes, one "normal" one and another bloated one.
>Dan suggested splitting, too, but, unfortunately, if users search for e.g.
>
>"9-Oxabicyclo[3.3.1]nona-2,6-diene"
>
>they don't want anything else than that substance, as opposed to
>
>"*-Oxabicyclo[3.3.1]nona*"
>
>where they'd be interested in substances from that family - whatever the
>numbers are.

As Erik says, it seems that you basically want a PhraseQuery.

At first glance you should tokenize on "-" to have the "words" of your
"phrase".

Then, when you have a number, e.g. "9", "[3.3.1]", "2,6", you could output
"dummy tokens" that would take *one* position, say "NUMBER_ALONE",
"NUMBER_BETWEEN_BRACKETS", "COMMA_SEPARATED_NUMBERS"... or just "NUMBER".

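A rough sketch of that idea in plain Java, without any Lucene classes. The
class name ChemNameTokens and the regular expression are my own assumptions,
not something from Lucene; a real implementation would live in a custom
Analyzer. Hyphens act as separators, and each kind of number is replaced by
one placeholder token that takes a single position:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class ChemNameTokens {
    // One alternative per kind of piece; "-" matches none of them,
    // so it is skipped and acts as a separator.
    private static final Pattern PIECE = Pattern.compile(
          "(\\[\\d+(?:\\.\\d+)*\\])"  // group 1: numbers in brackets, e.g. [3.3.1]
        + "|(\\d+(?:,\\d+)+)"         // group 2: comma-separated numbers, e.g. 2,6
        + "|(\\d+)"                   // group 3: a number alone, e.g. 9
        + "|([A-Za-z]+)");            // group 4: a run of letters, i.e. a "word"

    // Returns one token per position; numbers become dummy tokens.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PIECE.matcher(name);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add("NUMBER_BETWEEN_BRACKETS");
            } else if (m.group(2) != null) {
                tokens.add("COMMA_SEPARATED_NUMBERS");
            } else if (m.group(3) != null) {
                tokens.add("NUMBER_ALONE");
            } else {
                tokens.add(m.group(4).toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this sketch, ChemNameTokens.tokenize("9-Oxabicyclo[3.3.1]nona-2,6-diene")
gives the same token sequence for every member of the family, whatever the
numbers are, so this would only suit the "bloated" index; exact searches
would still have to go to the "normal" one.
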
When you process:

"hexahydronaphthalene", tokenize it as: "hexa hydro naphthal ene".
"heptahydronaphthalin", tokenize it as: "hepta hydro naphthal in".
The query "hydronaphthal", run through the same tokenizer to give "hydro
naphthal", will then match both as a PhraseQuery.
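
To make the matching concrete, here is a small sketch in plain Java, again
without Lucene. The morpheme list, the greedy longest-match splitting and the
names MorphemePhrase, split and phraseMatches are all assumptions of mine for
illustration; the contiguous sub-list test only imitates what a PhraseQuery
with no slop would do on token positions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
class MorphemePhrase {
    // Tiny made-up morpheme dictionary, just enough for the two examples.
    private static final List<String> MORPHEMES =
        Arrays.asList("hexa", "hepta", "hydro", "naphthal", "ene", "in");

    // Greedy split: at each step take the longest morpheme that prefixes the rest.
    static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        String rest = word.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            String best = "";
            for (String m : MORPHEMES) {
                if (rest.startsWith(m) && m.length() > best.length()) {
                    best = m;
                }
            }
            if (best.isEmpty()) {
                // No known morpheme here: keep the remainder as one token.
                tokens.add(rest);
                break;
            }
            tokens.add(best);
            rest = rest.substring(best.length());
        }
        return tokens;
    }

    // True if the query tokens occur as consecutive tokens of the document.
    static boolean phraseMatches(String document, String query) {
        return Collections.indexOfSubList(split(document), split(query)) >= 0;
    }
}
```

Here phraseMatches("hexahydronaphthalene", "hydronaphthal") and
phraseMatches("heptahydronaphthalin", "hydronaphthal") are both true, because
"hydro" and "naphthal" sit at consecutive positions in each name.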

Sorry, my knowledge of organic chemistry is a distant memory :-) and I wonder
whether these considerations are over-simplistic. Maybe your "same position"
approach (solved by a "token stack" algorithm) is the least bad one.

An exciting analysis problem, however.

Cheers,

p.b.


