lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: token type question
Date Sat, 16 Apr 2005 21:16:03 GMT

On 16 Apr 2005, at 08:31, Pierrick Brihaye wrote:
>> How do I search all the tokens with "chem" type token, such as H2O, 
>> O2, etc? Any sample like this? If this approach doesn't work, what's 
>> the best approach?

Nifty question... I'm working on indexing text with math formulae, 
so there may be similarities!

> You may assign a type to the tokens, and then you may filter them 
> according to their type *but* the index forgets this info since it 
> stores *terms* (field/value pairs). [...]
> 1) use a dedicated field "chem" where only chemical content is allowed 
> (filter out every token whose type is different from "chem")
> 2) manipulate your termText : "chem_H2" ; the same for your queries
> 3) play with the query rather than with the index content : filter out 
> what is not chemical

So it really seems that chem_H2 is the only choice, or is it?
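
A minimal sketch of option 2, in plain Java and deliberately without any 
Lucene API: the token type is folded into the term text, so the type 
survives in an index that only stores terms. The class TypedTermIndex, 
its methods, and the "type_text" naming scheme are all hypothetical, made 
up for this illustration; a real setup would do the same rewriting in the 
analysis chain and in the query.

```java
import java.util.*;

// Toy in-memory inverted index (an assumption for illustration, not Lucene).
class TypedTermIndex {
    // Sorted map: typed term -> ids of the documents that contain it.
    private final TreeMap<String, Set<Integer>> postings = new TreeMap<>();

    /** Hypothetical helper: fold the token type into the term text. */
    static String typedTerm(String type, String text) {
        return type + "_" + text;
    }

    /** Index one token of the given type for a document. */
    void add(int docId, String type, String text) {
        postings.computeIfAbsent(typedTerm(type, text), k -> new TreeSet<>()).add(docId);
    }

    /** Exact lookup; the query term is rewritten the same way as at index time. */
    Set<Integer> search(String type, String text) {
        return postings.getOrDefault(typedTerm(type, text), Collections.emptySet());
    }

    /** All documents containing any token of this type (a prefix scan). */
    Set<Integer> searchType(String type) {
        String prefix = type + "_";
        Set<Integer> hits = new TreeSet<>();
        // Keys sharing a prefix are contiguous in sorted order.
        for (Map.Entry<String, Set<Integer>> e : postings.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            hits.addAll(e.getValue());
        }
        return hits;
    }
}
```

With this scheme, "all tokens of type chem" becomes a prefix query on 
"chem_", and a plain word that happens to read H2O does not match a 
search for the chemical token, as long as every token gets a type prefix.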

What are your requirements or expectations?
- match a formula in the middle of a sentence?
- or simply match documents that contain both the sentence's words and 
the formula? (In the latter case, I think solution 1 is valid.)
- how would you do wildcards with formulae?

A related question, at least for me, is how to match a+(b+1) when the 
query is X+Y, i.e. a subtree cut.
Does this occur in chemical formulae as well?
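
To make the subtree question concrete, here is a rough sketch of what 
such a match could mean on expression trees held in memory. It says 
nothing about how to do this inside an index, which is the open part. 
The class FormulaMatch, the Node type, and the convention that a 
single upper-case leaf in the query is a variable are assumptions 
made for this example.

```java
import java.util.*;

class FormulaMatch {
    /** Expression node: an operator with children, or a leaf with none. */
    static final class Node {
        final String label;
        final List<Node> kids;
        Node(String label, Node... kids) {
            this.label = label;
            this.kids = Arrays.asList(kids);
        }
        @Override public boolean equals(Object o) {
            return o instanceof Node
                && label.equals(((Node) o).label)
                && kids.equals(((Node) o).kids);
        }
        @Override public int hashCode() { return Objects.hash(label, kids); }
    }

    /** Assumption: a single upper-case letter leaf in the query is a variable. */
    static boolean isVariable(Node n) {
        return n.kids.isEmpty() && n.label.length() == 1
            && Character.isUpperCase(n.label.charAt(0));
    }

    /** Match the query against exactly this node, binding variables to subtrees. */
    static boolean matchAt(Node query, Node target, Map<String, Node> bindings) {
        if (isVariable(query)) {
            // A variable matches any subtree, but the same variable must
            // always stand for the same subtree.
            Node bound = bindings.putIfAbsent(query.label, target);
            return bound == null || bound.equals(target);
        }
        if (!query.label.equals(target.label) || query.kids.size() != target.kids.size()) {
            return false;
        }
        for (int i = 0; i < query.kids.size(); i++) {
            if (!matchAt(query.kids.get(i), target.kids.get(i), bindings)) return false;
        }
        return true;
    }

    /** True if the query matches the target itself or any of its subtrees. */
    static boolean matchesAnywhere(Node query, Node target) {
        if (matchAt(query, target, new HashMap<>())) return true;
        for (Node kid : target.kids) {
            if (matchesAnywhere(query, kid)) return true;
        }
        return false;
    }
}
```

Here X+Y matches a+(b+1) at the root (X bound to a, Y bound to the 
whole subtree b+1) while X+X does not, since a and b+1 differ.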

