lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Can use Lucene be used for this
Date Thu, 13 Nov 2003 11:09:25 GMT
On Thursday, November 13, 2003, at 03:22  AM, Hackl, Rene wrote:
> documents contain very long strings for chemical substances, users are
> interested in certain parts of the string e.g. find all documents that
> comprise "*foo*" be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene").

So you're saying you want users to be able to search for "of-13" and 
match that second one?  Users really are demanding that?

> Suggestions on improvements are always welcome! :-)

It seems like some very clever tokenization during analysis is what 
you're after.  If you tokenized by dash (yes, you mentioned in the next 
message it is more complex than that, but just for this example let's 
simplify it to that), then the first document would have "1", "foo", 
"bar", and the second would have "rab", "oof", "13", "foonyl", and 
"naphthalene".

A PrefixQuery (not even a WildcardQuery) for "foo" would find both.
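
To make the simplified example concrete, here is a minimal plain-Java 
sketch. It does not use the Lucene API: DashTokenDemo, tokenize and 
matchesPrefix are hypothetical names that only imitate what a 
dash-splitting analyzer followed by a prefix match would do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not a Lucene Analyzer or PrefixQuery.
final class DashTokenDemo {

    // Split a substance name on dashes, dropping empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("-")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True if at least one token starts with the given prefix,
    // which is the condition a prefix match on tokens checks.
    static boolean matchesPrefix(String text, String prefix) {
        for (String token : tokenize(text)) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String first = "1-foo-bar";
        String second = "rab-oof-13-foonyl-naphthalene";
        System.out.println(tokenize(first));
        System.out.println(tokenize(second));
        System.out.println(matchesPrefix(first, "foo"));
        System.out.println(matchesPrefix(second, "foo"));
        // A query that spans the dash cannot match any single token.
        System.out.println(matchesPrefix(second, "of-13"));
    }
}
```

Under this tokenization "foo" matches both documents, while "of-13" 
matches neither, because no single token contains the dash.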

Now suppose the users want to search for "oo" and find both documents.  
First, I'd probably argue that this doesn't really make sense given the 
domain.

But keep in mind that WildcardQuery itself does support "*oo*" and it 
would work as expected (although it can be slow on a huge index, since 
a leading wildcard means every term has to be examined).  If you want 
QueryParser to support a leading wildcard character, you would have to 
customize it yourself.
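
The performance caveat can be illustrated with a sorted set standing in 
for the index's term list. This is only an analogy in plain Java, not 
Lucene's implementation; TermScanDemo, prefixTerms and infixTerms are 
hypothetical names. A prefix lookup can jump to the first candidate and 
stop early, while a "*oo*" style lookup has to visit every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; a TreeSet stands in for a sorted term list.
final class TermScanDemo {

    // Prefix lookup: seek to the first term >= prefix and stop as soon
    // as a term no longer starts with it.
    static List<String> prefixTerms(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading-wildcard lookup: no starting point is known, so every
    // term in the set is checked.
    static List<String> infixTerms(TreeSet<String> terms, String infix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(infix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "1", "foo", "bar", "rab", "oof", "13", "foonyl", "naphthalene"));
        System.out.println(prefixTerms(terms, "foo"));
        System.out.println(infixTerms(terms, "oo"));
    }
}
```

With only eight terms the difference is invisible; it matters when the 
term list is very large.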

Another, perhaps ridiculous, alternative is to index every substring of 
each piece as a token too: "f", "fo", "foo", "foon", "foony", "foonyl", 
"o", "oo", "oon"... and so on.

	Erik
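
The last alternative in the message, indexing every substring of each 
piece, could be sketched as follows. This is plain Java rather than a 
Lucene TokenFilter, and SubstringTokens and allSubstrings are 
hypothetical names chosen for the illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not a Lucene TokenFilter.
final class SubstringTokens {

    // Return every distinct non-empty substring of one piece, in the
    // order they are first produced (by start position, then length).
    static Set<String> allSubstrings(String piece) {
        Set<String> out = new LinkedHashSet<>();
        for (int start = 0; start < piece.length(); start++) {
            for (int end = start + 1; end <= piece.length(); end++) {
                out.add(piece.substring(start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("foo"));
        System.out.println(allSubstrings("foonyl").size());
    }
}
```

A piece of length n yields up to n(n+1)/2 tokens, so the index grows 
quickly for long names. In exchange, a search for "oo" becomes an exact 
term lookup instead of a wildcard scan.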



