lucy-user mailing list archives

From "Desilets, Alain" <Alain.Desil...@nrc-cnrc.gc.ca>
Subject RE: [lucy-user] Can lucy do substring search?
Date Thu, 02 Feb 2012 13:40:57 GMT
Thx Peter. In my case, the fields on which I need to do wild-card searches are fields that
specify the URL of a document. I want to be able to use this to limit the search to documents
which are on specific web sites.

It seems the best balance in that case, between accuracy and speed, would be to tokenize on
non-word characters. Then I could retrieve a superset of docs on, say, www.somewhere.org by
searching for "www.somewhere.org" (with a QueryParser). This might accidentally retrieve docs
whose URLs contain www/somewhere/org (for example), but I would do a second pass to filter out
the docs whose URLs do not match the actual expression www.somewhere.org. I would need to do
this second pass anyway, even if I were using a wildcard search, because I might accidentally
match a URL that has www.somewhere.org in a part other than the host name (ex: http://www.aplace.com/www.somewhere.org.html).
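
A rough Python sketch of that two-pass idea, for illustration only. It is not Lucy code: candidate_match() merely stands in for what a phrase search over a field tokenized on non-word characters would return, and the function names and sample URLs are made up.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split on runs of non-word characters and lowercase each token."""
    return [t.lower() for t in re.split(r"\W+", text) if t]

def candidate_match(url, host):
    """Stand-in for the index lookup (hypothetical): true when the host's
    tokens appear consecutively among the URL's tokens."""
    url_toks, host_toks = tokenize(url), tokenize(host)
    n = len(host_toks)
    return any(url_toks[i:i + n] == host_toks
               for i in range(len(url_toks) - n + 1))

def on_host(url, host):
    """Second pass: keep the URL only if its host component equals `host`."""
    return (urlsplit(url).hostname or "").lower() == host.lower()

urls = [
    "http://www.somewhere.org/docs/index.html",
    "http://example.com/www/somewhere/org/page.html",
    "http://www.aplace.com/www.somewhere.org.html",
]

# First pass retrieves a superset; second pass drops the false positives.
candidates = [u for u in urls if candidate_match(u, "www.somewhere.org")]
hits = [u for u in candidates if on_host(u, "www.somewhere.org")]
```

All three sample URLs survive the first pass; only the first one survives the host comparison.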

Alain

-----Original Message-----
From: Peter Karman [mailto:peter@peknet.com] 
Sent: Wednesday, February 01, 2012 9:23 PM
Cc: 'lucy-user@incubator.apache.org'
Subject: Re: [lucy-user] Can lucy do substring search?

Desilets, Alain wrote on 2/1/12 10:15 AM:
> Thx Peter. Would this incur the same performance problem as tokenizing the string on
> a character by character basis?

WildcardQuery is slower than a TermQuery. It's all at search time though,
whereas tokenizing the string on a character basis happens at index time and
search time.

Your use case will incur a performance hit no matter what. In my apps, I
tokenize substrings for only particular fields at index time, and do some term
expansion instead of wildcards using a custom lexicon at search time. IME, it's
about finding a balance in your architecture to best fit your actual use cases.
Accuracy vs. speed is one balance to find. The use case you described (finding
all docs with a field matching a particular hostname) could be accomplished with
no change in indexing or tokenizing, if you used the WildcardQuery; whether that
proves too slow depends on your requirements. Try it and see.
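
To make the term-expansion idea concrete, here is a minimal Python sketch. It is an assumption about the general shape of the technique, not the actual code from the apps mentioned above and not a Lucy API: the lexicon is a plain set of indexed terms and expand() is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def expand(pattern, lexicon):
    """Return the lexicon terms matching a shell-style pattern, sorted.

    The expanded terms can then be searched as ordinary exact terms
    (OR'ed together) instead of running a wildcard query.
    """
    return sorted(t for t in lexicon if fnmatchcase(t, pattern))

# Hypothetical lexicon: the distinct terms indexed for one field.
lexicon = {"somewhere", "somewhat", "elsewhere", "org", "www"}

expanded = expand("some*", lexicon)
```

The cost moves from scanning at query time to keeping the lexicon available, which is one way of trading speed against accuracy.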

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
