lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: SubstringQuery -- Re: Leading Wild Card Search
Date Tue, 17 Feb 2004 18:16:19 GMT
David Spencer wrote:
> 2 files attached, SubstringQuery (which you'll use) and 
> SubstringTermEnum ( used by the former to be
> consistent w/ other Query code).
> I find this kind of query useful to have and think that the query parser 
> should allow it in spite of the perception
> of this being slow, however I think the debate is the "user centric 
> view" (say mine, allow substring queries)
> vs the "protect the engines performance" view which says not to allow 
> expensive queries.

I think the argument is more complex.

One issue is cost of execution: very slow queries can be used to 
implement a denial-of-service attack.  Maybe that's an overstatement, 
but in a web server setting, once a few of slow searches are running, no 
others may complete.  When folks hit "Stop" in their browser the server 
does not stop processing the query.  If they hit "Reload" then another 
new search is started.  So these can be very problematic.  This is real. 
  Lots of folks have deployed Lucene with large indexes and then found 
that their server randomly crashes.  Closer scrutiny shows that they 
were permitting operators that are too slow for their combination of 
index size and query traffic.  The BooleanQuery.TooManyClauses exception 
was added to address this, but it can still be too late, if the problem 
is caused before the query is built, e.g., while enumerating all terms.

A releated issue is that users (and even most developers) don't 
understand the relative costs of different query operators.  Some things 
are fast, others are surprisingly slow.  That's not a great user 
experience, and triggers problems like those described above.  People 
think that the rare slow cases are network problems or something, and 
hit "Reload".

I have no problem with including slow operators with Lucene, but they 
should be well documented as such, at least for developers.  Perhaps we 
should make a pass through the existing Query classes, in particular 
those which expand into other queries, and add some performance notes, 
so that folks don't blindly start using things which may bite them.  By 
default I think it would be safest if the QueryParser only permitted 
operators which are efficient.  Folks can then, at their own risk, 
enable other operators.

In summary, removing operators can be user-centric, if it removes 
unpredictablity.  And the reason for protecting engine performance is 
not miserly, it's to guarantee availablility.  And finally, an issue 
dear to me, a predicatble search engine results in fewer spurious bug 
reports, saving developer time for real bugs.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message