lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Preventing "killer" queries
Date Tue, 07 Feb 2006 18:28:24 GMT
I've just been doing some benchmarking on a reasonably
large-scale system (38 million docs) and ran into an
issue where certain *very* common terms would
dramatically slow query responses. 
Some terms were abnormally common because I had
constructed the index by taking several copies of a
small sample and merging them. Address data from this
small sample area had the county name repeated massively.
Consequently a TermQuery for the county name (with 50%
docFreq) in the scaled-up 38m doc index took 2 seconds
to return, whereas most "normal" terms (<10% df) took a
matter of milliseconds.

Of course the solution for most situations is to use a
stop-word list at index time, but that requires some
manual configuration and prior knowledge of the data,
which isn't always ideal.

For these outlier situations, is it worth adding a
"maxDf" property to TermQuery, like BooleanQuery's
maxClauseCount query-time control? I could fix my problem
in my own app-specific query construction code, but I
wonder if others would find it a useful fix to add to
TermQuery in the Lucene core?
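
A minimal sketch of what such an app-specific check might look like. The
class name MaxDfGuard and the ratio-based threshold are assumptions made
for illustration only; nothing like this exists in Lucene. The sketch
takes plain numbers so that it stands alone. In a real application the
two inputs would presumably come from IndexReader.docFreq(Term) and
IndexReader.numDocs().

```java
/**
 * Hypothetical helper (not part of Lucene): decides whether a term is too
 * common to be worth running as a query, based on its document frequency.
 */
final class MaxDfGuard {
    // Largest allowed fraction of documents containing the term,
    // e.g. 0.10 means "at most 10% of all docs".
    private final double maxDfRatio;

    MaxDfGuard(double maxDfRatio) {
        if (maxDfRatio <= 0.0 || maxDfRatio > 1.0) {
            throw new IllegalArgumentException("maxDfRatio must be in (0, 1]");
        }
        this.maxDfRatio = maxDfRatio;
    }

    /**
     * @param docFreq number of documents containing the term
     * @param numDocs total number of documents in the index
     * @return true if the term's share of documents exceeds the threshold
     */
    boolean isTooCommon(long docFreq, long numDocs) {
        if (numDocs <= 0) {
            return false; // empty index: nothing can be "too common"
        }
        return (double) docFreq / numDocs > maxDfRatio;
    }
}
```

With a 10% threshold, the county name above (roughly 19 million of 38
million docs) would be flagged, and the query construction code could
then drop the clause or reject the query. What to do with a flagged term
is left open here, as it is in the question.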


Cheers,
Mark






		