lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Preventing "killer" queries
Date Wed, 08 Feb 2006 12:58:51 GMT
Thanks for the comments, Chris/Doug.

Chris, although I suggested it initially, I'm now a
little uncomfortable in controlling this issue with a
static variable in TermQuery because it doesn't let me
have different settings for different queries, indexes
or fields.
Doug, I'd ideally like to optimize for this condition
in advance rather than get into trouble and throw
exceptions to blow out queries.

I like to think of the ideal solution as a control
which automatically identifies and tunes out what it
sees as stop words but is controllable on a per index,
per field and per query basis, if need be.

The analyzer seemed a reasonably flexible way to do
this.
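
To make the idea concrete, here is a rough sketch of how such a control might pick its stop words from document frequencies, with a cutoff that can differ per field. It is plain Java and not Lucene code: the class CommonTermStopWords, the select method and the cutoff values are all made up for illustration, and the df figures are assumed to have been collected from the index beforehand. An analyzer could then be handed the resulting set for each field.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: picks per-field stop words
// from document frequencies gathered elsewhere.
final class CommonTermStopWords {

    // For each field, returns the terms whose df / numDocs exceeds that
    // field's cutoff. Fields with no entry in fieldCutoffs use defaultCutoff.
    static Map<String, Set<String>> select(Map<String, Map<String, Integer>> docFreqs,
                                           int numDocs,
                                           Map<String, Double> fieldCutoffs,
                                           double defaultCutoff) {
        Map<String, Set<String>> stopWords = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : docFreqs.entrySet()) {
            double cutoff = fieldCutoffs.getOrDefault(field.getKey(), defaultCutoff);
            Set<String> stops = new HashSet<>();
            for (Map.Entry<String, Integer> term : field.getValue().entrySet()) {
                if ((double) term.getValue() / numDocs > cutoff) {
                    stops.add(term.getKey());
                }
            }
            stopWords.put(field.getKey(), stops);
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up df values for a 1 million document index.
        Map<String, Integer> body = new HashMap<>();
        body.put("the", 900_000);
        body.put("lucene", 12_000);
        Map<String, Integer> doctype = new HashMap<>();
        doctype.put("email", 500_000);

        Map<String, Map<String, Integer>> docFreqs = new HashMap<>();
        docFreqs.put("body", body);
        docFreqs.put("doctype", doctype);

        // A cutoff of 1.0 can never be exceeded, so "doctype" keeps all its terms.
        Map<String, Double> cutoffs = new HashMap<>();
        cutoffs.put("doctype", 1.0);

        System.out.println(select(docFreqs, 1_000_000, cutoffs, 0.4));
    }
}
```

A per-query override would amount to passing a different cutoff map for that query.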

I tried looking at performance of Filter vs Query on a
1 million document index as per Chris's suggestion and found
that RangeFilter.bits() does improve on
search.search(TermQuery) and that this improvement was
a constant factor as df increases. The filter.bits()
call typically took 60% of the equivalent TermQuery
search time for a range of tested DFs. However, both
filter and query response times increase in a linear
fashion with increases in df so I suspect they are
both ultimately heading for trouble as data volumes
increase - just that TermQuery gets there sooner than
filter.
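
One way to picture why both costs would scale with df is the simplified model below. It is not Lucene code and makes a loud assumption: that both paths visit every posting for the term, with the filter only setting a bit per document while the query also computes a score. The class PostingsCostSketch, its method names and the placeholder scoring formula are hypothetical.

```java
import java.util.BitSet;

// Simplified model, not Lucene code: a term's postings are an array of doc ids.
final class PostingsCostSketch {

    // Filter-style: one pass over the postings, setting a bit per matching doc.
    static BitSet filterBits(int[] postings, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : postings) {
            bits.set(doc);
        }
        return bits;
    }

    // Query-style: the same pass, plus a score computed for each matching doc.
    static float[] queryScores(int[] postings, int[] freqs, int maxDoc, float idf) {
        float[] scores = new float[maxDoc];
        for (int i = 0; i < postings.length; i++) {
            // Placeholder scoring, assumed for illustration only.
            scores[postings[i]] = (float) Math.sqrt(freqs[i]) * idf;
        }
        return scores;
    }

    public static void main(String[] args) {
        int[] postings = {1, 3, 4};   // docs containing the term, so df = 3
        int[] freqs = {1, 4, 9};      // term frequency in each of those docs
        System.out.println(filterBits(postings, 6).cardinality());
        System.out.println(queryScores(postings, freqs, 6, 2.0f)[3]);
    }
}
```

In this model both loops run df times, so the extra per-document work in the query path would appear as a constant factor rather than a different growth rate.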
I'd rather head this problem off sooner by
stop-wording very common terms in large indexes using
the analyzer. Obviously this wouldn't catch
Range/Fuzzy queries which expand at rewrite time but
at large levels of data you have to manage those types
of query carefully anyway.

I did come across a bizarre anomaly I would be
interested to have explained. A RangeFilter based on a
single term with 50% df responds in the same time as a
RangeFilter on a different field for a term with the
same df.
When it comes to TermQueries though, not all fields are
equal. Using a TermQuery on a "free text" field with
many values for a single term with 50% df takes half
the time of a TermQuery on a constrained field
("doctype") for a single term with similar df. The
doctype field only ever has one of 6 possible values. 
Both queries run against the same index and have similar df
values. The relative performance difference was the
same for other DFs I tested across the 2 fields.
What is going on here? If anything, I might have
expected the open-ended field to be slower.

Cheers,
Mark


	
	
		