lucene-java-user mailing list archives

From David Causse <no...@laposte.net>
Subject TimeLimitingCollector accuracy
Date Wed, 21 Dec 2016 12:27:51 GMT
Hi,

This subject has been discussed in the past, but I don't think any 
real solution has been implemented yet.

Here is a small test case to illustrate the problem: 
https://github.com/nomoa/lucene-solr/commit/2f025b18899038c8606da64c2cf9f4e1f643607f#diff-65ae49ceb38e45a3fc05115be5e61a2dR387

This test will print:

Time waited on a slow query that matches all docs: 1109
Time waited on a slow query that matches no docs: 137258

The problem is that the time check is "passive": it only runs when the 
collector is called. On large segments, if the query is slow and matches 
no documents, the timeout is therefore very inaccurate, which makes it 
nearly impossible to tune the client timeout against the collector 
timeout.
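
To make the "passive" behaviour concrete, here is a minimal sketch that 
does not use Lucene at all. It assumes a simulated clock where examining 
one candidate document costs one tick, and the deadline is only compared 
against the clock when a document is actually collected. The class name 
PassiveTimeoutDemo and the method scan are hypothetical and exist only 
for this illustration.

```java
import java.util.function.IntPredicate;

// Hypothetical, Lucene-free simulation of a timeout that is only checked on collect.
final class PassiveTimeoutDemo {

    /**
     * Scans maxDoc candidates. Each candidate costs one simulated tick
     * (the slow match check). The deadline is compared against the clock
     * only when a document matches and is collected.
     *
     * @return the number of ticks spent before the scan stopped
     */
    static long scan(int maxDoc, IntPredicate matches, long timeoutTicks) {
        long now = 0;
        final long deadline = timeoutTicks;
        for (int doc = 0; doc < maxDoc; doc++) {
            now++; // cost of evaluating the slow query on this candidate
            if (matches.test(doc)) {
                // "collect(doc)": the only place where the clock is consulted
                if (now > deadline) {
                    break; // stands in for the collector's timeout exception
                }
            }
        }
        return now;
    }
}
```

With a timeout of 10 ticks over 1000 candidates, scan(1000, d -> true, 10) 
stops after 11 ticks, while scan(1000, d -> false, 10) runs for all 1000 
ticks because the check is never reached.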

This happens to me with a query that implements a TwoPhaseIterator 
whose approximation can be really bad, if not completely wrong 
(regex search on stored content, with an approximation based on extracted 
tri-grams).

Another problem I discovered is that if the query is accepted by the 
QueryCache, the cache eagerly populates its bitset, bypassing the 
Collector.

Reading 
https://www.mail-archive.com/java-dev@lucene.apache.org/msg25694.html, I 
see that one suggested solution was to move the timeout check to a lower 
level (into the scorers), but this raised some concerns about checking 
the timeout too frequently.

But given that some work has been done to separate sub-scorers from 
"top-level" scorers (see 
https://issues.apache.org/jira/browse/LUCENE-5487), would it make sense 
now to make BulkScorers aware of time constraints?

On my side, as a workaround to prevent catastrophes, I'll probably 
continue to implement a circuit breaker in my TwoPhaseIterator#matches 
that stops the costly operation, either by returning false or by 
throwing an exception.
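
A minimal sketch of such a circuit breaker, written without any Lucene 
dependency so it can be run on its own. The class MatchCircuitBreaker, 
its constructor parameters and its methods are hypothetical names for 
this illustration. In a real TwoPhaseIterator the guard would sit at the 
top of matches(). To limit the cost of the check itself, the clock is 
read only once every checkEvery calls, which is one possible answer to 
the concern about checking the timeout too frequently.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical helper: trips once a time budget is spent, reading the clock sparingly.
final class MatchCircuitBreaker {
    private final LongSupplier nanoClock; // e.g. System::nanoTime in production
    private final long startNanos;
    private final long budgetNanos;
    private final int checkEvery;
    private int untilCheck;
    private boolean tripped;

    MatchCircuitBreaker(LongSupplier nanoClock, long budgetNanos, int checkEvery) {
        if (checkEvery < 1) {
            throw new IllegalArgumentException("checkEvery must be >= 1");
        }
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
        this.budgetNanos = budgetNanos;
        this.checkEvery = checkEvery;
        this.untilCheck = checkEvery;
    }

    /** Returns true once the budget is exhausted; stays true afterwards. */
    boolean isTripped() {
        if (!tripped && --untilCheck <= 0) {
            untilCheck = checkEvery;
            // Subtracting before comparing keeps this safe if nanoTime wraps around.
            tripped = nanoClock.getAsLong() - startNanos >= budgetNanos;
        }
        return tripped;
    }

    /**
     * "Return false" variant: skips the costly check once tripped.
     * A throwing variant would raise an unchecked exception here instead.
     */
    boolean matches(BooleanSupplier costlyCheck) {
        return !isTripped() && costlyCheck.getAsBoolean();
    }
}
```

The trade-off is that the breaker can overshoot the budget by up to 
checkEvery costly calls, so checkEvery should stay small when a single 
match check is very expensive.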

Lastly, I think it would help me work around this problem if the 
constructor of TimeExceededException were public. Is there any reason 
for this constructor to be private? Would it break important workflows 
if a scorer started throwing this exception? It would allow me to still 
return partial results.

Thanks for your help



