lucene-dev mailing list archives

From "Steve Davids (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-5986) Don't allow runaway queries from harming Solr cluster health or search performance
Date Mon, 16 Jun 2014 22:10:02 GMT

     [ https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Davids updated SOLR-5986:
-------------------------------

    Description: 
The intent of this ticket is to have all distributed search requests stop wasting CPU cycles
on requests that have already timed out or are so complicated that they won't be able to execute.
We have come across a case where a nasty wildcard query within a proximity clause was causing
the cluster to enumerate terms for hours even though the query timeout was set to minutes.
This caused a noticeable slowdown within the system, which made us restart the replicas that
happened to service that one request. In the worst case, users with a relatively low zk
timeout value will have nodes start dropping from the cluster due to long GC pauses.

[~amccurry] built a mechanism into Apache Blur to help with the issue in BLUR-142 (see the commit
comment for code, though look at the latest code on the trunk for newer bug fixes).

Solr should be able to either prevent these problematic queries from running by some heuristic
(possibly estimated size of heap usage) or be able to execute a thread interrupt on all query
threads once the time threshold is met. This issue mirrors what others have discussed on the
mailing list: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3C856ac15f0903272054q2dbdbd19kea3c5ba9e105b9d8@mail.gmail.com%3E
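
To make the thread-interrupt option above concrete, here is a minimal sketch in plain Java.
It is not Solr or Blur code: the class TimeBoundedQuery, the enumerateTerms stand-in, and
the timeout values are all hypothetical and chosen only for illustration. It assumes the
long-running work checks the interrupt flag cooperatively, which real term enumeration
would need to do for an interrupt to have any effect.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoundedQuery {

    /** Thrown by the hypothetical query loop when it notices the interrupt flag. */
    public static class QueryAbortedException extends RuntimeException {
        public QueryAbortedException(String message) {
            super(message);
        }
    }

    /** Stand-in for expensive term enumeration; checks the interrupt flag on every step. */
    public static long enumerateTerms(long steps) {
        long visited = 0;
        for (long i = 0; i < steps; i++) {
            if (Thread.currentThread().isInterrupted()) {
                throw new QueryAbortedException("aborted after " + visited + " terms");
            }
            visited++;
        }
        return visited;
    }

    /** Runs the task and interrupts its worker thread if the time budget is exceeded. */
    public static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // true = interrupt the thread running the task
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // A cheap "query" finishes well inside its budget.
            System.out.println(runWithTimeout(pool, () -> enumerateTerms(1_000L), 1_000L));
            // A runaway "query" is cut off once the budget is spent.
            runWithTimeout(pool, () -> enumerateTerms(Long.MAX_VALUE), 50L);
        } catch (TimeoutException e) {
            System.out.println("query timed out and was interrupted");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The important design point is that Future.cancel(true) only delivers an interrupt; the work
stops promptly only because the loop polls the flag. Code that never checks it would keep
burning CPU after the caller has given up, which is the behavior this ticket describes.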

  was:
The intent of this ticket is to have all distributed search requests stop wasting CPU cycles
on requests that have already timed out. We have come across a case where a nasty wildcard
query within a proximity clause was causing the cluster to enumerate terms for hours even
though the query timeout was set to minutes. This caused a noticeable slowdown within the
system which made us restart the replicas that happened to service that one request.

[~amccurry] Built a mechanism into Apache Blur to help with the issue in BLUR-142 (see commit
comment for code, though look at the latest code on the trunk for newer bug fixes).

Ideally Solr will distribute the timeout request parameter and automatically interrupt all
query threads once the threshold is met. This issue mirrors what others have discussed on
the mailing list: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3C856ac15f0903272054q2dbdbd19kea3c5ba9e105b9d8@mail.gmail.com%3E

        Summary: Don't allow runaway queries from harming Solr cluster health or search performance
 (was: When a query times out all distributed searches shouldn't continue on until completion)

As a follow-up, we are still experiencing this issue, and it is becoming more and more frequent.
Upon further research, it looks like this is a somewhat common problem that afflicts various
Lucene community members. As noted in the description, Apache Blur has implemented a mechanism
for coping, but more recently Elasticsearch has also implemented its own solution, which performs
an up-front query heap estimation and will pull the "circuit breaker" if the estimate exceeds a
threshold, thus not allowing the query to crash the cluster.

Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html#fielddata-circuit-breaker
Ticket: https://github.com/elasticsearch/elasticsearch/issues/2929 & https://github.com/elasticsearch/elasticsearch/pull/4261
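
The general shape of such a circuit breaker can be sketched as follows. This is not the
Elasticsearch implementation and not a Solr API: the class HeapCircuitBreaker, the
estimateBytes helper, and the per-term cost and limit are hypothetical, chosen only to
illustrate the idea of rejecting a request up front when its estimated heap cost would
push the total over a limit.

```java
import java.util.concurrent.atomic.AtomicLong;

public class HeapCircuitBreaker {

    /** Signals that a request was rejected before doing any work. */
    public static class CircuitBreakingException extends RuntimeException {
        public CircuitBreakingException(String message) {
            super(message);
        }
    }

    private final long limitBytes;
    private final AtomicLong reservedBytes = new AtomicLong();

    public HeapCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves the estimate, or throws and leaves the total unchanged if the limit would be exceeded. */
    public void reserve(long estimatedBytes) {
        long updated = reservedBytes.addAndGet(estimatedBytes);
        if (updated > limitBytes) {
            reservedBytes.addAndGet(-estimatedBytes); // roll back the failed reservation
            throw new CircuitBreakingException(
                "estimated " + estimatedBytes + " bytes would exceed limit of " + limitBytes);
        }
    }

    /** Returns a previously reserved estimate once the request has finished. */
    public void release(long estimatedBytes) {
        reservedBytes.addAndGet(-estimatedBytes);
    }

    public long reserved() {
        return reservedBytes.get();
    }

    /** Hypothetical estimator: expanded term count times an assumed per-term cost. */
    public static long estimateBytes(long expandedTerms, long bytesPerTerm) {
        return Math.multiplyExact(expandedTerms, bytesPerTerm);
    }

    public static void main(String[] args) {
        HeapCircuitBreaker breaker = new HeapCircuitBreaker(1_000_000L);

        long small = estimateBytes(1_000L, 64L);
        breaker.reserve(small);
        System.out.println("small query admitted, reserved=" + breaker.reserved());

        try {
            breaker.reserve(estimateBytes(50_000_000L, 64L));
        } catch (CircuitBreakingException e) {
            System.out.println("runaway query rejected: " + e.getMessage());
        } finally {
            breaker.release(small);
        }
    }
}
```

The hard part in practice is the estimator, not the breaker: a nested proximity wildcard
would need some way to approximate how many terms it expands to before it runs.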

If anyone has any suggestions on how we can limp by for the time being, that would also be
greatly appreciated (unfortunately our user base needs to keep using nested proximity wildcards,
but we are willing to have mechanisms in place to kill a subset of problematic queries).

> Don't allow runaway queries from harming Solr cluster health or search performance
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5986
>                 URL: https://issues.apache.org/jira/browse/SOLR-5986
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Steve Davids
>            Priority: Critical
>             Fix For: 4.9


