lucene-dev mailing list archives

From "Per Steffensen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3582) Leader election zookeeper watcher is responding to con/discon notifications incorrectly.
Date Thu, 28 Jun 2012 12:03:44 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403041#comment-13403041 ]

Per Steffensen commented on SOLR-3582:
--------------------------------------

Trym didn't mention it, but this is not a negligible problem that will never matter in
real-world usage. We actually discovered it during one of the performance/endurance tests
of our real-world application, in a real-world setup and under a real-world (high) workload.
We are running numerous Solr instances in a SolrCloud cluster, with numerous collections,
each having about 25 slices, each with 2 shards (one replica for each slice). During the test,
Solr instances lose their ZK connection (probably due to overly long GC pauses) and reconnect,
resulting in more watchers. The next time a disconnect/reconnect to ZK happens, each instance
gets many watcher events, resulting in even more watchers for the next time. All in all, seen
from the outside, this breaks our performance/endurance test: at first things start to slow
down, and eventually the JVMs break down with OOM errors. The problem is self-reinforcing,
because on every iteration the garbage collector has to spend more time collecting watchers
(twice as many as last time), which increases the probability of new ZK timeouts, and more
time has to be spent creating new watchers (twice as many as last time).
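
To make the doubling concrete, here is a minimal sketch of the growth described above. It is a
toy model, not the ZooKeeper API and not the Solr code: the class WatcherGrowthSketch, the
EventKind enum and the ToyWatcher interface are hypothetical names made up for illustration.
The model assumes that a connection-state notification reaches every registered watcher
without consuming it, and that the buggy handler reacts by registering one more watcher.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one client session; NOT the ZooKeeper API (all names are hypothetical). */
final class WatcherGrowthSketch {

    /** Assumed event kinds: a connection-state change versus a change to the watched node. */
    enum EventKind { CONNECTION_STATE, NODE_CHANGED }

    /** Stand-in for a watcher callback. */
    interface ToyWatcher {
        void process(EventKind kind);
    }

    private final List<ToyWatcher> registered = new ArrayList<>();
    private final boolean ignoreConnectionEvents;

    WatcherGrowthSketch(boolean ignoreConnectionEvents) {
        this.ignoreConnectionEvents = ignoreConnectionEvents;
        registerWatcher(); // the session starts with a single watcher
    }

    private void registerWatcher() {
        registered.add(kind -> {
            if (ignoreConnectionEvents && kind == EventKind.CONNECTION_STATE) {
                return; // "fixed" handler: not a node change, so nothing to re-check
            }
            // Stands in for re-running the election check, which sets a new watch.
            registerWatcher();
        });
    }

    /** Delivers one event to every watcher registered at the time of delivery. */
    void deliver(EventKind kind) {
        List<ToyWatcher> snapshot = new ArrayList<>(registered);
        if (kind == EventKind.NODE_CHANGED) {
            registered.clear(); // model assumption: node-change watches are one-shot
        }
        for (ToyWatcher watcher : snapshot) {
            watcher.process(kind);
        }
    }

    int watcherCount() {
        return registered.size();
    }

    /** Number of registered watchers after the given number of connection-state events. */
    static int countAfterConnectionEvents(boolean ignoreConnectionEvents, int events) {
        WatcherGrowthSketch session = new WatcherGrowthSketch(ignoreConnectionEvents);
        for (int i = 0; i < events; i++) {
            session.deliver(EventKind.CONNECTION_STATE);
        }
        return session.watcherCount();
    }

    public static void main(String[] args) {
        System.out.println("buggy, 3 events: " + countAfterConnectionEvents(false, 3)); // 8
        System.out.println("fixed, 3 events: " + countAfterConnectionEvents(true, 3));  // 1
    }
}
```

In this model ten connection-state events already leave 1024 watchers behind with the buggy
handler, while the handler that ignores connection-state events stays at one.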

I think you should commit the fix, basically because it makes a (our) real-world application
able to run for a long time, which it wasn't before. Commit the fix not so much for our sake,
because we are using our own build of Solr anyway (including this fix, other fixes, and a nice
implementation of optimistic locking, etc.; see SOLR-3173, SOLR-3178, etc.), but to save others
(who might also be among the "first movers" using Solr 4.0 for high-scale real-world
applications) from having to spend weeks tracking down the essence of this issue and making a fix.

If you think this observation/fix should lead to a walkthrough of the code, to check whether
watchers are used undesirably in other places, and maybe even to a more generic fix, I would
endorse such a task. But for now I urge you to commit, to save others from weeks of debugging.
If/when you come up with a better or more generic solution, you can always refactor.

Regards, Per Steffensen
                
> Leader election zookeeper watcher is responding to con/discon notifications incorrectly.
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-3582
>                 URL: https://issues.apache.org/jira/browse/SOLR-3582
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 4.0, 5.0
>
>
> As brought up by Trym R. Møller on the mailing list, we are responding to watcher events
> about connection/disconnection as if they were notifications about node changes.
> http://www.lucidimagination.com/search/document/e13ef390b88eeee2

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

