lucene-dev mailing list archives

From "Joshua Humphries (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
Date Wed, 15 Mar 2017 13:52:41 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926204#comment-15926204 ]

Joshua Humphries edited comment on SOLR-7191 at 3/15/17 1:51 PM:
-----------------------------------------------------------------

Our cluster has many thousands of collections, most of which have only a single shard and a single replica. Restarting a single node takes over two minutes in good circumstances (an expected restart, such as during upgrades of Solr or deployment of new/updated plugins). In bad circumstances, such as when machines appear wedged and leader-election issues have already caused the overseer queue to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size during our latest observation of this slowness, I saw that the down-node messages take *way* too long to process. I tracked that down to an issue where processing a down-node message results in a ZK write for *every* collection, not just the collections that had shard replicas on that node. In our case, it was processing about 40 times too many collections, making a rolling restart of the whole cluster effectively O(n^2) instead of O(n) in terms of the writes to ZK.

See SOLR-10277.
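To make the shape of the fix concrete, here is a minimal sketch (hypothetical names and data shapes, not the actual Overseer or ZkStateReader code): the down-node handler should restrict its cluster-state updates to collections that actually have a replica on the downed node, rather than touching every collection.

```java
import java.util.*;

// Hypothetical sketch of the down-node handling described above.
// The class and method names are illustrative, not Solr's real API.
public class DownNodeSketch {

    // collections: collection name -> set of node names hosting its replicas.
    // Returns only the collections that need a ZK state write when downNode
    // goes down; writing state for every collection is what made a rolling
    // restart O(n^2) in ZK writes instead of O(n).
    static Set<String> collectionsToUpdate(Map<String, Set<String>> collections,
                                           String downNode) {
        Set<String> affected = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : collections.entrySet()) {
            if (e.getValue().contains(downNode)) {
                affected.add(e.getKey());
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> collections = new HashMap<>();
        collections.put("c1", Set.of("node1", "node2"));
        collections.put("c2", Set.of("node3"));
        collections.put("c3", Set.of("node1"));
        // Only c1 and c3 have replicas on node1, so only they need writes.
        System.out.println(collectionsToUpdate(collections, "node1"));
    }
}
```

With thousands of single-replica collections spread over many nodes, this filter cuts the per-restart writes from "all collections" down to roughly "collections ÷ nodes", which is the ~40x factor observed above.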



> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Noble Paul
>              Labels: performance, scalability
>             Fix For: 6.3
>
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log, SOLR-7191.patch, SOLR-7191.patch,
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0)
> is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems
> myself even before I was able to get 4000 collections created on a 5.0 example cloud setup.
> Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability.
> It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud')
> and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

