lucene-solr-user mailing list archives

From Jeff Wartes <>
Subject Re: SolrCloud - Strategy for recovering cluster states
Date Tue, 01 Mar 2016 21:36:15 GMT

I’ve been running SolrCloud clusters in various versions for a few years here, and I can
only think of two or three cases where the ZK-stored cluster state was broken in a way that
forced me to intervene by hand-editing the contents of ZK. I think I’ve seen Solr fixes
go by for those cases, too. I’ve never completely wiped ZK. (Although, granted, my ZK cluster
has been pretty stable, and my collection count is smaller than yours.)

My philosophy is that ZK is the source of cluster configuration, not the collection of
files on the nodes.
Currently, cluster state is shared between ZK and the core directories. I’d prefer (and I think
Solr development is going this way; see SOLR-7269) that all cluster state exist in and be managed
via ZK, and that all state be removed from the local disk of the cluster nodes. The fact that a
node uses local disk-based configuration to figure out which collections/replicas it has is
something that should be fixed, in my opinion.

If you’re frequently getting into bad states due to ZK issues, I’d suggest you file bugs
against Solr describing how you got into that state, and then fix your ZK cluster.

Failing that, can you just periodically back up your ZK data and restore it if something breaks?
A few years ago I wrote a little tool to watch clusterstate.json and write every version to a
local git repo. I was mostly interested because I wanted to see changes that happened too
fast to observe directly, but it could also serve as a backup approach. Here’s a link, although
I clearly haven’t touched it lately. Feel free to ask if you have issues:
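The idea behind such a tool can be sketched in a few lines. The function below only shows the archiving logic: keep every distinct version of clusterstate.json as a timestamped snapshot, skipping writes when nothing changed. A real version would feed it from a ZooKeeper watch (e.g. kazoo's DataWatch) and could commit to git instead of plain files; those parts are assumed, not shown.

```python
# Sketch only: archive each distinct version of the cluster state JSON.
# In a real deployment, state_json would arrive via a ZooKeeper watch.
import hashlib
import os
import time

def archive_state(state_json, archive_dir):
    """Write state_json to archive_dir, but only if it differs from the
    most recent snapshot. Returns the new snapshot path, or None if the
    state is unchanged."""
    os.makedirs(archive_dir, exist_ok=True)
    digest = hashlib.sha1(state_json.encode("utf-8")).hexdigest()
    snapshots = sorted(os.listdir(archive_dir))
    if snapshots:
        with open(os.path.join(archive_dir, snapshots[-1])) as f:
            if hashlib.sha1(f.read().encode("utf-8")).hexdigest() == digest:
                return None  # no change since the last snapshot
    # Timestamped, content-addressed filename so snapshots sort in order.
    path = os.path.join(archive_dir, "%d-%s.json" % (time.time_ns(), digest[:8]))
    with open(path, "w") as f:
        f.write(state_json)
    return path
```

Writing only on change keeps the archive small even if the watcher polls frequently.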

On 3/1/16, 12:09 PM, "danny teichthal" <> wrote:

>Just summarizing my questions, in case the long mail is a little intimidating:
>1. Is there a best practice/automated tool for overcoming problems in
>cluster state caused by zookeeper disconnections?
>2. Creating a collection via core admin is discouraged; is that also true for
> core discovery?
>I would like to be able to specify collection.configName so that, when
> starting the server, the collection will be created
>and linked to the config name specified.
>On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <> wrote:
>> Hi,
>> I would like to describe a process we use for overcoming problems in
>> cluster state when we have networking issues. Would appreciate if anyone
>> can answer about what are the flaws on this solution and what is the best
>> practice for recovery in case of network problems involving zookeeper.
>> I'm working with Solr Cloud version 5.2.1, with
>> ~100 collections in a cluster of 6 machines.
>> This is the short procedure:
>> 1. Bring all the cluster down.
>> 2. Clear all data from zookeeper.
>> 3. Upload configuration.
>> 4. Restart the cluster.
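Steps 2-3 of the procedure above can be made concrete with the zkcli.sh script that ships with Solr 5.x under server/scripts/cloud-scripts/. A minimal sketch, with placeholder hosts, paths, and config name; it only assembles the command lines (dry run) rather than executing them against a live ensemble:

```python
# Sketch of steps 2-3 as zkcli.sh invocations. ZKHOST, ZKCLI, and the
# config name are placeholders; set dry_run=False to actually execute.
import subprocess

ZKHOST = "zk1:2181,zk2:2181,zk3:2181"            # placeholder ensemble address
ZKCLI = "server/scripts/cloud-scripts/zkcli.sh"  # path inside the Solr install

def zkcli(*args, dry_run=True):
    """Build (and optionally run) a zkcli.sh command line."""
    cmd = [ZKCLI, "-zkhost", ZKHOST] + list(args)
    if not dry_run:
        subprocess.check_call(cmd)  # runs the step for real
    return cmd

# Step 2: clear all data from ZooKeeper.
clear_cmd = zkcli("-cmd", "clear", "/")
# Step 3: upload the (single) configuration set.
upconfig_cmd = zkcli("-cmd", "upconfig",
                     "-confdir", "myconf/conf", "-confname", "myconf")
```

Step 1 and step 4 are just stopping and restarting the Solr nodes, so they are omitted here.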
>> We rely on the fact that a collection is created during the core discovery
>> process if it does not exist. This gives us a lot of flexibility.
>> When the cluster comes up, it reads the core definitions and creates the
>> collections if needed.
>> Since we have only one configuration, the collections are automatically
>> linked to it and the cores inherit it from the collection.
>> This is a very robust procedure, that helped us overcome many problems
>> until we stabilized our cluster which is now pretty stable.
>> I know that the leader might change in such a case and updates may be lost,
>> but that is OK for us.
>> The problem is that today I want to add a new config set.
>> When I add it and clear zookeeper, the cores cannot be created, because
>> there are now two configurations. This breaks my recovery procedure.
>> I thought about a few options:
>> 1. Put the config name in the core definition - this doesn't work. (It is
>> supported in CoreAdminHandler, but is discouraged according to the
>> documentation.)
>> 2. Change recovery procedure to not delete all data from zookeeper, but
>> only relevant parts.
>> 3. Change recovery procedure to delete all, but recreate and link
>> configurations for all collections before startup.
>> Option #1 is my favorite because it is very simple. It is currently not
>> supported, but from looking at the code it does not seem complex to
>> implement.
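One way to picture option #2 above: keep the ZK subtrees that hold configuration and the collection-to-config links, and delete only the state that a restart rebuilds. A hedged sketch assuming the standard Solr znode layout; a real implementation would walk the tree with a ZK client such as kazoo rather than use this hard-coded list:

```python
# Sketch of option #2: selective deletion instead of wiping the ZK root.
# Paths follow the usual Solr layout; adjust for your cluster.
PRESERVE = [
    "/configs",            # uploaded config sets
    "/collections",        # holds the per-collection configName link
]
DELETE = [
    "/clusterstate.json",  # replica/leader state, rebuilt on restart
    "/overseer",           # overseer work queues
    "/overseer_elect",     # overseer election state
    "/live_nodes",         # children are ephemeral anyway
]

def should_delete(path):
    """Return True if `path` falls under one of the deletable subtrees."""
    return any(path == p or path.startswith(p + "/") for p in DELETE)
```

The point of the split is that /configs and /collections survive, so the configuration re-upload step (and the config-linking problem) goes away.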
>> My questions are:
>> 1. Is there something wrong in the recovery procedure that I described ?
>> 2. What is the best way to fix problems in cluster state, other than
>> editing clusterstate.json manually? Is there an automated tool for that? We
>> have about 100 collections in a cluster, so manual editing is not really a
>> solution.
>> 3. Is creating a collection via core discovery also discouraged?
>> Would very much appreciate any answers/thoughts on that.
>> Thanks,