lucene-solr-user mailing list archives

From Jeff Wartes <>
Subject Re: SolrCloud - Strategy for recovering cluster states
Date Wed, 02 Mar 2016 17:54:05 GMT
Well, with the understanding that someone who isn’t involved in the process is describing
something that isn’t built yet...

I could imagine changes like:
 - Core discovery ignores cores that aren’t present in the ZK cluster state
 - New cores are automatically created to bring a node in line with the ZK cluster state (via addreplica, etc.)
So if the clusterstate said “node XYZ has a replica of shard3 of collection1 and that’s
all”, and you downed node XYZ and deleted the data directory, it’d get restored when you
started the node again. And if you copied the core directory for shard1 of collection2 in
there and restarted the node, it’d get ignored because the clusterstate says node XYZ doesn’t
have that.
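For reference, the per-collection state stored in ZK looks roughly like this (an illustrative fragment only — node and core names here are made up, not from a real cluster):

```json
{
  "collection1": {
    "shards": {
      "shard3": {
        "replicas": {
          "core_node1": {
            "core": "collection1_shard3_replica1",
            "node_name": "nodeXYZ:8983_solr",
            "base_url": "http://nodeXYZ:8983/solr",
            "state": "active"
          }
        }
      }
    }
  }
}
```

Core discovery as described above would consult this map, rather than whatever core directories happen to exist on disk.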

More importantly, if you completely destroyed a node and rebuilt it from an image (AWS?),
that image wouldn't need any special core directories specific to that node. As long as the
node name was the same, Solr would handle bringing that node back to where it was in the cluster.

Back to opinions, I think mixing the cluster definition between local disk on the nodes and
ZK clusterstate is just confusing. It should really be one or the other. Specifically, I think
it should be local disk for non-SolrCloud, and ZK for SolrCloud.

On 3/2/16, 12:13 AM, "danny teichthal" <> wrote:

>Thanks Jeff,
>I understand your philosophy and it sounds correct.
>Since we had many problems with zookeeper when switching to Solr Cloud, we
>couldn't treat it as the source of truth and had to rely on a more stable source.
>The issue is that when we got such a zookeeper event, it brought our
>system down, and in those cases clearing the zookeeper data was a lifesaver.
>We've managed to make it pretty stable now, but we will always need a
>"dooms day" weapon.
>I looked into the related JIRA and it confused me a little, and raised a
>few other questions:
>1. What exactly defines zookeeper as a truth?
>2. What is the role of core.properties if the state is only in zookeeper?
>Your tool is very interesting; I had thought about writing such a tool myself.
>From the sources I understand that you represent each node as a path in the
>git repository.
>So, I guess that for restore purposes I will have to go
>in the opposite direction and create a ZK node for every path entry.
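A restore along those lines could look something like this (a sketch only, assuming the backup repo mirrors ZK paths as files and that the third-party kazoo client library is available; the function names are mine, not from Jeff's tool):

```python
import os

def repo_paths_to_znodes(root, file_paths):
    """Map backup files (under the repo root) to the znode paths to recreate."""
    znodes = []
    for p in sorted(file_paths):
        rel = os.path.relpath(p, root)
        znodes.append("/" + rel.replace(os.sep, "/"))
    return znodes

def restore(zk_hosts, root):
    """Walk the backup tree and recreate every file as a znode (hypothetical)."""
    from kazoo.client import KazooClient  # assumed dependency: pip install kazoo
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        for dirpath, dirnames, files in os.walk(root):
            dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git metadata
            for name in files:
                full = os.path.join(dirpath, name)
                with open(full, "rb") as fh:
                    data = fh.read()
                znode = repo_paths_to_znodes(root, [full])[0]
                zk.create(znode, data, makepath=True)  # makepath creates parents
    finally:
        zk.stop()
```

The path mapping is the only interesting part; everything else is a straight walk-and-write.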
>On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <> wrote:
>> I’ve been running SolrCloud clusters in various versions for a few years
>> here, and I can only think of two or three cases that the ZK-stored cluster
>> state was broken in a way that I had to manually intervene by hand-editing
>> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
>> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
>> been pretty stable, and my collection count is smaller than yours)
>> My philosophy is that ZK is the source of cluster configuration, not the
>> collection of files on the nodes.
>> Currently, cluster state is shared between ZK and core directories. I’d
>> prefer, and I think Solr development is going this way, (SOLR-7269) that
>> all cluster state exist and be managed via ZK, and all state be removed
>> from the local disk of the cluster nodes. The fact that a node uses local
>> disk based configuration to figure out what collections/replicas it has is
>> something that should be fixed, in my opinion.
>> If you’re frequently getting into bad states due to ZK issues, I’d suggest
>> you file bugs against Solr for the fact that you got into the state, and
>> then fix your ZK cluster.
>> Failing that, can you just periodically back up your ZK data and restore
>> it if something breaks? I wrote a little tool to watch clusterstate.json
>> and write every version to a local git repo a few years ago. I was mostly
>> interested because I wanted to see changes that happened pretty fast, but
>> it could also serve as a backup approach. Here’s a link, although I clearly
>> haven’t touched it lately. Feel free to ask if you have issues:
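The core of such a watcher is just "record each distinct version as it changes"; a minimal sketch of that idea (my own illustration, not the actual tool — a real version would hook a ZK data watch and commit each version to git rather than keep it in memory):

```python
class StateHistory:
    """Keep every distinct version of clusterstate.json, like a tiny git log."""

    def __init__(self):
        self.versions = []

    def record(self, data):
        """Called on each watch event; store only actual changes."""
        if not self.versions or self.versions[-1] != data:
            self.versions.append(data)
            return True   # new version recorded
        return False      # unchanged, nothing to do
```

In practice something like kazoo's DataWatch (or Solr's own ZK client) would invoke record() on each change, and the store step would be a git commit instead of a list append.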
>> On 3/1/16, 12:09 PM, "danny teichthal" <> wrote:
>> >Hi,
>> >Just summarizing my questions if the long mail is a little intimidating:
>> >1. Is there a best practice/automated tool for overcoming problems in
>> >cluster state coming from zookeeper disconnections?
>> >2. Creating a collection via core admin is discouraged, is it true also
>> >for core discovery?
>> >
>> >I would like to be able to specify collection.configName in
>> >core.properties, and when starting the server, the collection will be created
>> >and linked to the config name specified.
>> >
>> >
>> >
>> >On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal <>
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >>
>> >> I would like to describe a process we use for overcoming problems in
>> >> cluster state when we have networking issues. I would appreciate it if
>> >> anyone could point out the flaws in this solution, and what the best
>> >> practice is for recovery in case of network problems involving zookeeper.
>> >> I'm working with Solr Cloud version 5.2.1,
>> >> ~100 collections in a cluster of 6 machines.
>> >>
>> >> This is the short procedure:
>> >> 1. Bring all the cluster down.
>> >> 2. Clear all data from zookeeper.
>> >> 3. Upload configuration.
>> >> 4. Restart the cluster.
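The four steps above can be sketched as a command list (the paths, hosts, and zkcli.sh command names here are my assumptions based on a typical Solr 5.x install — verify against your own layout before running anything, and note that stop/start would be per-node in practice):

```python
def recovery_commands(solr, zkcli, zkhost, confdir, confname):
    """Return the shell commands for the four-step recovery, in order."""
    return [
        [solr, "stop", "-all"],                            # 1. bring the cluster down
        [zkcli, "-zkhost", zkhost, "-cmd", "clear", "/"],  # 2. clear zookeeper data
        [zkcli, "-zkhost", zkhost, "-cmd", "upconfig",     # 3. upload configuration
         "-confdir", confdir, "-confname", confname],
        [solr, "start", "-cloud", "-z", zkhost],           # 4. restart the cluster
    ]
```

Keeping it as a pure function makes the order explicit and easy to review before feeding each entry to subprocess.run().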
>> >>
>> >> We rely on the fact that a collection is created during the core discovery
>> >> process if it does not already exist. This gives us much flexibility.
>> >> When the cluster comes up, it reads from core.properties and creates the
>> >> collections if needed.
>> >> Since we have only one configuration, the collections are automatically
>> >> linked to it and the cores inherit it from the collection.
>> >> This is a very robust procedure, that helped us overcome many problems
>> >> until we stabilized our cluster which is now pretty stable.
>> >> I know that the leader might change in such a case and we may lose
>> >> updates, but that is OK.
>> >>
>> >>
>> >> The problem is that today I want to add a new config set.
>> >> When I add it and clear zookeeper, the cores cannot be created because
>> >> there are 2 configurations. This breaks my recovery procedure.
>> >>
>> >> I thought about a few options:
>> >> 1. Put the config name in core.properties - this doesn't work. (It is
>> >> supported in CoreAdminHandler, but is discouraged according to the
>> >> documentation.)
>> >> 2. Change recovery procedure to not delete all data from zookeeper, but
>> >> only relevant parts.
>> >> 3. Change recovery procedure to delete all, but recreate and link
>> >> configurations for all collections before startup.
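For illustration, option #1 would amount to a core.properties along these lines (hypothetical — the whole point is that the configName line is not honored during core discovery today; the other property names follow the usual discovery format):

```properties
name=collection1_shard1_replica1
collection=collection1
shard=shard1
collection.configName=myconf
```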
>> >>
>> >> Option #1 is my favorite because it is very simple. It is currently not
>> >> supported, but from looking at the code it seems not too complex to
>> >> implement.
>> >>
>> >>
>> >>
>> >> My questions are:
>> >> 1. Is there something wrong in the recovery procedure that I described ?
>> >> 2. What is the best way to fix problems in cluster state, other than
>> >> editing clusterstate.json manually? Is there an automated tool for that?
>> >> We have about 100 collections in a cluster, so manual editing is not
>> >> really a solution.
>> >> 3. Is creating a collection via core discovery also discouraged?
>> >>
>> >>
>> >>
>> >> Would very much appreciate any answers/thoughts on that.
>> >>
>> >>
>> >> Thanks,
>> >>