lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Replication error and Shard Inconsistencies..
Date Wed, 05 Dec 2012 00:04:08 GMT
Hey Annette, 

Are you using Solr 4.0 final? A version of 4x or 5x?

Do you have the logs for when the replica tried to catch up to the leader?

Stopping and starting the node is actually a fine thing to do. Perhaps you can try it again
and capture the logs.

If a node is not listed as live but is in the clusterstate, that is fine. It shouldn't be
consulted. To remove it, you either have to unload it with the core admin API or you can
manually delete its registered state under the node states node that the Overseer looks at.
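As a rough sketch of the unload (hypothetical host, port and core name - adjust for your setup, and note the request has to go to the node that still hosts the core):

import urllib.request

# Hypothetical node and core name - substitute the replica you want to drop.
host = "http://solr-node:8983"
core = "collection1_shard4_replica2"

# CoreAdmin UNLOAD removes the core from that node; the stale entry should then
# drop out of clusterstate.json as well.
url = f"{host}/solr/admin/cores?action=UNLOAD&core={core}"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())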

Also, it would be useful to see the logs of the new node coming up… there should be info
about what happens when it tries to replicate.

It almost sounds like replication is just not working for your setup at all and that you have
to tweak some configuration. You shouldn't see these nodes as active then though - so we should
get to the bottom of this.
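A quick way to sanity check that is to ask the replication handler on the lagging replica for its details (again just a sketch, with hypothetical host, port and core name):

import json
import urllib.request

# Hypothetical follower node and core name for the shard that is behind.
host = "http://solr-follower:8983"
core = "collection1_shard4_replica1"

# command=details reports the index version/generation and whether a replication
# is in progress; if this returns a 404, the handler probably isn't defined in
# solrconfig.xml, and recovery won't be able to pull the index from the leader.
url = f"{host}/solr/{core}/replication?command=details&wt=json"
with urllib.request.urlopen(url) as resp:
    print(json.dumps(json.load(resp), indent=2))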

- Mark

On Dec 4, 2012, at 4:37 AM, Annette Newton <annette.newton@servicetick.com> wrote:

> Hi all,
>  
> I have quite a weird issue with Solr Cloud.  I have a 4 shard, 2 replica setup. Yesterday
> one of the nodes lost communication with the cloud, which resulted in it trying to run
> replication. This failed, leaving me with a shard (Shard 4) that has 2,833,940 documents
> on the leader and 409,837 on the follower - obviously a big discrepancy, and it leads to
> queries returning differing results depending on which of these nodes the data comes from.
> There is no indication of a problem on the admin site other than the big discrepancy in
> the number of documents.  They are all marked as active etc…
>  
> So I thought I would force replication to happen again by stopping and starting Solr
> (probably the wrong thing to do), but this resulted in no change.  So I turned off that
> node and replaced it with a new one.  In ZooKeeper, the live nodes list doesn't include
> that machine, but it is still shown as active in the clusterstate.json; I have attached
> images showing this…  This means the new node hasn't replaced the old node but is now a
> replica on Shard 1!  Also, that node doesn't appear to have replicated Shard 1's data
> anyway; it never got marked as replicating or anything…
>  
> How do I clear the ZooKeeper state without taking down the entire Solr Cloud setup?
> How do I force a node to replicate from the others in the shard?
>  
> Thanks in advance.
>  
> Annette Newton
>  
>  
> <LiveNodes.zip>

