lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Timothy Potter (JIRA)" <>
Subject [jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
Date Wed, 11 Dec 2013 23:35:07 GMT


Timothy Potter commented on SOLR-4260:

I don't have fix yet, but I wanted to post an update here to get some feedback on what I'm
seeing ...

I have a simple SolrCloud configuration setup locally: 1 collection named "cloud" with 1 shard
and replicationFactor 2, i.e. here's what I use to create it:
curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

The collection gets distributed on two nodes: cloud84:8984 and cloud85:8985 with cloud84 being
assigned the leader.

Here's an outline of the process I used to get my collection out-of-sync during indexing:

1) start indexing docs using CloudSolrServer in SolrJ - direct updates go to the leader and
replica remains in sync for as long as I let this process run
2) kill -9 the process for the replica cloud85
3) let indexing continue against cloud84 for a few seconds (just to get the leader and replica
out-of-sync once I bring the replica back online)
4) kill -9 the process for the leader cloud84 ... indexing halts of course as there are no
running servers
5) start the replica cloud85 but do not start the previous leader cloud84

Here are some key log messages as cloud85 - the replica - fires up ... my annotations of the
log messages are prefixed by [TJP >>

2013-12-11 11:43:22,076 [main-EventThread] INFO  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 1)
2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO
 - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139841

[TJP >> This looks good and is expected because cloud85 was not the leader before it
died, so it should not immediately assume it is the leader until it sees more replicas

6) now start the previous leader cloud84 ...

Here are some key log messages from cloud85 as the previous leader cloud84 is coming up ...

2013-12-11 11:43:24,085 [main-EventThread] INFO  - Updating live
nodes... (2)
2013-12-11 11:43:24,136 [main-EventThread] INFO  - LatchChildWatcher
fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2013-12-11 11:43:24,137 [Thread-13] INFO  - Updating cloud state
from ZooKeeper... 
2013-12-11 11:43:24,138 [Thread-13] INFO  - Update state numShards=1

[TJP >> state of cloud84 looks correct as it is still initializing ...

2013-12-11 11:43:24,140 [main-EventThread] INFO  - LatchChildWatcher
fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2013-12-11 11:43:24,141 [main-EventThread] INFO  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 2)

2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO
 - Enough replicas found to continue.

[TJP >> hmmmm ... cloud84 is listed in /live_nodes but it isn't "active" yet or even
recovering (see state above - it's currently "down") ... My thinking here is that the ShardLeaderElectionContext
needs to take the state of the replica into account before deciding it should continue.

2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO
 - I may be the new leader - try and sync
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO  - Sync
replicas to http://cloud85:8985/solr/cloud_shard1_replica1/
2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO  solr.update.PeerSync  - PeerSync:
core=cloud_shard1_replica1 url=http://cloud85:8985/solr START replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/]
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN  solr.update.PeerSync  - PeerSync:
core=cloud_shard1_replica1 url=http://cloud85:8985/solr  couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/,
counting as success

[TJP >> whoops! of course it couldn't connect to cloud84 as it's still initializing

2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO  solr.update.PeerSync  - PeerSync:
core=cloud_shard1_replica1 url=http://cloud85:8985/solr DONE. sync succeeded
2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO  - Sync
Success - now sync replicas to me
2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO  - http://cloud85:8985/solr/cloud_shard1_replica1/:
try and ask http://cloud84:8984/solr/cloud_shard1_replica2/ to sync
2013-12-11 11:43:25,938 [coreLoadExecutor-3-thread-1] ERROR  - Sync
request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection
at: http://cloud84:8984/solr/cloud_shard1_replica2

[TJP >> ayep, cloud84 is still initializing so it can't respond to you Mr. Impatient

2013-12-11 11:43:25,939 [coreLoadExecutor-3-thread-1] INFO  - http://cloud85:8985/solr/cloud_shard1_replica1/:
Sync failed - asking replica (http://cloud84:8984/solr/cloud_shard1_replica2/) to recover.
2013-12-11 11:43:25,940 [coreLoadExecutor-3-thread-1] INFO
 - I am the new leader: http://cloud85:8985/solr/cloud_shard1_replica1/ shard1

[TJP >> oh no! the collection is now out-of-sync ... my test harness periodically polls
the replicas for their doc counts and at this point, we ended up with:
shard1: {
	http://cloud85:8985/solr/cloud_shard1_replica1/ = 300800 LEADER
	http://cloud84:8984/solr/cloud_shard1_replica2/ = 447600 diff:-146800  <--- this should
be the real leader!

Which of course is expected because cloud85 should *NOT* be the leader

So all that is interesting, but how to fix???

My first idea was to go tackle the decision making process ShardLeaderElectionContext uses
to decide if it has enough replicas to continue. 

It's easy enough to do something like the following:
        int notDownCount = 0;
        Map<String,Replica> replicasMap = slices.getReplicasMap();
        for (Replica replica : replicasMap.values()) {
          ZkCoreNodeProps replicaCoreProps = new ZkCoreNodeProps(replica);
          String replicaState = replicaCoreProps.getState();
          log.warn(">>>> State of replica "+replica.getName()+" is "+replicaState+"
          if ("active".equals(replicaState) || "recovering".equals(replicaState)) {

Was thinking I could use the notDownCount to make a better decision, but then I ran into another
issue related to replica state being stale. In my cluster, if I have /clusterstate.json:


If I kill the process using kill -9 PID for the Solr running on 8985 (the replica), core_node2's
state remains "active" in /clusterstate.json

When tailing the log on core_node1, I do see one notification coming in the watcher setup
by ZkStateReader from ZooKeeper about live nodes having changed:
2013-12-11 15:42:46,010 [main-EventThread] INFO  - Updating live
nodes... (1)

So after killing the process, /live_nodes is updated to only have one node, but /clusterstate.json
still thinks there are 2 healthy replicas for shard1, instead of just 1.

Of course, if I restart 8985, then it goes through a series of state changes until it is marked
active again, which looks correct.

Bottom line ... it seems there is something in SolrCloud that does not update a replica's
state when the node is killed. If a change to /live_nodes doesn't trigger a refresh of replica
state, what does?

I'm seeing this stale replica state issue in Solr 4.6.0 and in revision 1550300 of branch_4x
- the latest from svn.

Not having a fresh state of a replica prevents my idea for fixing ShardLeaderElectionContext's
decision making process. I'm also curious about the decision to register a node under /live_nodes
before it is fully initialized, but maybe that is a discussion for another time.

In any case, I wanted to get some feedback on my findings before moving forward with a solution.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>                 Key: SOLR-4260
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment:
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>         Attachments:,, clusterstate.png
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer
we see inconsistencies between the leader and replica for some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have a small
deviation in then number of documents. The leader and slave deviate for roughly 10-20 documents,
not more.
> Results hopping ranks in the result set for identical queries got my attention, there
were small IDF differences for exactly the same record causing a record to shift positions
in the result set. During those tests no records were indexed. Consecutive catch all queries
also return different number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor of two and
frequently reindex using a fresh build from trunk. I've not seen this issue for quite some
time until a few days ago.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message