lucene-dev mailing list archives

From "Timothy Potter (JIRA)" <>
Subject [jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
Date Mon, 23 Dec 2013 23:06:52 GMT


Timothy Potter commented on SOLR-4260:

Found another interesting case that may or may not be valid, depending on whether we think
HTTP requests between a leader and replica can fail even while the ZooKeeper session on the
replica stays alive.

Specifically, what I'm seeing is that if an update request from the leader to the replica
fails, but the replica doesn't lose its session with ZK, then the replica can get out of sync
with the leader. In a real network partition, the ZK connection would also likely be lost
and the replica would get marked as down. So as long as the HTTP connection timeout between
the leader and replica exceeds the ZK client timeout, the replica would probably recover correctly,
rendering this test case invalid. So maybe the main question here is whether we think it's
possible for HTTP requests between a leader and replica to fail even though the ZooKeeper
connection stays alive?
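For context on that timeout relationship: in a 4.x-style solr.xml the ZK client timeout and the leader-to-replica HTTP update timeouts are configured independently, so nothing forces the HTTP timeouts to exceed zkClientTimeout. A sketch of the relevant section (element names from the stock 4.x solr.xml; the values shown are illustrative, not recommendations):

```xml
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- how long a replica can lose its ZK connection before its
         session drops and it gets marked down -->
    <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
    <!-- HTTP timeouts the leader uses when forwarding updates to
         replicas; set independently of zkClientTimeout -->
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  </solrcloud>
</solr>
```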

Here's the steps I used to reproduce this case (all using revision 1553150 in branch_4x):

*> STEP 1: Setup a collection named “cloud” containing 1 shard and 2 replicas on hosts
cloud84 and cloud85*

$SOLR_TOP/cloud84/cloud-scripts/ -zkhost $ZK_HOST -cmd upconfig -confdir $SOLR_TOP/cloud84/solr/cloud/conf
-confname cloud
curl -v "$API?action=CREATE&name=cloud&replicationFactor=2&numShards=1&collection.configName=cloud"

Replica on cloud84 is elected as the initial leader. /clusterstate.json looks like:


*> STEP 2: Simulate network partition*

sudo iptables -I INPUT 1 -i lo -p tcp --sport 8985 -j DROP; sudo iptables -I INPUT 2 -i lo
-p tcp --dport 8985 -j DROP

Various ways to do this, but to keep it simple, I'm just dropping inbound traffic on localhost
to port 8985.

*> STEP 3: Send document with ID “doc1” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' \
  --data-binary '<add><doc><field name="id">doc1</field><field name="foo_s">bar</field></doc></add>'

The update request takes some time because the replica is down but ultimately succeeds on
the leader. In the logs on the leader, we have (some stack trace lines removed for clarity):

2013-12-23 10:59:33,688 [updateExecutor-1-thread-1] ERROR solr.update.StreamingSolrServers
 - error
org.apache.http.conn.HttpHostConnectException: Connection to http://cloud85:8985 refused
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$
Caused by: Connection timed out
2013-12-23 10:59:33,695 [qtp1073932139-16] INFO  update.processor.LogUpdateProcessor  - [cloud_shard1_replica1]
webapp=/solr path=/update params={} {add=[doc1 (1455228778490888192)]} 0 63256
2013-12-23 10:59:33,702 [updateExecutor-1-thread-2] INFO  update.processor.DistributedUpdateProcessor
 - try and ask http://cloud85:8985/solr to recover
2013-12-23 10:59:48,718 [updateExecutor-1-thread-2] ERROR update.processor.DistributedUpdateProcessor
 - http://cloud85:8985/solr: Could not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at: http://cloud85:8985/solr
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(
        at org.apache.solr.update.processor.DistributedUpdateProcessor$
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to cloud85:8985 timed out

Of course these log messages are expected. The key is that the leader accepted the update
and now has one doc with ID "doc1".

*> STEP 4: Heal the network partition*

sudo service iptables restart (undoes the DROP rules we added above)

*> STEP 5: Send document with ID “doc2” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' \
  --data-binary '<add><doc><field name="id">doc2</field><field name="foo_s">bar</field></doc></add>'

Of course this time the update gets sent successfully to the replica. Here are some log
messages from the log on cloud84:
2013-12-23 11:00:46,982 [qtp1073932139-18] INFO  update.processor.LogUpdateProcessor  - [cloud_shard1_replica1]
webapp=/solr path=/update params={} {add=[doc2 (1455228921389776896)]} 0 162

from the log on cloud85 (out-of-sync replica):
2013-12-23 10:47:26,363 [main-EventThread] INFO  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 2)


*2013-12-23 11:00:46,979 [qtp2124890785-12] INFO  update.processor.LogUpdateProcessor  - [cloud_shard1_replica2]
webapp=/solr path=/update params={distrib.from=http://cloud84:8984/solr/cloud_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[doc2 (1455228921389776896)]} 0 142*

Notice that there is no logged activity on cloud85 between 10:47 and 11:00.

*> STEP 6: Commit updates*

curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' --data-binary
"<commit waitSearcher=\"true\"/>"

*> STEP 7: Send non-distributed queries to each replica*

curl "http://localhost:8984/solr/cloud/select?q=foo_s:bar&rows=0&wt=json&distrib=false"


curl "http://localhost:8985/solr/cloud/select?q=foo_s:bar&rows=0&wt=json&distrib=false"


Observe that the leader has 2 docs and the replica on cloud85 only has 1, but should have 2.
From what I can tell, the replica that missed some updates because of a temporary network
partition doesn't get any notification that it missed some documents. In other words, the
out-of-sync replica doesn't know it's out-of-sync, and its state in ZooKeeper is active. As
you can see from the log messages I posted in step 3 above, the leader tried to tell the replica
to recover, but due to the network partition, that request got dropped too.

I'm wondering if the leader should send a state version tracking ID along with each update
request so that a replica can detect that its view of the state is stale? I could see the
process working as follows:

1. Shard leader now keeps track of a Slice state version tracking identifier that gets sent
with every update request
2. Leader tries to send an update request (including the state version ID) to a replica and
the send fails
3. Leader updates the state version to a different value
4. Leader sends another update request to the replica; request includes the updated version
ID; replica accepts the request but realizes its state version ID is out-of-date from what
the leader sent
5. Replica enters recovery
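The five steps above can be sketched as follows. This is a hypothetical toy model of the proposal, not actual Solr code; all class and field names (Leader, Replica, state_version, etc.) are made up for illustration:

```python
# Toy model of the proposed state-version tracking: the leader bumps a
# version token whenever a forward to a replica fails, and sends the
# token with every update; a replica whose token lags enters recovery.

class Leader:
    def __init__(self):
        self.state_version = 0  # slice state version tracking ID (step 1)

    def send_update(self, replica, doc):
        try:
            replica.receive(doc, self.state_version)
        except ConnectionError:
            # step 2/3: a failed forward bumps the version so the replica
            # can later detect that it missed something
            self.state_version += 1

class Replica:
    def __init__(self):
        self.seen_version = 0
        self.docs = []
        self.recovering = False
        self.partitioned = False  # simulates the iptables DROP rules

    def receive(self, doc, leader_version):
        if self.partitioned:
            raise ConnectionError("simulated network partition")
        if leader_version != self.seen_version:
            # step 4: version mismatch means updates were missed
            self.recovering = True  # step 5: enter recovery
            self.seen_version = leader_version
        self.docs.append(doc)

leader, replica = Leader(), Replica()
replica.partitioned = True
leader.send_update(replica, "doc1")  # lost; leader bumps version to 1
replica.partitioned = False
leader.send_update(replica, "doc2")  # replica sees 1 != 0 -> recovery
print(replica.recovering)  # True
```

The point of the sketch is that the replica needs no extra callback from the leader: the stale-version check rides along on the next update that does get through.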

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>                 Key: SOLR-4260
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment:
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>         Attachments:,, clusterstate.png
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer
we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small
deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents,
not more.
> Results hopping ranks in the result set for identical queries got my attention; there
were small IDF differences for exactly the same record, causing it to shift positions
in the result set. During those tests no records were indexed. Consecutive catch-all queries
also return different numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor of two and
frequently reindex using a fresh build from trunk. I've not seen this issue for quite some
time until a few days ago.

This message was sent by Atlassian JIRA
