lucene-dev mailing list archives

From "Timothy Potter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
Date Tue, 24 Dec 2013 00:37:51 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856052#comment-13856052 ]

Timothy Potter commented on SOLR-4260:
--------------------------------------

Thanks Mark, I suspected my test case was a little cherry-picked ... something interesting
happened when I also severed the connection between the replica and ZK (i.e. the same test
as above, but I also dropped the ZK connection on the replica).
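
For anyone wanting to reproduce the session-expiry part without pulling a network cable, here is
a rough sketch (illustration only, not the mechanism I used for this run) that forces an expiry
through the plain ZooKeeper client API by attaching a second connection to the victim's session
and then closing it. The connect string and the handle to the live client are placeholders.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch: expire an existing ZooKeeper session by opening a second
 * connection with the same session id/password and closing it. The
 * original client then sees the "state:Expired" WatchedEvent that
 * shows up in the log below.
 */
public class ExpireZkSession {

  public static void expire(ZooKeeper victim, String connectString) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    Watcher watcher = new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    };
    // Attach to the victim's session ...
    ZooKeeper zk = new ZooKeeper(connectString, 10000, watcher,
        victim.getSessionId(), victim.getSessionPasswd());
    connected.await(10, TimeUnit.SECONDS);
    // ... and close it out from under the original client.
    zk.close();
  }
}

In the run below I simply dropped the replica's ZK connection and let the session time out,
which is why there is a Disconnected event at 15:39:57 followed by the Expired event at 15:40:45.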

2013-12-23 15:39:57,170 [main-EventThread] INFO  common.cloud.ConnectionManager  - Watcher
org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2013-12-23 15:39:57,170 [main-EventThread] INFO  common.cloud.ConnectionManager  - zkClient
has disconnected

>>> fixed the connection between replica and ZK here <<<

2013-12-23 15:40:45,579 [main-EventThread] INFO  common.cloud.ConnectionManager  - Watcher
org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:Expired type:None path:null path:null type:None
2013-12-23 15:40:45,579 [main-EventThread] INFO  common.cloud.ConnectionManager  - Our previous
ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
2013-12-23 15:40:45,580 [main-EventThread] INFO  common.cloud.DefaultConnectionStrategy  -
Connection expired - starting a new one...
2013-12-23 15:40:45,586 [main-EventThread] INFO  common.cloud.ConnectionManager  - Waiting
for client to connect to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager  - Watcher
org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181
got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager  - Client
is connected to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager  - Connection
with ZooKeeper reestablished.
2013-12-23 15:40:45,596 [main-EventThread] WARN  solr.cloud.RecoveryStrategy  - Stopping recovery
for zkNodeName=core_node3core=cloud_shard1_replica3
2013-12-23 15:40:45,597 [main-EventThread] INFO  solr.cloud.ZkController  - publishing core=cloud_shard1_replica3
state=down
2013-12-23 15:40:45,597 [main-EventThread] INFO  solr.cloud.ZkController  - numShards not
found on descriptor - reading it from system property
2013-12-23 15:40:45,905 [qtp2124890785-14] INFO  handler.admin.CoreAdminHandler  - It has
been requested that we recover
2013-12-23 15:40:45,906 [qtp2124890785-14] INFO  solr.servlet.SolrDispatchFilter  - [admin]
webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2}
status=0 QTime=2 
2013-12-23 15:40:45,909 [Thread-17] INFO  solr.cloud.ZkController  - publishing core=cloud_shard1_replica3
state=recovering
2013-12-23 15:40:45,909 [Thread-17] INFO  solr.cloud.ZkController  - numShards not found on
descriptor - reading it from system property
2013-12-23 15:40:45,920 [Thread-17] INFO  solr.update.DefaultSolrCoreState  - Running recovery
- first canceling any ongoing recovery
2013-12-23 15:40:45,921 [RecoveryThread] INFO  solr.cloud.RecoveryStrategy  - Starting recovery
process.  core=cloud_shard1_replica3 recoveringAfterStartup=false
2013-12-23 15:40:45,924 [RecoveryThread] INFO  solr.cloud.ZkController  - publishing core=cloud_shard1_replica3
state=recovering
2013-12-23 15:40:45,924 [RecoveryThread] INFO  solr.cloud.ZkController  - numShards not found
on descriptor - reading it from system property
2013-12-23 15:40:48,613 [qtp2124890785-15] INFO  solr.core.SolrCore  - [cloud_shard1_replica3]
webapp=/solr path=/select params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0
status=0 QTime=1 
2013-12-23 15:42:42,770 [qtp2124890785-13] INFO  solr.core.SolrCore  - [cloud_shard1_replica3]
webapp=/solr path=/select params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0
status=0 QTime=1 
2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController  - There was a problem
making a request to the leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state down for cloud86:8986_solr but I still do not see the requested
state. I see state: recovering live:false
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
	at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
	at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:347)
	at org.apache.solr.cloud.ZkController.access$100(ZkController.java:85)
	at org.apache.solr.cloud.ZkController$1.command(ZkController.java:225)
	at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:118)
	at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
	at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:93)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2013-12-23 15:42:45,963 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy  - Error while
trying to recover. core=cloud_shard1_replica3:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for cloud86:8986_solr but I still do not see the requested
state. I see state: recovering live:false
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
	at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:224)
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)

2013-12-23 15:42:45,964 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy  - Recovery failed
- trying again... (0) core=cloud_shard1_replica3
2013-12-23 15:42:45,964 [RecoveryThread] INFO  solr.cloud.RecoveryStrategy  - Wait 2.0 seconds
before trying to recover again (1)
2013-12-23 15:42:47,964 [RecoveryThread] INFO  solr.cloud.ZkController  - publishing core=cloud_shard1_replica3
state=recovering
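
For completeness, the two /select lines with distrib=false above are the per-core count checks
against the replica. A rough SolrJ sketch of that kind of check follows (the core URLs are
placeholders for my test nodes, HttpSolrServer is the 4.x client, and this is not necessarily
how the queries above were issued):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

/**
 * Sketch: compare counts between a leader core and a replica core by
 * querying each one directly with distrib=false, mirroring the
 * /select?q=foo_s:bar&distrib=false&rows=0 requests in the log above.
 */
public class CompareReplicaCounts {

  static long count(String coreUrl, String q) throws SolrServerException {
    HttpSolrServer server = new HttpSolrServer(coreUrl);
    try {
      SolrQuery query = new SolrQuery(q);
      query.set("distrib", false); // ask only this core, not the whole collection
      query.setRows(0);            // we only need numFound
      return server.query(query).getResults().getNumFound();
    } finally {
      server.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Placeholder URLs - point these at the actual leader and replica cores.
    String leaderUrl  = "http://cloud84:8984/solr/cloud_shard1_replica1";
    String replicaUrl = "http://cloud86:8986/solr/cloud_shard1_replica3";
    long leader  = count(leaderUrl, "foo_s:bar");
    long replica = count(replicaUrl, "foo_s:bar");
    System.out.println("leader=" + leader + " replica=" + replica
        + (leader == replica ? " (consistent)" : " (MISMATCH)"));
  }
}

A mismatch here with no indexing in flight is exactly the inconsistency this issue describes.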

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer
we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents; the leader and replica differ by roughly 10-20 documents, not more.
> What first got my attention was results hopping ranks in the result set for identical queries: there were small IDF differences for exactly the same record, causing it to shift positions in the result set. No records were indexed during those tests. Consecutive catch-all queries also return a different numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and we frequently reindex using a fresh build from trunk. I hadn't seen this issue for quite some time until a few days ago.




