lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Timothy Potter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
Date Mon, 06 Apr 2015 21:00:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481921#comment-14481921
] 

Timothy Potter commented on SOLR-7338:
--------------------------------------

good catch ... going to try to get the failure to reproduce locally now


> A reloaded core will never register itself as active after a ZK session expiration
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-7338
>                 URL: https://issues.apache.org/jira/browse/SOLR-7338
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Assignee: Mark Miller
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7338.patch, SOLR-7338_test.patch
>
>
> If a collection gets reloaded, then a core's isReloaded flag is always true. If a core
experiences a ZK session expiration after a reload, then it won't ever be able to set itself
to active because of the check in {{ZkController#register}}:
> {code}
>         UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
>         if (!core.isReloaded() && ulog != null) {
>           // disable recovery in case shard is in construction state (for shard splits)
>           Slice slice = getClusterState().getSlice(collection, shardId);
>           if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
>             Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
>             if (recoveryFuture != null) {
>               log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This
can take a while.");
>               recoveryFuture.get(); // NOTE: this could potentially block for
>               // minutes or more!
>               // TODO: public as recovering in the mean time?
>               // TODO: in the future we could do peersync in parallel with recoverFromLog
>             } else {
>               log.info("No LogReplay needed for core=" + core.getName() + " baseURL="
+ baseUrl);
>             }
>           }
>           boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader,
cloudDesc,
>               collection, coreZkNodeName, shardId, leaderProps, core, cc);
>           if (!didRecovery) {
>             publish(desc, ZkStateReader.ACTIVE);
>           }
>         }
> {code}
> I can easily simulate this on trunk by doing:
> {code}
> bin/solr -c -z localhost:2181
> bin/solr create -c foo
> bin/post -c foo example/exampledocs/*.xml
> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
> kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
> {code}
> Where <PID> is the process ID of the Solr node. Here are the logs after the CONT
command. As you can see below, the core never gets to setting itself as active again. I think
the bug is that the isReloaded flag needs to get set back to false once the reload is successful,
but I don't understand what this flag is needed for anyway???
> {code}
> INFO  - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient
has disconnected
> INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:Expired type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our
previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with
ZooKeeper...
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000)
closing
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node
got unwatched for /configs/foo
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
Overseer Loop exiting : 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According
to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a leader.
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy;
Connection expired - starting a new one...
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.110; org.apache.solr.common.cloud.ConnectionManager; Waiting
for client to connect to ZooKeeper
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread
had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading
cluster properties
> 	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
> 	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
> 	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
> 	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
> 	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
> 	... 3 more
> WARN  - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; listener throws
error
> org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/foo/params.json
> 	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
> 	at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
> 	at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
> 	at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
= Session expired for /configs/foo/params.json
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> 	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
> 	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
> 	... 3 more
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /overseer_elect/leader
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> 	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
> INFO  - 2015-04-01 17:28:51.114; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager; Client
is connected to ZooKeeper
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager$1; Connection
with ZooKeeper reestablished.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; ZooKeeper session
re-connected ... refreshing core states after session expiration.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; publishing core=foo_shard1_replica1
state=down collection=foo
> INFO  - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; canceling election
/overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> WARN  - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; cancelElection
did not find election node to remove /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> INFO  - 2015-04-01 17:28:51.121; org.apache.solr.cloud.OverseerElectionContext; I am
going to be the leader 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; makePath:
/overseer_elect/leader
> INFO  - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001)
starting
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerAutoReplicaFailoverThread;
Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 autoReplicaFailoverWaitAfterExpiration=30000
autoReplicaFailoverBadNodeExpiration=60000
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerCollectionProcessor; Process
current queue of collection creations
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; Updating
cluster state from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.129; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
Starting to work on the main queue
> INFO  - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; addZkWatch
foo
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating
collection state at /collections/foo/state.json from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register node as
live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; makePath:
/live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register replica
- core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr collection:foo shard:shard1
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; canceling election
/collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> WARN  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; cancelElection
did not find election node to remove /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> INFO  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr",
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"} current state version: 1
> INFO  - 2015-04-01 17:28:51.135; org.apache.solr.cloud.overseer.ReplicaMutator; Update
state numShards=1 message={
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr",
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"}
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ShardLeaderElectionContext; Running
the leader process for shard shard1
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 4
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The last leader
attempt started 198060ms ago.
> INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 5 
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; Enough
replicas found to continue.
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; I
may be the new leader - try and sync
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync replicas to
http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync Success - now
sync replicas to me
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; http://192.168.1.2:8983/solr/foo_shard1_replica1/
has no replicas
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.ShardLeaderElectionContext; I
am the new leader: http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; makePath:
/collections/foo/leaders/shard1
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo"} current state version: 1
> INFO  - 2015-04-01 17:28:51.140; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 5
> INFO  - 2015-04-01 17:28:51.141; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.142; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 6 
> INFO  - 2015-04-01 17:28:51.144; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo",
>   "base_url":"http://192.168.1.2:8983/solr",
>   "core":"foo_shard1_replica1",
>   "state":"active"} current state version: 1
> INFO  - 2015-04-01 17:28:51.145; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 6
> INFO  - 2015-04-01 17:28:51.145; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.147; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are http://192.168.1.2:8983/solr/foo_shard1_replica1/
and leader is http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; 
>  In register, core.isReloaded? true 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch zkdir /configs/foo
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; Creating
ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Current
schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Creating
ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current version of
requestparams : 0
> INFO  - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; Current
schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.201; org.apache.solr.common.cloud.DefaultConnectionStrategy;
Reconnected to ZooKeeper
> INFO  - 2015-04-01 17:28:51.202; org.apache.solr.common.cloud.ConnectionManager; Connected:true
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$WatcherImpl; A node
got unwatched for /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current version of
requestparams : 0
> INFO  - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; [admin]
webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1427909347078} status=0
QTime=0 
> INFO  - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; [admin]
webapp=null path=/admin/info/system params={wt=json&_=1427909347099} status=0 QTime=11

> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message