lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7338) A reloaded core will never register itself as active after a ZK session expiration
Date Thu, 02 Apr 2015 14:33:55 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392763#comment-14392763
] 

Mark Miller commented on SOLR-7338:
-----------------------------------

bq. "isReloaded" feels more like "isReloading"

What's the gain, what's the point, what's the alternative?

I don't get that at all. It tells you if the core has been reloaded. This is often useful
in things that happen on creating a new SolrCore.

Who cares about isReloading? I'm lost.

Is it just too difficult to understand what isReloaded means?

I'd be more confused with this temporary isReloading call - seems so easy for that to be tricky.
isReloaded is so permanent and easy to understand. The core has been reloaded or it hasn't.
How the heck does trying to track exactly when the core is actually in the process of reload
or not more useful?

Anyone else?

> A reloaded core will never register itself as active after a ZK session expiration
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-7338
>                 URL: https://issues.apache.org/jira/browse/SOLR-7338
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Assignee: Mark Miller
>         Attachments: SOLR-7338.patch, SOLR-7338_test.patch
>
>
> If a collection gets reloaded, then a core's isReloaded flag is always true. If a core
experiences a ZK session expiration after a reload, then it won't ever be able to set itself
to active because of the check in {{ZkController#register}}:
> {code}
>         UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
>         if (!core.isReloaded() && ulog != null) {
>           // disable recovery in case shard is in construction state (for shard splits)
>           Slice slice = getClusterState().getSlice(collection, shardId);
>           if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
>             Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
>             if (recoveryFuture != null) {
>               log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This
can take a while.");
>               recoveryFuture.get(); // NOTE: this could potentially block for
>               // minutes or more!
>               // TODO: public as recovering in the mean time?
>               // TODO: in the future we could do peersync in parallel with recoverFromLog
>             } else {
>               log.info("No LogReplay needed for core=" + core.getName() + " baseURL="
+ baseUrl);
>             }
>           }
>           boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader,
cloudDesc,
>               collection, coreZkNodeName, shardId, leaderProps, core, cc);
>           if (!didRecovery) {
>             publish(desc, ZkStateReader.ACTIVE);
>           }
>         }
> {code}
> I can easily simulate this on trunk by doing:
> {code}
> bin/solr -c -z localhost:2181
> bin/solr create -c foo
> bin/post -c foo example/exampledocs/*.xml
> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
> kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
> {code}
> Where <PID> is the process ID of the Solr node. Here are the logs after the CONT
command. As you can see below, the core never gets to setting itself as active again. I think
the bug is that the isReloaded flag needs to get set back to false once the reload is successful,
but I don't understand what this flag is needed for anyway???
> {code}
> INFO  - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient
has disconnected
> INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:Expired type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our
previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with
ZooKeeper...
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000)
closing
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node
got unwatched for /configs/foo
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
Overseer Loop exiting : 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According
to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a leader.
> INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy;
Connection expired - starting a new one...
> INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.110; org.apache.solr.common.cloud.ConnectionManager; Waiting
for client to connect to ZooKeeper
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread
had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading
cluster properties
> 	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
> 	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
> 	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
> 	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
> 	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
> 	... 3 more
> WARN  - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; listener throws
error
> org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/foo/params.json
> 	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
> 	at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
> 	at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
> 	at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
= Session expired for /configs/foo/params.json
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> 	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
> 	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
> 	... 3 more
> ERROR - 2015-04-01 17:28:51.110; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /overseer_elect/leader
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> 	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
> INFO  - 2015-04-01 17:28:51.114; org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181
got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager; Client
is connected to ZooKeeper
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager$1; Connection
with ZooKeeper reestablished.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; ZooKeeper session
re-connected ... refreshing core states after session expiration.
> INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; publishing core=foo_shard1_replica1
state=down collection=foo
> INFO  - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; canceling election
/overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> WARN  - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; cancelElection
did not find election node to remove /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
> INFO  - 2015-04-01 17:28:51.121; org.apache.solr.cloud.OverseerElectionContext; I am
going to be the leader 192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; makePath:
/overseer_elect/leader
> INFO  - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001)
starting
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerAutoReplicaFailoverThread;
Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 autoReplicaFailoverWaitAfterExpiration=30000
autoReplicaFailoverBadNodeExpiration=60000
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerCollectionProcessor; Process
current queue of collection creations
> INFO  - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; Updating
cluster state from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.129; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
Starting to work on the main queue
> INFO  - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; addZkWatch
foo
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating
collection state at /collections/foo/state.json from ZooKeeper... 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register node as
live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; makePath:
/live_nodes/192.168.1.2:8983_solr
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register replica
- core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr collection:foo shard:shard1
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 4 
> INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; canceling election
/collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> WARN  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; cancelElection
did not find election node to remove /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
> INFO  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr",
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"} current state version: 1
> INFO  - 2015-04-01 17:28:51.135; org.apache.solr.cloud.overseer.ReplicaMutator; Update
state numShards=1 message={
>   "core":"foo_shard1_replica1",
>   "core_node_name":"core_node1",
>   "roles":null,
>   "base_url":"http://192.168.1.2:8983/solr",
>   "node_name":"192.168.1.2:8983_solr",
>   "numShards":"1",
>   "state":"down",
>   "shard":"shard1",
>   "collection":"foo",
>   "operation":"state"}
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ShardLeaderElectionContext; Running
the leader process for shard shard1
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 4
> INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The last leader
attempt started 198060ms ago.
> INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 5 
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; Enough
replicas found to continue.
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; I
may be the new leader - try and sync
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync replicas to
http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync Success - now
sync replicas to me
> INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; http://192.168.1.2:8983/solr/foo_shard1_replica1/
has no replicas
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.ShardLeaderElectionContext; I
am the new leader: http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; makePath:
/collections/foo/leaders/shard1
> INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo"} current state version: 1
> INFO  - 2015-04-01 17:28:51.140; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 5
> INFO  - 2015-04-01 17:28:51.141; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.142; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 6 
> INFO  - 2015-04-01 17:28:51.144; org.apache.solr.cloud.Overseer$ClusterStateUpdater;
processMessage: queueSize: 1, message = {
>   "operation":"leader",
>   "shard":"shard1",
>   "collection":"foo",
>   "base_url":"http://192.168.1.2:8983/solr",
>   "core":"foo_shard1_replica1",
>   "state":"active"} current state version: 1
> INFO  - 2015-04-01 17:28:51.145; org.apache.solr.cloud.overseer.ZkStateWriter; going
to update_collection /collections/foo/state.json version: 6
> INFO  - 2015-04-01 17:28:51.145; org.apache.solr.common.cloud.ZkStateReader$7; A cluster
state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json
for collection foo has occurred - updating... (live nodes size: 1)
> INFO  - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.147; org.apache.solr.cloud.DistributedQueue$LatchWatcher;
NodeChildrenChanged fired on path /overseer/queue state SyncConnected
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are http://192.168.1.2:8983/solr/foo_shard1_replica1/
and leader is http://192.168.1.2:8983/solr/foo_shard1_replica1/
> INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; 
>  In register, core.isReloaded? true 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; Updating
data for foo to ver 7 
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch zkdir /configs/foo
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; Creating
ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Current
schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Creating
ZooKeeper watch for the managed schema at /configs/foo/managed-schema
> INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current version of
requestparams : 0
> INFO  - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; Current
schema version 5 is already the latest
> INFO  - 2015-04-01 17:28:51.201; org.apache.solr.common.cloud.DefaultConnectionStrategy;
Reconnected to ZooKeeper
> INFO  - 2015-04-01 17:28:51.202; org.apache.solr.common.cloud.ConnectionManager; Connected:true
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$WatcherImpl; A node
got unwatched for /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; Running listeners
for /configs/foo
> INFO  - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config update listener
called for core foo_shard1_replica1
> INFO  - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current version of
requestparams : 0
> INFO  - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; [admin]
webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1427909347078} status=0
QTime=0 
> INFO  - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; [admin]
webapp=null path=/admin/info/system params={wt=json&_=1427909347099} status=0 QTime=11

> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message