lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jessica Cheng (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6095) SolrCloud cluster can end up without an overseer
Date Mon, 16 Jun 2014 19:33:01 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032814#comment-14032814
] 

Jessica Cheng edited comment on SOLR-6095 at 6/16/14 7:31 PM:
--------------------------------------------------------------

Sorry that wasn't true. You were comparing election path, not the leader node. However, this
still possibly doesn't work because sortSeqs extracts just the sequence number (n_0000000001)
out of the entire node string and sorts based on that, so in fact the sort order of nodeB
and nodeD might not be deterministic in different JVM (calls to zk.getChildren does not guarantee
return list ordering, so even though the Collections.sort is stable, the original list is
not), which makes this new if statement also non-deterministic.


was (Author: mewmewball):
Sorry that wasn't true. You were comparing election path, not the leader node. However, this
still possibly doesn't work because sortSeqs extracts just the sequence number (n_0000000001)
out of the entire node string and sorts based on that, so in fact the sort order of nodeB
and nodeD might not be deterministic in different JVM, which makes this new if statement also
non-deterministic.

> SolrCloud cluster can end up without an overseer
> ------------------------------------------------
>
>                 Key: SOLR-6095
>                 URL: https://issues.apache.org/jira/browse/SOLR-6095
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.8
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>             Fix For: 4.9, 5.0
>
>         Attachments: SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch
>
>
> We have a large cluster running on ec2 which occasionally ends up without an overseer
after a rolling restart. We always restart our overseer nodes at the very last otherwise we
end up with a large number of shards that can't recover properly.
> This cluster is running a custom branch forked from 4.8 and has SOLR-5473, SOLR-5495
and SOLR-5468 applied. We have a large number of small collections (120 collections each with
approx 5M docs) on 16 Solr nodes. We are also using the overseer roles feature to designate
two specified nodes as overseers. However, I think the problem that we're seeing is not specific
to the overseer roles feature.
> As soon as the overseer was shutdown, we saw the following on the node which was next
in line to become the overseer:
> {code}
> 2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  - I am going
to be the leader ec2-xxxxxxxxxx.compute-1.amazonaws.com:8987_solr
> 2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
for /overseer_elect/leader
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
> 	at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> 	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
> 	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
> 	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
> 	at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
> 	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
> 	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
> 	at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> 	at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}
> When the overseer leader node is gracefully shutdown, we get the following in the logs:
> {code}
> 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in Overseer
main queue loop
> org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
> 	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
> 	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
> 	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> 	at java.lang.Object.wait(Native Method)
> 	at java.lang.Object.wait(Object.java:503)
> 	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
> 	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> 	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
> 	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
> 	... 4 more
> 2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer Loop exiting
: ec2-xxxxxxxxxx.compute-1.amazonaws.com:8986_solr
> 2014-05-20 09:55:39,256 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper
watch triggered, but Solr cannot talk to ZK
> 2014-05-20 09:55:39,259 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped
o.e.j.w.WebAppContext{/solr,file:/vol0/cloud86/solr-webapp/webapp/},/vol0/cloud86/webapps/solr.war
> {code}
> Notice how the overseer kept on running almost till the last point i.e. until the jetty
context stopped. On some runs, we got the following on the overseer leader node on graceful
shutdown:
> {code}
> 2014-05-19 21:33:43,657 [Thread-75] ERROR solr.cloud.Overseer  - Exception in Overseer
main queue loop
> org.apache.solr.common.SolrException: Could not load collection from ZK:sm71
> 	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
> 	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
> 	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> 	at java.lang.Object.wait(Native Method)
> 	at java.lang.Object.wait(Object.java:503)
> 	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> 	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
> 	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:769)
> 	... 4 more
> 2014-05-19 21:33:43,662 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper
watch triggered, but Solr cannot talk to ZK
> 2014-05-19 21:33:43,663 [Thread-75] INFO  solr.cloud.Overseer  - Overseer Loop exiting
: ec2-xxxxxxxxxxxx.compute-1.amazonaws.com:8987_solr
> 2014-05-19 21:33:43,664 [OverseerExitThread] ERROR solr.cloud.Overseer  - could not read
the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /overseer_elect/leader
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> 	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:329)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:85)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:293)
> 2014-05-19 21:33:43,665 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped
o.e.j.w.WebAppContext{/solr,file:/vol0/cloud87/solr-webapp/webapp/},/vol0/cloud87/webapps/solr.war
> {code}
> Again the overseer was clinging on till the last moment and by the time it exited the
ZK session expired and it couldn't delete the /overseer_elect/leader node. The exception on
the next-in-line node was the same i.e. NodeExists for /overseer_elect/leader.
> In both cases, we are left with no overseers after restart. I can easily reproduce this
problem by just restarting overseer leader nodes repeatedly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message