lucene-dev mailing list archives

From "Mark Miller (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3274) ZooKeeper related SolrCloud problems
Date Mon, 26 Mar 2012 14:30:28 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238431#comment-13238431 ]

Mark Miller commented on SOLR-3274:
-----------------------------------

bq. Uhhhh 10 secs is A LOT OF TIME

It really depends - I've seen that timeout exceeded on a heavily loaded machine more than a few times, and then you have to add any network delays on top. But yes, on a fast machine under normal to high load, I have not really run into problems with this timeout.

bq. Then basically my options are to setup a more responsive ZK cluster or maybe raise the
ZK timeout on Solr side.

That's all I can suggest. If the ZooKeeper client loses the connection, it has up to the session timeout to reconnect. If it only manages to reconnect after more than the session timeout has passed, you will get the SessionExpiredException. If that happens, the node will go into recovery, and while it's in recovery it won't serve search requests until recovery is finished - so that could also contribute to the "no servers hosting shard" issue.
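
To make the connection-loss vs. session-expiry distinction concrete, here is a minimal sketch (my illustration using the plain ZooKeeper client API, not Solr's actual code) of a Watcher reacting to the two states:

{code}
// Minimal sketch (assumption: plain ZooKeeper client API, not Solr's code) of
// how a client tells a dropped connection apart from an expired session.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class SessionWatcher implements Watcher {

  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case Disconnected:
        // Connection lost. The client library retries on its own; if it gets
        // back within the session timeout, the session (including ephemeral
        // nodes and watches) survives.
        break;
      case Expired:
        // More than the session timeout passed before reconnecting: the
        // session is gone, operations on it fail with SessionExpiredException,
        // and a new ZooKeeper instance has to be created. This is the point
        // where a Solr node would go into recovery.
        break;
      case SyncConnected:
        // Connected, or reconnected within the session timeout - nothing to do.
        break;
      default:
        break;
    }
  }
}
{code}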

Let me know how it goes and if you can pinpoint any problems.
                
> ZooKeeper related SolrCloud problems
> ------------------------------------
>
>                 Key: SOLR-3274
>                 URL: https://issues.apache.org/jira/browse/SOLR-3274
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Any
>            Reporter: Per Steffensen
>            Assignee: Mark Miller
>
> Same setup as in SOLR-3273. Well, to tell the whole truth, we have 7 Solr servers running 28 slices of the same collection (collA) - every slice has one replica (two shards per slice - leader + replica) - 56 cores in total (8 shards on each Solr instance). But anyway...
> Besides the problem reported in SOLR-3273, the system seems to run fine under high load for several hours, but eventually errors like the ones shown below start to occur. I might be wrong, but they all seem to indicate some kind of instability in the collaboration between Solr and ZooKeeper. I have to say that I haven't been there to check ZooKeeper "at the moment those exceptions occur", but basically I don't believe the exceptions occur because ZooKeeper is running unstably - at least when I check ZooKeeper through other "channels" (e.g. my Eclipse ZK plugin), it always accepts my connection and generally seems to be doing fine.
> Exception 1) Often the first error we see in solr.log is something like this
> {code}
> Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:678)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:250)
>         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
>         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:80)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:407)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:326)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> {code}
> I believe this error basically occurs because SolrZkClient.isConnected reports false, which means that its internal "keeper.getState" does not return ZooKeeper.States.CONNECTED. I'm pretty sure it had been CONNECTED for a long time, since this error only starts occurring after several hours of processing without the problem showing. But why is it suddenly not connected anymore?!
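> For reference, a rough sketch (my assumption of what such a check looks like, not copied from the Solr source) of the kind of state test I mean:
> {code}
> // Assumed shape of the connectivity check; the real SolrZkClient may differ.
> boolean isConnected(org.apache.zookeeper.ZooKeeper keeper) {
>     return keeper.getState() == org.apache.zookeeper.ZooKeeper.States.CONNECTED;
> }
> {code}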
> Exception 2) We also see errors like the following, and if I'm not mistaken, they start occurring shortly after "Exception 1)" (above) shows up for the first time
> {code}
> Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard: 
>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
>         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> Please note that the exception says "no servers hosting shard: <blank>". Looking at the code, a "shard" string was actually supposed to be written at <blank>. Basically this means that HttpShardHandler.submit was called with an empty "shard" string parameter. But who does this? CoreAdminHandler.handleDistribUrlAction or SearchHandler.handleRequestBody or SyncStrategy or PeerSync or... I don't know, and maybe it is not that relevant, because I guess they all get the "shard" string from ZooKeeper. Again, something pointing in the direction of unstable collaboration between Solr and ZooKeeper.
> Exception 3) We also see exceptions like this
> {code}
> Mar 25, 2012 3:05:38 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
> Mar 25, 2012 3:05:38 PM org.apache.solr.cloud.LeaderElector$1 process
> WARNING: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collA/leader_elect/slice26/election
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
>         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:266)
>         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:263)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>         at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:263)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:92)
>         at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
>         at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> {code}
> Maybe this will be usable for some bug-fixing or for making the code more stable. I know 4.0 is not stable/released yet, and that we therefore should expect this kind of error at the moment. So this is not negative criticism - just a report of issues observed when using SolrCloud features under high load for several days. Any feedback is more than welcome.
> Regards, Per Steffensen
