lucene-dev mailing list archives

From "Mark Miller (Commented) (JIRA)" <>
Subject [jira] [Commented] (SOLR-3274) ZooKeeper related SolrCloud problems
Date Thu, 29 Mar 2012 12:44:26 GMT


Mark Miller commented on SOLR-3274:

If you don't solve the issue of the ZooKeeper session expirations, this is no real surprise. The larger the
index gets, the longer the recoveries can take - until you end up in a situation similar to the one
you had. The key is understanding why the connection to ZooKeeper is dropping.
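
One way to start narrowing down why a session expires is to check whether the Solr JVM itself stalls (for example during a long garbage collection pause) for a time comparable to the ZooKeeper session timeout, since a stalled client cannot send heartbeats. The following is only a minimal sketch of that idea, not anything taken from Solr or from this issue: the class name `PauseDetector`, the 15 second timeout, and the "one third of the timeout" threshold are all assumptions chosen for illustration.

```java
/** Hypothetical stall detector: measures how much longer a sleep took than requested. */
final class PauseDetector {

    /** Returns how far the elapsed time exceeded the expected sleep, in ms (never negative). */
    static long overshootMillis(long startNanos, long endNanos, long expectedSleepMillis) {
        long elapsedMillis = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMillis - expectedSleepMillis);
    }

    /** Assumed rule of thumb: flag stalls of at least one third of the session timeout. */
    static boolean endangersSession(long overshootMillis, long sessionTimeoutMillis) {
        return overshootMillis * 3 >= sessionTimeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long sleepMillis = 100L;
        long sessionTimeoutMillis = 15_000L; // assumed value, use your configured timeout
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long overshoot = overshootMillis(start, System.nanoTime(), sleepMillis);
            if (endangersSession(overshoot, sessionTimeoutMillis)) {
                System.out.println("Stall of " + overshoot + " ms detected");
            }
        }
    }
}
```

If such a loop, run inside the same JVM, reports stalls at the times the expirations show up in solr.log, the process pausing is a likely suspect; if it stays quiet, the network or the ZooKeeper servers deserve a closer look.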
> ZooKeeper related SolrCloud problems
> ------------------------------------
>                 Key: SOLR-3274
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Any
>            Reporter: Per Steffensen
>            Assignee: Mark Miller
> Same setup as in SOLR-3273. Well if I have to tell the entire truth we have 7 Solr servers,
running 28 slices of the same collection (collA) - all slices have one replica (two shards
all in all - leader + replica) - 56 cores all in all (8 shards on each solr instance). But
> Besides the problem reported in SOLR-3273, the system seems to run fine under high load
for several hours, but eventually errors like the ones shown below start to occur. I might
be wrong, but they all seem to indicate some kind of instability in the collaboration between
Solr and ZooKeeper. I have to say that I haven't been there to check ZooKeeper "at the moment
where those exceptions occur", but basically I don't believe the exceptions occur because ZooKeeper
is not running stably - at least when I go and check ZooKeeper through other "channels" (e.g.
my Eclipse ZK plugin) it is always accepting my connection and generally seems to be doing
> Exception 1) Often the first error we see in solr.log is something like this
> {code}
> Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(
>         at org.apache.solr.handler.XMLLoader.processUpdate(
>         at org.apache.solr.handler.XMLLoader.load(
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(
>         at org.apache.solr.core.SolrCore.execute(
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(
>         at org.mortbay.jetty.servlet.ServletHandler.handle(
>         at
>         at org.mortbay.jetty.servlet.SessionHandler.handle(
>         at org.mortbay.jetty.handler.ContextHandler.handle(
>         at org.mortbay.jetty.webapp.WebAppContext.handle(
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(
>         at org.mortbay.jetty.handler.HandlerCollection.handle(
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(
>         at org.mortbay.jetty.Server.handle(
>         at org.mortbay.jetty.HttpConnection.handleRequest(
>         at org.mortbay.jetty.HttpConnection$RequestHandler.content(
>         at org.mortbay.jetty.HttpParser.parseNext(
>         at org.mortbay.jetty.HttpParser.parseAvailable(
>         at org.mortbay.jetty.HttpConnection.handle(
>         at$
>         at org.mortbay.thread.QueuedThreadPool$
> {code}
> I believe this error basically occurs because SolrZkClient.isConnected reports false,
which means that its internal "keeper.getState" does not return ZooKeeper.States.CONNECTED.
I'm pretty sure that it has been CONNECTED for a long time, since this error starts occurring
after several hours of processing without this problem showing. But why is it suddenly not
connected anymore?!
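
The behavior described above (updates are rejected whenever the client is in any state other than CONNECTED) can be modeled roughly as follows. This is only an illustrative sketch: `ZkGateSketch` and its simplified `State` enum are hypothetical stand-ins, not the real `SolrZkClient` or `ZooKeeper.States` classes, and the actual Solr code may differ.

```java
/** Hypothetical model of a "connected or reject" gate in front of updates. */
final class ZkGateSketch {

    /** Simplified stand-in for the client's connection states (not the real ZooKeeper enum). */
    enum State { CONNECTING, CONNECTED, CLOSED }

    // volatile so a state change made by an event thread is visible to request threads
    private volatile State state = State.CONNECTING;

    void setState(State newState) {
        state = newState;
    }

    boolean isConnected() {
        return state == State.CONNECTED;
    }

    /** Throws unless the current state is CONNECTED, mirroring the behavior described above. */
    void checkBeforeUpdate() {
        if (!isConnected()) {
            throw new IllegalStateException("Cannot talk to ZooKeeper (state=" + state + ")");
        }
    }

    public static void main(String[] args) {
        ZkGateSketch gate = new ZkGateSketch();
        gate.setState(State.CONNECTED);
        gate.checkBeforeUpdate();            // passes while connected
        gate.setState(State.CONNECTING);     // e.g. after a dropped connection
        try {
            gate.checkBeforeUpdate();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, a client that was connected for hours starts rejecting updates the moment its state leaves CONNECTED, which matches the sudden onset reported here without saying anything about why the state changed.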
> Exception 2) We also see errors like the following, and if I'm not mistaken, they start
occurring shortly after "Exception 1)" (above) shows for the first time
> {code}
> Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard: 
>         at org.apache.solr.handler.component.HttpShardHandler$
>         at org.apache.solr.handler.component.HttpShardHandler$
>         at java.util.concurrent.FutureTask$Sync.innerRun(
>         at
>         at java.util.concurrent.Executors$
>         at java.util.concurrent.FutureTask$Sync.innerRun(
>         at
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at
> {code}
> Please note that the exception says "no servers hosting shard: <blank>". Looking
at the code a "shard"-string was actually supposed to be written at <blank>.  Basically
this means that HttpShardHandler.submit was called with an empty "shard"-string parameter.
But who does this? CoreAdminHandler.handleDistribUrlAction or SearchHandler.handleRequestBody
or SyncStrategy or PeerSync or... I don't know, and maybe it is not that relevant, because
I guess they all get the "shard"-string from ZooKeeper. Again something pointing in the direction
of unstable collaboration between Solr and ZooKeeper.
> Exception 3) We also see exceptions like this
> {code}
> Mar 25, 2012 3:05:38 PM$3 process
> WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
> Mar 25, 2012 3:05:38 PM$1 process
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session
expired for /collections/collA/leader_elect/slice26/election
>         at org.apache.zookeeper.KeeperException.create(
>         at org.apache.zookeeper.KeeperException.create(
>         at org.apache.zookeeper.ZooKeeper.getChildren(
>         at$6.execute(
>         at$6.execute(
>         at
>         at
>         at
>         at$000(
>         at$1.process(
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(
>         at org.apache.zookeeper.ClientCnxn$
> {code}
> Maybe this will be usable for some bug-fixing or for making the code more stable. I know
4.0 is not stable/released yet, and that we therefore should expect this kind of error at
the moment. So this is not negative criticism - just reporting of issues observed when using
SolrCloud features under high load for several days. Any feedback is more than welcome.
> Regards, Per Steffensen
