lucene-solr-user mailing list archives

From: John Bickerstaff <j...@johnbickerstaff.com>
Subject: Failure to load shards
Date: Fri, 09 Jun 2017 18:03:51 GMT
Hi all,

Here's my situation...

We're running Solr with ZooKeeper in AWS.

When trying to spin up additional Solr boxes from an AWS auto scaling group,
I get the failure below.

The code used is exactly the same code that successfully spun up the first
3 or 4 Solr boxes in each auto scaling group.

Below is a copy of my email to some of my compatriots within the company
who also use Solr/ZooKeeper.

I'm looking for any advice on what _might_ be the cause of this failure.
Our best guess is some kind of overload on ZooKeeper.

I know this isn't a ZooKeeper forum -- just hoping someone out there has
experience troubleshooting similar issues.

Many thanks in advance...

=====

We have 6 ZooKeeper nodes (3 of them are observers).

They are not behind a load balancer.

How do I check whether the ZooKeeper nodes are under heavy load?
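
One way we know of is ZooKeeper's four-letter-word commands (e.g. mntr),
which report per-node stats such as zk_avg_latency and
zk_outstanding_requests on the client port. Below is a minimal sketch of
polling one node from Java -- the host is a placeholder, and newer
ZooKeeper releases require the command to be whitelisted via
4lw.commands.whitelist:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Minimal sketch: send ZooKeeper's "mntr" four-letter command and print
    // the reply. zk_avg_latency, zk_max_latency, zk_outstanding_requests and
    // zk_num_alive_connections are the lines that hint at an overloaded node.
    public class ZkLoadCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost"; // placeholder
            int port = 2181; // default ZooKeeper client port
            try (Socket socket = new Socket(host, port)) {
                socket.getOutputStream().write("mntr".getBytes(StandardCharsets.US_ASCII));
                socket.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    System.out.println(line);
                }
            }
        }
    }

Running that against each of the 6 nodes (observers included) during a
scale-up should show whether latency or outstanding requests spike.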


The problem arises when we try to scale up with more Solr nodes. In our
current setup we have 160 nodes connected to ZooKeeper, each node with 40
cores, so around 6,400 cores. When we scale up, 40 to 80 Solr nodes will
spin up at one time.

We are getting errors like the one below, which stop the index
distribution process:

2017-06-05 20:06:34.357 ERROR [pool-3-thread-2] o.a.s.c.CoreContainer - Error creating core [p44_b1_s37]: Could not get shard id for core: p44_b1_s37

org.apache.solr.common.SolrException: Could not get shard id for core: p44_b1_s37
    at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1496)
    at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1438)
    at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1548)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:815)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:757)
    at com.ancestry.solr.servlet.AcomServlet.indexTransfer(AcomServlet.java:319)
    at com.ancestry.solr.servlet.AcomServlet.lambda$indexTransferStart$1(AcomServlet.java:303)
    at com.ancestry.solr.service.IndexTransferWorker.run(IndexTransferWorker.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


We suspect this is caused by ZooKeeper not responding fast enough.
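
If that guess is right, one measurable symptom would be a backlog in the
Overseer's work queue, which SolrCloud keeps at /overseer/queue in
ZooKeeper: with 40-80 nodes each registering ~40 cores at once, state
updates pile up there, and waitForShardId can time out before the shard
assignment is ever published. A rough sketch of checking the queue depth,
assuming the stock ZooKeeper client library is on the classpath (the
connect string is a placeholder):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Rough sketch: count the entries queued for the Solr Overseer. A queue
    // that keeps growing while new nodes register suggests the Overseer (and
    // ZooKeeper behind it) cannot keep up with the burst of core creations.
    public class OverseerQueueDepth {
        public static void main(String[] args) throws Exception {
            String connect = "zk1:2181,zk2:2181,zk3:2181"; // placeholder ensemble
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(connect, 15000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            try {
                connected.await();
                // Path is relative to any ZK chroot Solr is configured with.
                int depth = zk.getChildren("/overseer/queue", false).size();
                System.out.println("Overseer queue depth: " + depth);
            } finally {
                zk.close();
            }
        }
    }

Sampling that number while an auto scaling group comes up would at least
confirm or rule out the "ZooKeeper can't keep up" theory.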
