lucene-solr-user mailing list archives

From "Kai 'wusel' Siering" <wusel...@uu.org>
Subject How to recover from failed SPLITSHARD?
Date Thu, 28 Sep 2017 23:11:11 GMT
Hi,

This is SolrCloud 6.5.1 on Ubuntu 16.04 LTS with OpenJDK 8: four Solr nodes in cloud mode,
with an external ZooKeeper.

I tried to split my collection's shard1 (500 GB) with SPLITSHARD, and it kind of worked. After
more than 8 hours the new shards left the "construction" state and entered "recovery" :( About
12 hours after that, Out of Memory errors with "could not create thread" occurred. Node 10.10.10.162
took over leadership of shard1, but since searches still returned errors, I stopped Solr on 10.10.10.161,
raised the heap from 24G to 31G and rebooted the system, just in case (and a good time to install
the latest patches). 10.10.10.161 came back, and shard1, shard1_0 and shard1_1 started recovery.
Unfortunately, 10.10.10.162, the leader for shard2, which was being split as well, then hit "something":
solr.log was no longer updated and the UI stopped responding, so in the end I stopped Solr
there as well (it finished instantly) and rebooted. Now both nodes are running with a 31G Java heap,
shard1 and shard2 are in sync, and I am trying to clean up before retrying.

Of shard2, only a shard2_0 without any replicas was left over, and DELETESHARD cleaned it up.

But shard1 has shard1_0 and shard1_1, each with two replicas. DELETESHARD errored out, so
I ran DELETEREPLICA on all of them. This worked, but "parts of" shard1_0 and shard1_1 are still
there and I cannot delete them:

$ wget -q -O - 'http://10.10.10.162:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS' | jq
[…]
          "shard1_0": {
            "range": "80000000-bfffffff",
            "state": "recovery_failed",
            "replicas": {}
          },
          "shard1_1": {
            "parent": "shard1",
            "shard_parent_node": "10.10.10.161:8983_solr",
            "range": "c0000000-ffffffff",
            "state": "recovery_failed",
            "shard_parent_zk_session": "98682039611162624",
            "replicas": {}
          }
[…]
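
For reference, a minimal sketch of how such leftovers could be picked out of the CLUSTERSTATUS
response programmatically. It assumes the usual JSON nesting (cluster, collections, collection
name, shards); the function name leftover_subshards and the sample data are hypothetical and
only mirror the excerpt above.

```python
# Sketch only: find sub-shards that are stuck in "recovery_failed" with no replicas.
# The nesting cluster -> collections -> <name> -> shards is assumed from CLUSTERSTATUS output.

def leftover_subshards(cluster_status, collection):
    """Return sorted names of shards in state recovery_failed that have no replicas."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    return sorted(
        name
        for name, info in shards.items()
        if info.get("state") == "recovery_failed" and not info.get("replicas")
    )


# Hypothetical sample shaped like the excerpt above (the replica entry is made up).
sample = {
    "cluster": {
        "collections": {
            "collection": {
                "shards": {
                    "shard1": {"state": "active", "replicas": {"core_node1": {}}},
                    "shard1_0": {"state": "recovery_failed", "replicas": {}},
                    "shard1_1": {"state": "recovery_failed", "replicas": {}},
                }
            }
        }
    }
}

print(leftover_subshards(sample, "collection"))
```

With real data, the dictionary would come from json.load() on the CLUSTERSTATUS response instead
of the inline sample.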


$ wget -O - 'http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection'
--2017-09-29 01:01:16--  http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection
Connecting to 10.10.10.161:8983... connected.
HTTP request sent, awaiting response... 400 Bad Request
2017-09-29 01:01:16 ERROR 400: Bad Request.
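
wget only reports the status line here, not the response body, so the reason for the 400 is
not visible. A small sketch of how the body could be captured instead; the helper names
deleteshard_url and call are hypothetical, and the parameters are the same ones used in the
request above plus wt=json.

```python
# Sketch only: issue a Collections API request and keep the body even on HTTP errors.
import urllib.error
import urllib.parse
import urllib.request


def deleteshard_url(base, collection, shard):
    """Build the DELETESHARD URL with the same parameters as the wget call above."""
    query = urllib.parse.urlencode(
        {"action": "DELETESHARD", "collection": collection, "shard": shard, "wt": "json"}
    )
    return f"{base}/solr/admin/collections?{query}"


def call(url):
    """Return (status, body); on 4xx/5xx the body is read from the HTTPError."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


# Usage (not executed here, since it needs a reachable Solr node):
# status, body = call(deleteshard_url("http://10.10.10.161:8983", "collection", "shard1_1"))
# print(status, body)
```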

Any hint on how to fix this is appreciated ;)

Regards,
-kai



