Subject: Re: shard1 gone missing ...
From: Mark Miller <markrmiller@gmail.com>
Date: Fri, 31 Jan 2014 10:22:55 -0500
To: solr-user@lucene.apache.org
Message-Id: <6143D0E3-E007-451C-B1CB-AAEB2A0118F9@gmail.com>
In-Reply-To: <52EBBDA9.3080901@gmail.com>

Would probably need to see some logs to have an idea of what happened. Would also be nice to see the after state of zk in a text dump.

You should be able to fix it, as long as you have the index on a disk; just make sure it is where it is expected and manually update the clusterstate.json. Would be good to take a look at the logs and see if it tells anything first, though.

I'd also highly recommend you try moving to Solr 4.6.1 when you can. We have fixed many, many, many bugs around SolrCloud in the four releases since 4.4. You can follow the progress in the CHANGES file we update for each release.

I wrote a little about 4.6.1 as it relates to SolrCloud here: https://plus.google.com/+MarkMillerMan/posts/CigxUPN4hbA

- Mark

http://about.me/markrmiller

On Jan 31, 2014, at 10:13 AM, David Santamauro wrote:

> Hi,
>
> I have a strange situation. I created a collection with 4 nodes (separate servers, numShards=4), then proceeded to index data ... all had been seemingly well until this morning, when I had to reboot one of the nodes.
>
> After the reboot, the node I rebooted went into recovery mode! This is completely illogical, as there is 1 shard per node (no replicas).
>
> What could have possibly happened to 1) trigger a recovery and 2) make the node think it has a replica to even recover from?
>
> Looking at the graph on the Solr admin page, it shows that shard1 disappeared and the server that was rebooted appears in a recovering state under the server home to shard2.
>
> I then looked at clusterstate.json and it confirms that shard1 is completely missing and shard2 now has a replica ... I'm baffled, confused, dismayed.
>
> Versions:
> Solr 4.4 (4 nodes with Tomcat container)
> ZooKeeper 3.4.5 (5-node ensemble)
>
> Oh, and I'm assuming shard1 is completely corrupt.
>
> I'd really appreciate any insight.
>
> David
>
> PS: I have a copy of all the shards backed up. Is there a way to possibly rsync shard1 back into place and "fix" clusterstate.json manually?
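[For anyone hitting this thread from the archives: Mark's suggestion amounts to dumping the ZK state (e.g. `get /clusterstate.json` from zkCli.sh) and checking which shards actually have an active replica before repairing the file by hand. Below is a hedged sketch of that check in Python. The JSON layout (collection -> "shards" -> "replicas" -> per-replica "state") matches the Solr 4.x clusterstate.json format; the function name and the sample dump are illustrative, not from the thread.]

```python
import json

def missing_or_down_shards(clusterstate, collection, expected_shards):
    """Return the names of expected shards that are absent from the
    clusterstate or have no replica in the 'active' state."""
    shards = clusterstate.get(collection, {}).get("shards", {})
    bad = []
    for name in expected_shards:
        replicas = shards.get(name, {}).get("replicas", {})
        if not any(r.get("state") == "active" for r in replicas.values()):
            bad.append(name)
    return bad

# Hypothetical dump mirroring the situation described above: shard1 has
# vanished and the rebooted node shows up as a recovering replica of shard2.
dump = json.loads("""
{
  "collection1": {
    "shards": {
      "shard2": {
        "replicas": {
          "core_node2": {"state": "active"},
          "core_node5": {"state": "recovering"}
        }
      },
      "shard3": {"replicas": {"core_node3": {"state": "active"}}},
      "shard4": {"replicas": {"core_node4": {"state": "active"}}}
    }
  }
}
""")

print(missing_or_down_shards(dump, "collection1",
                             ["shard1", "shard2", "shard3", "shard4"]))
# -> ['shard1']
```

[Running this against a real text dump of /clusterstate.json would flag shard1 as missing before any manual edit, which is the "look first" step Mark recommends ahead of rewriting the file.]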