lucene-solr-user mailing list archives

From Tim Potter <tim.pot...@lucidworks.com>
Subject RE: Solr failure results in misreplication?
Date Wed, 18 Dec 2013 15:09:48 GMT
Any chance you still have the logs from the servers hosting 1 & 2? I would open a JIRA
ticket for this one as it sounds like something went terribly wrong on restart. 

You can update the /clusterstate.json to fix this situation.
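A minimal sketch of that round trip using the zkcli.sh helper shipped with Solr 4.x; the script location and the ZooKeeper address below are assumptions, so adjust them for your install:

```shell
# Pull the current cluster state out of ZooKeeper, hand-edit it, and push it back.
# ZKCLI path and ZKHOST are assumptions -- adjust for your installation.
ZKCLI=example/scripts/cloud-scripts/zkcli.sh
ZKHOST=localhost:2181

$ZKCLI -zkhost $ZKHOST -cmd getfile /clusterstate.json clusterstate.json
# ... edit clusterstate.json locally to remove the bogus replica entry ...
$ZKCLI -zkhost $ZKHOST -cmd putfile /clusterstate.json clusterstate.json
```

It's safest to do this with the affected node stopped, then restart it so it picks up the corrected state.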

Lastly, it's recommended to use an OOM killer script with SolrCloud so that you don't end
up with zombie nodes hanging around in your cluster. I use something like:

-XX:OnOutOfMemoryError="$SCRIPT_DIR/oom_solr.sh $x %p"

$x in the start script is the port # and %p is the process ID. My oom_solr.sh script is something
like this:

#!/bin/bash
# Args passed in via -XX:OnOutOfMemoryError: the port suffix and the JVM's PID.
SOLR_PORT=$1
SOLR_PID=$2
NOW=$(date +"%F%T")
(
echo "Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT"
# Hard-kill the wedged JVM so it can't linger as a zombie node in the cluster.
kill -9 "$SOLR_PID"
echo "Killed process $SOLR_PID"
) | tee "oom_killer-89$SOLR_PORT-$NOW.log"

I use supervisord to handle the restart after the process gets killed by the OOM killer, which
is why you don't see the restart in this script ;-)
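The supervisord side can be as small as a single program block. This is a hypothetical sketch; the program name, command path, and values are placeholders, not my actual config:

```ini
; Hypothetical supervisord entry -- program name and paths are placeholders.
[program:solr]
command=/opt/solr/bin/start-solr.sh   ; your Solr start script, run in the foreground
autorestart=true                      ; respawn after the OOM killer does kill -9
stopsignal=TERM
```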

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

________________________________________
From: youknowwho@heroicefforts.net <youknowwho@heroicefforts.net>
Sent: Tuesday, December 17, 2013 10:31 PM
To: solr-user@lucene.apache.org
Subject: Solr failure results in misreplication?

My client has a Solr 4.6 test cluster with three instances 1, 2, and 3 hosting shards 1, 2,
and 3, respectively.  There is no replication in this cluster.  We started receiving OOMEs
during indexing; likely the batches were too large.  The cluster was rebooted to restore the
system.  However, upon reboot, instance 2 now shows as a replica of shard 1, and its shard 2
is down with a null range.  Instance 2 is queryable with shards.tolerant=true&distrib=false
and returns a different set of records than instance 1 (as would be expected during normal
operations).  clusterstate.json is similar to the following:

mycollection: {
  shard1: {
    range: 8000000-d554ffff,
    state: active,
    replicas: {
      instance1: {...state:active...},
      instance2: {...state:active...}
    }
  },
  shard3: {...state:active...},
  shard2: {
    range: null,
    state: active,
    replicas: {
      instance2: {...state:down...}
    }
  },
  maxShardsPerNode: 1,
  replicationFactor: 1
}

Any ideas on how this would come to pass?  Would manually correcting the clusterstate.json
in Zk correct this situation?