lucene-solr-user mailing list archives

From Shawn Smith
Subject ReplicationHandler reports incorrect replication failures
Date Fri, 26 Mar 2010 13:59:37 GMT
We're using Solr 1.4 Java replication, which seems to be working
nicely.  While writing production monitors to check that replication
is healthy, I think we've run into a bug in the status reporting of
the "../solr/replication?command=details" command.  (I know it's

Our monitor parses the replication?command=details XML and checks that
replication lag is reasonable by diffing the indexVersion of the
master and slave indices to make sure it's within a reasonable time

Our monitor also compares the first elements of
"indexReplicatedAtList" and "replicationFailedAtList" lists to see if
the last replication attempt failed.  This is where we're having a
problem with the monitor throwing false errors.  It looks like there's
a bug that causes successful replications to be considered failures.
The bug is triggered immediately after a slave restarts when the slave
is already in sync with the master.  Each no-op replication attempt
after restart is considered a failure until something on the master
changes and replication has to actually do work.
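
A check like the one described above could be sketched roughly as follows. This is only an illustration: it assumes the two lists have already been pulled out of the details XML as strings, newest entry first, and that a failed attempt records the same timestamp at the head of both lists. The class and method names (ReplicationCheck, lastAttemptFailed) are hypothetical and not part of Solr.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical monitor helper; not part of Solr.
class ReplicationCheck {

    // Assumption: both lists are ordered newest first, and a failed attempt
    // records the same timestamp string at the head of both lists.
    static boolean lastAttemptFailed(List<String> replicatedAt, List<String> failedAt) {
        if (replicatedAt.isEmpty() || failedAt.isEmpty()) {
            return false; // no attempts yet, or no failure ever recorded
        }
        return replicatedAt.get(0).equals(failedAt.get(0));
    }

    public static void main(String[] args) {
        // Placeholder timestamps standing in for the values from the XML.
        List<String> replicated = Arrays.asList("t3", "t2", "t1");
        List<String> failed = Arrays.asList("t3", "t1");
        System.out.println(lastAttemptFailed(replicated, failed));
    }
}
```

With a check of this shape, every no-op poll that gets logged as a failure makes the monitor fire, even though the slave is in sync.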

From the code, it looks like "SnapPuller.successfulInstall" starts out
false on restart.  If the slave starts out in sync with the master,
then each no-op replication poll leaves "successfulInstall" set to
false, which makes SnapPuller.logReplicationTimeAndConfFiles log the
poll as a failure.  SnapPuller.successfulInstall stays false until the
first time replication actually has to do something, at which point it
gets set to true, and then everything is OK.
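
The sequence described above can be reproduced with a small toy model. This is not the actual SnapPuller code, only a sketch of the flag behavior as described in this message. The class PollerModel, its fields other than successfulInstall, and the version numbers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behavior described above; NOT the actual SnapPuller code.
class PollerModel {
    private boolean successfulInstall = false; // starts out false after a restart
    private long slaveVersion;
    final List<String> outcomes = new ArrayList<String>();

    PollerModel(long slaveVersionAtStartup) {
        this.slaveVersion = slaveVersionAtStartup;
    }

    void poll(long masterVersion) {
        if (masterVersion != slaveVersion) {
            slaveVersion = masterVersion; // stand-in for pulling and installing the index
            successfulInstall = true;
        }
        // A no-op poll never touches the flag, so it is logged with whatever
        // value the flag already had.
        outcomes.add(successfulInstall ? "ok" : "failed");
    }

    public static void main(String[] args) {
        PollerModel slave = new PollerModel(100L); // slave restarts already in sync
        slave.poll(100L); // no-op
        slave.poll(100L); // no-op
        slave.poll(101L); // master changed, real work happens
        slave.poll(101L); // no-op again
        System.out.println(slave.outcomes);
    }
}
```

In this model the two no-op polls right after startup are recorded as "failed", and every poll from the first real install onward is recorded as "ok", which matches the pattern the monitor is seeing.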

