db-derby-dev mailing list archives

From "Bergquist, Brett" <BBergqu...@canoga.com>
Subject Question on recovering after a replication break because of a system failure
Date Fri, 10 Jan 2014 21:45:22 GMT
The reason I am posting to the dev list is that I might want to look into improving Derby in
this area.

Just so that I understand correctly, the steps for setting up replication are:

*         Make a copy of the database to the slave

*         Start replication on the slave and on the master
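
In Derby, these two steps correspond to the startSlave and startMaster connection URL attributes. A sketch of the setup (the database name, host, and port here are illustrative, not from an actual deployment):

```sql
-- On the slave machine, after copying the database files from the master,
-- start slave mode; this connection attempt blocks until the master connects:
CONNECT 'jdbc:derby:salesDB;startSlave=true;slaveHost=slave.example.com;slavePort=4851';

-- On the master machine, start shipping log records to the slave:
CONNECT 'jdbc:derby:salesDB;startMaster=true;slaveHost=slave.example.com;slavePort=4851';
```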

Now assume that this has been working right along and all is well, and then the system with
the master fails.  Replication is now broken, and the slave can be restarted in
non-replication mode.  Time goes by and changes are made to the non-replicated database on
the slave.  Finally the master machine is brought back online.
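
For reference, the Derby mechanism for restarting the slave copy as an ordinary, writable database is the failover connection attribute (database name again illustrative):

```sql
-- On the slave, after the master has failed, stop replication and turn
-- the slave copy into a normal database that accepts transactions:
CONNECT 'jdbc:derby:salesDB;failover=true';
```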

So to get replication going we need to:

*         Copy the database from the slave to the master

*         Start replication on the slave and on the master

This assumes that we prefer the original master to remain the master, but even if that is
not the case and the old slave is going to become the new master, we still need to copy the
database between the two machines before starting replication again.

Given a database that is fairly large (say on the order of 200 GB) and less than a gigabit
connection between the master and the slave, the transfer could take a fairly long time.
Unfortunately, during this transfer neither database can be used.  So while replication
allows a quick failover on the initial failure, re-establishing replication once the failure
has been resolved can cause substantial downtime.

So my question: is there any way that this downtime can be reduced?  Could something be done
with restoring a backup database, applying the logs, and then enabling replication?  Something like:

*         Make a file-system-level backup of the slave (using something like a freeze plus a ZFS
snapshot, this can take only a couple of seconds) and then allow the slave to continue

o   This assumes that the database transaction logs are retained so that they can be replayed later

*         Transfer the database to the master

*         Transfer the logs

o   Replay each log on the master somehow to bring the master as close to the slave as possible

*         Stop the slave so that it becomes consistent

*         Transfer the last log to the master and replay it there

*         Enable replication on the master and the slave
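
The freeze step in the first bullet maps onto Derby's existing freeze/unfreeze system procedures, so at least that part needs no new machinery; a sketch (the ZFS command in the comment is illustrative):

```sql
-- Quiesce writes to the database files so a file-system snapshot is safe:
CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE();
-- take the ZFS snapshot here, e.g. zfs snapshot pool/derby@repl-base
CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE();
```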

Basically this limits the downtime while the database and log file transfers are taking
place, leaving only a small window of downtime in which the databases need to come into
sync before replication can be started again.

Any thoughts on this?   Is this an approach that is worth looking at?
