db-derby-dev mailing list archives

From "Bergquist, Brett" <BBergqu...@canoga.com>
Subject RE: Question on recoverying after replication break because of a system failure
Date Tue, 14 Jan 2014 14:39:48 GMT
Are there no comments on this?  I am just looking for some feedback to see whether this might be
an avenue worth pursuing.

Or maybe another approach that might be better.   Conceptually:

*         Create a new procedure to allow Derby to "prepare for replication" that would be
executed on the slave.   This would accept the output of an online backup and any changes
made since the backup was taken (by processing the logs, I guess) and would switch to replication
mode when instructed.

*         Create a new procedure to allow Derby to "initiate replication" that would be executed
on the master.   This would perform the equivalent of the online backup with log archive mode
(to keep track of the changes of the database since the backup was started) and ship the backup
and logs to the slave where they would be processed to get the slave database in sync with
the master and then switch to replication mode.
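The two procedures above do not exist in Derby today. The closest existing building blocks are
the online backup with log archive mode and roll-forward recovery, so a minimal sketch of those
may help frame the idea. The class name, method names, and paths here are hypothetical; only the
system procedure and the rollForwardRecoveryFrom connection attribute are Derby's own.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Sketch only: wraps the existing Derby calls that the proposed
// "initiate replication" procedure would conceptually build on.
final class OnlineBackup {
    // Existing Derby system procedure: online backup that also turns on log archiving.
    static final String BACKUP_WITH_LOG_ARCHIVE_SQL =
        "CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE_AND_ENABLE_LOG_ARCHIVE_MODE(?, ?)";

    // Runs the online backup; the database stays available while it runs.
    static void backupWithLogArchive(Connection conn, String backupDir,
                                     boolean deleteOldArchivedLogs) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(BACKUP_WITH_LOG_ARCHIVE_SQL)) {
            cs.setString(1, backupDir);
            // Second argument is a SMALLINT flag: non-zero deletes older archived logs.
            cs.setShort(2, (short) (deleteOldArchivedLogs ? 1 : 0));
            cs.execute();
        }
    }

    // Builds the URL that restores from a backup and replays the archived logs.
    // backupPath is assumed to point at the backed-up database directory.
    static String rollForwardUrl(String dbName, String backupPath) {
        return "jdbc:derby:" + dbName + ";rollForwardRecoveryFrom=" + backupPath;
    }
}
```

Roll-forward recovery as it exists today is a one-shot restore, so shipping the logs to the slave
and switching to replication mode afterwards is the part that would need new work.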

What this would try to achieve is to get the slave up to date with the master and then
proceed in replication mode without requiring downtime on the master.   The master would
continue to run just like it does during an online backup, and once the slave has a copy
of the database up to the point where it is consistent with the master, replication would
begin.

Any thoughts on this?

From: Bergquist, Brett [mailto:BBergquist@canoga.com]
Sent: Friday, January 10, 2014 4:45 PM
To: derby-dev@db.apache.org
Subject: Question on recoverying after replication break because of a system failure

The reason I am posting to the dev list is that I might want to look into improving Derby
in this area.

Just so that I understand correctly, the steps for replication are:

*         Make a copy of the database to the slave

*         Start replication on the slave and on the master
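For reference, the second step is driven by connection URL attributes. A minimal sketch of
building those URLs follows; the class name, method names, and the example host are hypothetical,
while startSlave, startMaster, slaveHost, and slavePort are the attributes Derby documents for
replication.

```java
// Sketch only: builds the JDBC URLs used to start Derby replication.
final class ReplicationUrls {
    // Port Derby documents as the default for the slave listener.
    static final int DEFAULT_SLAVE_PORT = 4851;

    // URL to connect with on the slave side; it then waits for the master.
    static String slaveUrl(String dbName, String slaveHost, int slavePort) {
        return "jdbc:derby:" + dbName + ";startSlave=true;slaveHost=" + slaveHost
            + ";slavePort=" + slavePort;
    }

    // URL to connect with on the master side, pointing at the slave's host and port.
    static String masterUrl(String dbName, String slaveHost, int slavePort) {
        return "jdbc:derby:" + dbName + ";startMaster=true;slaveHost=" + slaveHost
            + ";slavePort=" + slavePort;
    }

    public static void main(String[] args) {
        System.out.println(slaveUrl("myDB", "slave.example.com", DEFAULT_SLAVE_PORT));
        System.out.println(masterUrl("myDB", "slave.example.com", DEFAULT_SLAVE_PORT));
    }
}
```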

Now assume that this is working right along and all is well, and then the system with the master
fails.   Replication is now broken, and the slave can be restarted in non-replication mode.
  Time goes on and changes are made to the non-replicated database on the slave.   Finally
the master machine is brought back online.

So to get replication going we need to:

*         Copy the database from the slave to the master

*         Start replication on the slave and on the master

This assumes that we have an affinity for keeping the original master as the master, but even if
this is not the case and the old slave is going to become the new master, we still need to copy
the database from the slave to the master before starting replication again.

Given a database that is fairly large (say on the order of 200 GB) and a connection between the
master and slave that is slower than Gigabit, the transfer could take a fairly long time.
  Unfortunately, during this transfer time neither database can be used.    So while replication
allows quick failover after an initial failure, re-establishing replication once the failure
has been resolved can cause substantial downtime.
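To put a rough number on that, here is a small back-of-the-envelope sketch. The 100 Mbit/s link
speed is an assumption chosen for illustration (the only given is that the link is slower than
Gigabit), and the estimate ignores protocol overhead, compression, and disk speed.

```java
// Sketch only: idealized transfer time for a database copy over a network link.
final class TransferEstimate {
    // Returns hours needed to move sizeGigabytes (decimal GB) at linkMegabitsPerSecond.
    static double transferHours(double sizeGigabytes, double linkMegabitsPerSecond) {
        double bits = sizeGigabytes * 1e9 * 8;                   // bytes to bits
        double seconds = bits / (linkMegabitsPerSecond * 1e6);   // Mbit/s to bit/s
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // Assumed 100 Mbit/s link versus a Gigabit link, for a 200 GB database.
        System.out.printf("100 Mbit/s: %.2f hours%n", transferHours(200, 100));
        System.out.printf("1 Gbit/s:   %.2f hours%n", transferHours(200, 1000));
    }
}
```

Under those assumptions the copy alone takes roughly four and a half hours on the slower link,
all of which is downtime with the current procedure.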

So my question is: is there any way that this downtime can be reduced?   Could something be done
by restoring a backup database, replaying the logs, and then enabling replication?     Something
like:

*         Make a file system level backup of the slave (using something like freeze and ZFS
snapshot, this can take only a couple of seconds) and then allow the slave to continue

o   Assuming that the database logs are being kept so that they can be replayed later

*         Transfer the database to the master

*         Transfer the logs

o   Replay each log on the master somehow to get the master to catch up to the slave as close
as possible

*         Stop the slave so that it becomes consistent

*         Transfer the last log to the master and replay it on the master

*         Enable replication on the master and the slave
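The first step of that list can be sketched with calls Derby already has. The class, interface,
and method names below are hypothetical, and the snapshot itself (for example a ZFS snapshot) is
left as a caller-supplied action; only the freeze and unfreeze system procedures are Derby's own.
The later "replay each log" steps have no equivalent that I know of and are not shown.

```java
import java.sql.SQLException;

// Sketch only: freeze the database, take a file system snapshot, then unfreeze.
final class FrozenSnapshot {
    static final String FREEZE_SQL = "CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()";
    static final String UNFREEZE_SQL = "CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()";

    // Hypothetical abstraction over "run this SQL"; a java.sql.Statement's
    // execute(String) fits it, e.g. snapshotWhileFrozen(stmt::execute, action).
    interface SqlRunner {
        void run(String sql) throws SQLException;
    }

    // Blocks writes, runs the snapshot action, and always unfreezes afterwards.
    static void snapshotWhileFrozen(SqlRunner db, Runnable takeSnapshot) throws SQLException {
        db.run(FREEZE_SQL);
        try {
            takeSnapshot.run();   // should only take a couple of seconds
        } finally {
            db.run(UNFREEZE_SQL); // let the slave continue even if the snapshot failed
        }
    }
}
```

The finally block matters here: if the snapshot fails, the database should not be left frozen.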

Basically, this limits the downtime: the slave stays available while the database and log file
transfers are taking place, leaving only a small window of downtime in which the databases are
brought into sync before replication can be started again.

Any thoughts on this?   Is this an approach that is worth looking at?
