db-derby-dev mailing list archives

From "Bergquist, Brett" <BBergqu...@canoga.com>
Subject RE: Question on recoverying after replication break because of a system failure
Date Tue, 14 Jan 2014 17:48:43 GMT
Actually the expensive part is having the "master" system down to ensure a completely accurate
copy of the database is being made on the "slave".  Note that my "master" here could actually
be the (original) slave system when the (original) master system is repaired.

Derby's replication once the systems are in sync and running seems to be okay.   It is the
initial setup time to get the "slave" database to be the same as the "master" database that
is expensive because currently (unless I am wrong and correct me here if so), the master cannot
be modified while this is occurring.  Likewise, restoring replication once a failed system
is repaired is expensive for the same reason.
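
To put a rough number on that cost, here is a back-of-the-envelope estimate in Java for the
copy step, using the 200 GB database size mentioned in the quoted message. The link speeds are
assumptions for illustration only (the thread says only that the link is slower than gigabit),
sizes are treated as decimal, and protocol overhead is ignored.

```java
// Rough estimate of how long the initial database copy takes.
// Assumptions (not from the thread): decimal units, no protocol overhead,
// and the example link speeds used in main().
final class TransferEstimate {

    /** Seconds needed to move sizeGigabytes over a link of linkMegabitsPerSecond. */
    static double transferSeconds(double sizeGigabytes, double linkMegabitsPerSecond) {
        if (sizeGigabytes < 0 || linkMegabitsPerSecond <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and link speed > 0");
        }
        double megabits = sizeGigabytes * 1000.0 * 8.0; // GB -> MB -> Mbit
        return megabits / linkMegabitsPerSecond;
    }

    public static void main(String[] args) {
        double sizeGb = 200.0;               // size quoted in the thread
        double[] linkSpeeds = {100.0, 1000.0}; // hypothetical link speeds in Mbit/s
        for (double mbps : linkSpeeds) {
            double hours = transferSeconds(sizeGb, mbps) / 3600.0;
            System.out.printf("%.0f GB at %.0f Mbit/s: about %.1f hours%n", sizeGb, mbps, hours);
        }
    }
}
```

Under these assumptions a 100 Mbit/s link needs roughly 4.4 hours for the copy alone, and the
master cannot be modified for that whole time.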

I guess I will look at how other databases handle this case.  I can't imagine that adding a
"replication slave" requires the master database to be down and quiescent.  I would imagine
that it is possible to add a "replication slave" while the "replication master" is hot and
running.   This is what I would like Derby to be able to do (note that I am not asking for
someone else to do it, as it could very well be a contribution from me).

An analogy would be replacing a failed disk in a RAID array.  The RAID array continues to
operate with the failed disk installed.  Now the disk is removed and a new one is installed.
  Access to the RAID array is not blocked while the RAID rebuilds the data on the missing

It would be really useful for Derby to operate similarly, whereby the replication database can
be rebuilt in the background.  Note that while this is being done the replication is degraded
(not operating at all, of course, with the current one-to-one replication) just as a RAID array is
while the disk is being resilvered, but once this process is done, the replication is
up and running.

-----Original Message-----
From: Rick Hillegas [mailto:rick.hillegas@oracle.com] 
Sent: Tuesday, January 14, 2014 11:40 AM
To: derby-dev@db.apache.org
Subject: Re: Question on recoverying after replication break because of a system failure

Hi Brett,

I'm afraid that I'm not following your proposal. Some comments inline...

On 1/10/14 1:45 PM, Bergquist, Brett wrote:
> The reason I am posting to the dev list is that I might want to look 
> into improving Derby in this area.
> Just so that I understand correctly, the steps for replication are:
> * Make a copy of the database to the slave
This seems to be the expensive step which results in long downtime.
> * Start replication on the slave and on the master
> Now assume that this is working right along and all is well and then 
> the system with the master fails.   So replication is broken and then 
> the slave can be restarted in non-replication mode.   Time goes along 
> and changes are made to the non-replicated database on the slave.   
> Finally the master machine is brought back on line.
> So to get replication going we need to:
> * Copy the database from the slave to the master
> * Start replication on the slave and on the master
> This assumes that we have an affinity for having the master being the 
> master but even if this is not the case and the old slave is going to 
> become the new master, we need to copy the database from the slave to 
> the master before starting replication again.
> Given a database that is fairly large (say on the order of 200 GB) and 
> without a gigabit connection between the master and slave, this could be a
> fairly long time for the transfer to occur.   Unfortunately during 
> this transfer time, neither database can be used.    So while 
> replication allows quick fail over in an initial failure, 
> re-establishing the replication when the failure has been resolved can 
> cause a substantially long downtime.
> So my question is: is there any way that this downtime can be reduced?   
> Could something be done by restoring a backup database and using the 
> logs, and then enabling replication?     Something like:
> * Make a file system level backup of the slave (using something like 
>   freeze and ZFS snapshot, this can take only a couple of seconds) and 
>   then allow the slave to continue
>   o Assuming that the database logs are being used so that they can be 
>     replayed later
> * Transfer the database to the master
I don't understand how this step is different from the expensive step you want to eliminate.

> * Transfer the logs
>   o Replay each log on the master somehow to get the master to catch up 
>     to the slave as close as possible
> * Stop the slave so that it becomes consistent
> * Transfer the last log to the master and replay the master log
> * Enable replication on the master and the slave
> Basically this limits the downtime while the database transfer and log 
> file transfer are taking place and leaves a small window of down 
> time where the databases need to get in sync, after which replication 
> can be started again.
> Any thoughts on this?   Is this an approach that is worth looking at?
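
For reference, the pieces of this proposal map loosely onto features Derby already documents:
online backup with log archiving, roll-forward recovery, and the replication connection
attributes. The Java sketch below only assembles the connection URLs and the backup call, so it
runs without a Derby installation. The database name, host, port, and paths are hypothetical,
and whether roll-forward recovery can be chained into startSlave/startMaster the way the steps
suggest is the open question in this thread, not something the sketch demonstrates.

```java
// Builds the strings a client would pass to Derby for each step of the proposal.
// Hypothetical values: database name, slave host, port, and backup path.
final class ReplicationSteps {

    // Online backup that also turns on log archiving for later roll-forward.
    // Parameters: backup directory, and a flag for deleting older archived logs (0 = keep).
    static final String BACKUP_WITH_LOG_ARCHIVE =
        "CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE_AND_ENABLE_LOG_ARCHIVE_MODE(?, ?)";

    /** URL that restores from a backup copy and replays the archived logs. */
    static String rollForwardUrl(String db, String backupCopyPath) {
        return "jdbc:derby:" + db + ";rollForwardRecoveryFrom=" + backupCopyPath;
    }

    /** URL used on the slave side to put its copy of the database in slave mode. */
    static String startSlaveUrl(String db, String slaveHost, int slavePort) {
        return "jdbc:derby:" + db + ";startSlave=true;slaveHost=" + slaveHost
            + ";slavePort=" + slavePort;
    }

    /** URL used on the master side to start shipping log records to the slave. */
    static String startMasterUrl(String db, String slaveHost, int slavePort) {
        return "jdbc:derby:" + db + ";startMaster=true;slaveHost=" + slaveHost
            + ";slavePort=" + slavePort;
    }

    public static void main(String[] args) {
        String db = "myDB";              // hypothetical database name
        String slaveHost = "slave-host"; // hypothetical host
        int port = 4851;                 // hypothetical port
        System.out.println("1. backup:       " + BACKUP_WITH_LOG_ARCHIVE);
        System.out.println("2. roll forward: " + rollForwardUrl(db, "/backup/myDB"));
        System.out.println("3. start slave:  " + startSlaveUrl(db, slaveHost, port));
        System.out.println("4. start master: " + startMasterUrl(db, slaveHost, port));
    }
}
```

Today the copy in step 2 still has to finish while the source database is unmodified, which is
why the log replay part of the proposal is the interesting piece.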
