geode-issues mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-657) Two backups occurring simultaneously can cause the BackupLock to be obtained but not released
Date Fri, 12 Feb 2016 01:38:18 GMT

    [ https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143864#comment-15143864 ]

ASF subversion and git services commented on GEODE-657:
-------------------------------------------------------

Commit 3adb0b8e17436af555e1acbe2bf729bf72365305 in incubator-geode's branch refs/heads/develop
from [~upthewaterspout]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=3adb0b8 ]

GEODE-657: Add an abort flag to the FinishBackupMessage

If the prepare phase of a backup fails, it can leave the backup in a
prepared state on several nodes. The FinishBackupMessage should always
be sent to clean up the backup. This change adds an abort flag to
FinishBackupMessage and always sends that message regardless of
failures during the prepare phase.
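The coordinator-side shape of this fix can be sketched as follows. This is a minimal illustration, not Geode's actual code: the class and interface names (`BackupCoordinatorSketch`, `Member`) are hypothetical, and the key point is only that the finish message is sent to every prepared member even when prepare fails, with an abort flag telling them to release their lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix described in the commit message above.
// Not Geode's real API; names are illustrative only.
public class BackupCoordinatorSketch {

    /** Stand-in for a member that acquires a backup lock during prepare. */
    public interface Member {
        void prepareBackup() throws Exception; // may fail: "Backup already in progress"
        void finishBackup(boolean abort);      // releases the member's backup lock
    }

    /** Runs the prepare phase, then unconditionally sends finish to every
     *  member that was prepared, passing abort=true if any prepare failed. */
    public static boolean backup(List<Member> members) {
        boolean prepared = true;
        List<Member> preparedMembers = new ArrayList<>();
        for (Member m : members) {
            try {
                m.prepareBackup();
                preparedMembers.add(m);
            } catch (Exception e) {
                prepared = false;
                break; // earlier members are now prepared and must be cleaned up
            }
        }
        // Core of GEODE-657's fix: the finish message is always sent, so a
        // prepared member never holds its BackupLock forever.
        for (Member m : preparedMembers) {
            m.finishBackup(!prepared);
        }
        return prepared;
    }
}
```

Before the fix, a failure in the loop above would simply propagate, and members already in `preparedMembers` would never see a finish message.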


> Two backups occurring simultaneously can cause the BackupLock to be obtained but not released
> ---------------------------------------------------------------------------------------------
>
>                 Key: GEODE-657
>                 URL: https://issues.apache.org/jira/browse/GEODE-657
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Barry Oglesby
>            Assignee: Dan Smith
>
> Backup is done in three phases, namely {{FlushToDiskRequest}}, {{PrepareBackupRequest}} and {{FinishBackupRequest}}.
> Here is some debugging in the locator showing the backup phases:
> {noformat}
> [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers STARTED
> [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send PrepareBackupRequest
> [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to PrepareBackupRequest
> [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers COMPLETED
> {noformat}
> Here is some debugging in one of the members. The others are similar.
> {noformat}
> [warning 2015/12/01 14:37:38.795 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse received FlushToDiskRequest
> [warning 2015/12/01 14:37:38.974 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse sending FlushToDiskResponse
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to start backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to prepare backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] BackupLock lockForBackup invoked
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse sending PrepareBackupResponse
> [warning 2015/12/01 14:37:38.986 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse received FinishBackupRequest
> [warning 2015/12/01 14:37:39.970 PST  <Pooled Message Processor 2> tid=0x66] BackupLock unlockForBackup invoked
> [warning 2015/12/01 14:37:49.885 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse sending FinishBackupResponse
> {noformat}
> Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is where it is released (in {{BackupLock unlockForBackup}}).
> All members operate independently, so if one completes its backup before the others, this issue can occur: that member is available to start a new backup, but the others aren't.
> In this case, the member successfully processes the {{PrepareBackupRequest}} and takes the {{BackupLock}}.
> The other members do not successfully process the {{PrepareBackupRequest}} and do not take the {{BackupLock}}:
> {noformat}
> [warning 2015/12/01 14:46:02.447 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: About to start backup
> [error 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] Error processing request class com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
> java.io.IOException: Backup already in progress
> 	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
> 	at com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
> 	at com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
> 	at java.lang.Thread.run(Thread.java:745)
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: Returning exception java.io.IOException: Backup already in progress
> {noformat}
> Once this happens, the members are out of sync. One has taken the {{BackupLock}}; the others have not. The member that has taken the {{BackupLock}} will never release it.
> One solution would be to prevent any backup from starting while one is in progress (enforced in the locator). Another would be to force the member to release the {{BackupLock}} when other members fail to acquire it. I think this would require a new message, though.
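The first proposed solution, a locator-side guard that refuses to start a second backup while one is in flight, could be sketched with a single atomic flag. The class and method names here (`LocatorBackupGuard`, `tryStartBackup`, `backupFinished`) are hypothetical, not Geode APIs; the sketch only shows that `compareAndSet` lets exactly one caller win the right to run a backup.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the first proposed fix: the locator serializes
// backups with one atomic flag. Names are illustrative, not Geode APIs.
public class LocatorBackupGuard {
    private final AtomicBoolean backupInProgress = new AtomicBoolean(false);

    /** Returns true only for the caller that wins the right to run a backup;
     *  concurrent callers see false and can report "backup in progress". */
    public boolean tryStartBackup() {
        return backupInProgress.compareAndSet(false, true);
    }

    /** Must be called when the backup completes or fails, so a later
     *  backup can start. */
    public void backupFinished() {
        backupInProgress.set(false);
    }
}
```

With such a guard in the locator, the two simultaneous backups described in this issue could never both reach the prepare phase, so the lock mismatch on the members would not arise.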



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
