cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Suich" <chris.su...@netapp.com>
Subject Re: Review Request 14522: [CLOUDSTACK-4771] Support Revert VM Disk from Snapshot
Date Wed, 16 Oct 2013 01:58:49 GMT


> On Oct. 15, 2013, 2:41 p.m., John Burwell wrote:
> > engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/PrimaryDataStoreDriver.java,
line 28
> > <https://reviews.apache.org/r/14522/diff/2/?file=364638#file364638line28>
> >
> >     I don't know that a flag is necessary to represent this capability since we
already have a revertSnapshot method.  Personally, I dislike the complexity of exposing methods
on an interface that are guarded by another.  I would prefer to see revertSnapshot enhanced
to throw an exception when a snapshot can't be reverted.  In addition to be a simpler approach,
it covers both the scenario when a device does not support reverting snapshots and, as well
as, the device or the snapshot not being a revertable state.
> 
> Chris Suich wrote:
>     The purpose of canRevertSnapshot() is not to guard revertSnapshot(), but to determine/query
whether the driver supports reverting so that the UI can appropriately show/hide the revert
snapshot action. I guess in my mind, canRevertSnapshot() should not check things like the
state of the snapshot, but things like, is the base volume still on the associated storage
(since the base volume could have been moved to a new primary), etc.
> 
> John Burwell wrote:
>     I think we need to find a more generic way to advertise driver capabilities.  Reviewing
the rest of the patch, it appears that this method is being used for more than simple driver
capability checking.  I need to think through the best approach, but as a strawman, we could
consider the following:
>     
>         public interface SnapshotCapability { ... }
>         
>         public enum DefaultSnapshotCapabilities implements StorageCapability {
>             /* other capabilities */
>             REVERT;
>         }
>     
>     This approach would include extracting all of the snapshot operations to a separate
Snapshotable interface that would define all of the snapshot operations plus Set<SnapshotCapability>
getSnapshotCapabilities().  Drivers that support snapshotting would implement the Snapshotable
interface in addition to the PrimaryDataStoreDriver interface.  A simple instanceof check
determines whether or not a device can be snapshotted and a query of the getSnapshotCapabilities
method to answer more fine grained questions.  The API layer can then translate those capabilities
into specific boolean values or we can pass the whole set back to UI for decision making.
>     
>     Additionally, does this method refer to hypervisor or storage-level snapshots?  If
it refers to hypervisor snapshots, why is the determination about the ability to revert being
made by a storage driver?
> 
> Chris Suich wrote:
>     Are there any other capabilities that would be included in this (other than take
snapshot)?
>     
>     This refers only to storage-level snapshots. However, a snapshot strategy overrides
the hypervisor snapshot capabilities. So, hypervisor snapshots can be replaced with storage-level
snapshots if the storage driver supports it.
> 
> Chris Suich wrote:
>     At this level, though, we're not talking about driver capabilities. We're talking
about strategy capabilities. So having the storage driver implement Snapshotable won't do
much, because the snapshot strategies are the ones that need to be asked to handle operations.
> 
> John Burwell wrote:
>     Based on your comments, I think we need to rethink the approach a bit.  The storage
layer should *not* be concerned with the optimal snapshot strategy for a hypervisor/virtual
machine instance/volume etc.  It should simply expose a snapshot capability and let the hypervisor
layer determine the best approach.  Therefore, I think the whole notion of sorting strategies
in the storage layer needs to pushed up to the hypervisor layer where this determination should
be made.
>     
>     In terms of capability advertisement, the snapshot capability of a storage device
will net down to the capabilities of the underlying driver.  Therefore, there must be some
consider of the driver model when discussing higher-level storage operations.

There is a problem with that approach. The snapshot workflow of some storage vendors, myself
included, does not match that of CloudStack. Because of this, we need our own SnapshotStrategy
to handle our workflow. We should not be pushing snapshot strategies to the hypervisor layer
because it is not just hypervisors that have snapshot strategies.

I will submit a patch tomorrow that removes the changes that were made to the storage drivers.
In hindsight, they were unnecessary. However, I believe these changes pertaining to the snapshot
strategy are necessary, either in this form or another. How else can we enable plugins to
become the authorities on how to manage snapshots and ensure their implementations are invoked
instead of the hypervisor/default?

Were you able to get understand the objective and approach of this through the dev list discussion?
I'm frustrated that these issues weren't brought up until after the code has been written
and submitted for review when there has been discussion that I thought explained all of this
well before this review was posted.


> On Oct. 15, 2013, 2:41 p.m., John Burwell wrote:
> > engine/storage/snapshot/src/org/apache/cloudstack/storage/snapshot/SnapshotObject.java,
line 135
> > <https://reviews.apache.org/r/14522/diff/2/?file=364645#file364645line135>
> >
> >     What is the purpose of sorting the strategies?
> 
> Chris Suich wrote:
>     The purpose of sorting the strategies is to ensure that plugin strategies get priority
over the hypervisor implementations and default implementations. Additionally, it will also
make sure that plugin strategies that cannot handle the operation are at the end of the list,
to ensure that hypervisor or default strategies would be given the operation instead of the
wrong plugin.
> 
> John Burwell wrote:
>     This priority sorting should be completely encapsulated.  Requiring clients to "know"
leaks an important implementation detail that will lead to inconsistent behavior (i.e. bugs).
 One approach would be to use a SortedSet with a Comparator implementation that encapsulates
the logic.
> 
> Chris Suich wrote:
>     Encapsulated where? The list of strategies needs to be sorted based on the object
and operation.
> 
> John Burwell wrote:
>     My concern was that properly using a strategy list required every consumer of the
list to "know" that it needed to be sorted.  I want to see this type of logic encapsulated
such that consumers of the list can safely assume it is ordered properly.
>     
>     Based on my evolving understanding of the sort operation's purpose, I think we need
to revisit this approach.

The order of the list is extremely dynamic. It depends on both the object being operated on
and the operation to perform. How would you suggest we always ensure the list is ordered properly
when there is no authority managing that list?


> On Oct. 15, 2013, 2:41 p.m., John Burwell wrote:
> > engine/storage/integration-test/test/org/apache/cloudstack/storage/test/SnapshotTest.java,
line 437
> > <https://reviews.apache.org/r/14522/diff/2/?file=364644#file364644line437>
> >
> >     *This comment may be more editorial at this point because of the nature of the
existing code*
> >     I would prefer to see an exception thrown here rather than the complexity of
the canHandle.  I am also concerned that the canHandle method only checks that the underlying
device can take a snapshot without accounting for whether or not the volume is in a state
to take a consistent snapshot.  I am also concerned that we don't have a semantic for handling
errors that might result from the actual snapshot operation.
> 
> Chris Suich wrote:
>     That is actually the purpose of the canHandle() method - to ONLY check whether the
underlying device can handle the operation, not whether the storage is in the right state.
For example, if my driver can handle snapshotting, but I am unable to actually perform the
snapshot, I still want to be the one asked to take the snapshot so that I am the one to cause
the API to fail.
>     
>     Additionally, I'm not sure how we could exclude other strategies from attempting
to take my storage and try to snapshot. If there are two snapshot strategies and they are
not sorted by canHandle(), then some other strategy may be given the opportunity to take the
snapshot and simply fail the API because it cannot actually handle it.
> 
> John Burwell wrote:
>     This logic seems overly complicated.  There can be a number of reasons why a snapshot
operation will not succeed for a device.  Among those reasons is that the device is being
asked to perform an operation it does not support.  As I understand the operation, a snapshot
is a operation directed to a particular volume, Volume A.  When the storage layer is asked
to snapshot Volume A, the work will eventually be performed by the associated device driver,
Device Driver 1.  If Device Driver 1 does not support the requested operation, then the snapshot
operation fails.  There are not other options in this scenario.  Provided my understanding
is correct, I don't see the need for an additional canHandle method.  Simply call the method,
and if it doesn't throw an exception, the operation succeeded.  I am concerned that the current
interface seems to only concern itself with whether or not the driver can do something to
the exclusion of whether or not the device is in a state to properly p
 erform the operation.
> 
> Chris Suich wrote:
>     Ok, I think I see what you are saying. Once the strategies are sorted (by calling
canHandle()), then we no longer need to check whether the strategy can handle the operation
and should simply execute the operation on the top sorted strategy.
>     
>     Is that what you are saying?
> 
> John Burwell wrote:
>     Essentially, yes.  I also concerned with the atomicity of the operation once we properly
consider the state of the device.  By calling canHandle, and then performing the operation,
we creates a race condition.  Combining the check with the operation allows the system to
atomicity check and perform the operation -- eliminating races with other threads performing
the same operation.

How can we determine which strategy to use if we do not check whether the strategy can handle
an operation? The canHandle method should not be doing some dynamic check of the state of
a the storage to determine whether it can handle an operation. It should simply indicate whether
a strategy supports the static operation. If there is a better way to approach this, then
that is fine, but the strategies should be able to indicate that. Again, this has nothing
to do with the underlying storage. Storage drivers that want to be baked into the default
snapshotting workflow can implement the driver level takeSnapshot() and revertSnapshot() method.


> On Oct. 15, 2013, 2:41 p.m., John Burwell wrote:
> > engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/StrategyPriority.java,
line 46
> > <https://reviews.apache.org/r/14522/diff/2/?file=364641#file364641line46>
> >
> >     How does the result of the canHandle method drive the comparison of two Snapshots?
 Why not just compare the ops to each other other?
> 
> Chris Suich wrote:
>     The 'op' something I introduced to give strategies more flexibility in their canHandle()
method. The purpose of sorting the strategies by canHandle() is to ensure that strategies
that cannot handle the 'op' are at the end of the list while plugins are at the front of the
list (with hypervisors and default strategies in between).
> 
> John Burwell wrote:
>     While a valid approach for UI, it does not seem necessary in the bowels of the driver
code.  Given the number of drivers, it would be quicker to simply spin the list once than
sort and iterate.  As a knock-on, it will also simplify the method.
> 
> Chris Suich wrote:
>     The issue with this is that the response may not be the same for every snapshot for
every driver. For example, if I have the same driver applied to multiple primary storages
and there are snapshots for the same volume on each primary storage, then the only snapshots
which are revertable are those on the primary storage where the volume currently resides.
Some storage drivers may support reverting snapshots from other primary storages, some may
not.
> 
> John Burwell wrote:
>     See my comments above.  Now that I understand the purpose of this comparator, it
feels like something that is better placed in the hypervisor layaer rather the storage layer.
>     
>     
>

I'm not sure where this hypervisor layer you're talking about is. When requests come in for
an operation like snapshotting, it goes straight to the manager, which dispatches the request
to a strategy. Snapshot strategies provide alternatives to the hypervisor/default workflow.


> On Oct. 15, 2013, 2:41 p.m., John Burwell wrote:
> > server/src/com/cloud/api/ApiResponseHelper.java, line 461
> > <https://reviews.apache.org/r/14522/diff/2/?file=364650#file364650line461>
> >
> >     If the snapshot was instance of SnapshotInfo and no SnapshotInfo was found,
an exception should thrown.
> >     
> >     Provided my assumption is correct, I suggest refactoring to a structure such
as the following:
> >     
> >     final SnapshotInfo secondaryInfo = (snapshot instanceof SnapshotInfo) ? (SnapshotInfo)
snapshot : snapshotfactory.getSnapshot(snapshot.getId(), DataStoreRole.Image);
> >     
> >     if (secondaryInfo == null) {
> >        throw new CloudRuntimException(/* Appropriate error message */ );
> >     }
> >     
> >     snapshotResponse.setCanRevert(secondaryInfo.canRevert);
> 
> Chris Suich wrote:
>     Is this how we do things in other APIs? The reason I didn't thrown an exception is
that I'd be worried failing an entire API call just because a single snapshot appears to have
a broken DB-file on disk relationship. There is probably a better way to go about cleaning
it up, but is that worth failing the API over?
> 
> John Burwell wrote:
>     It's a question of fail fast or fail early.  If the snapshot data is corrupt then
the operation should not continue.  Failing quickly will provide the operator with a clear
explanation of the failure rather than appearing to be successful, but getting a corrupt result.
 From a development perspective, we get earlier, more accurate feedback about database corruption.
> 
> Chris Suich wrote:
>     Agreed - that is appropriate for an operator/admin. But what about the end user?
If I have a VM and all of a sudden my list of snapshots simply won't load, what can I do about
it? Instead, the list could load successfully, but not return that result, mark it as not
revertable, etc.
> 
> John Burwell wrote:
>     Failing slowly hurts the end user the most because their backup data is corrupt.
 Therefore, we should surface an error message to the end user that causes them their operator/admin.
 This type of error should not occur when the system is operating.  In the rare event it does
happen, we need to inform the end user as quickly as possible to prevent further corrupt of
the snapshot chain.

If this is the approach the community agrees with, I have no problem with that. I just hadn't
seen a scenario quite like this, so I went with what appeared to be the least destructive
case - I have no strong feelings either way.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14522/#review27021
-----------------------------------------------------------


On Oct. 14, 2013, 10:50 p.m., Chris Suich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14522/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2013, 10:50 p.m.)
> 
> 
> Review request for cloudstack, Brian Federle, edison su, John Burwell, and Mike Tutkowski.
> 
> 
> Repository: cloudstack-git
> 
> 
> Description
> -------
> 
> After the last batch of work to the revertSnapshot API, SnapshotServiceImpl was not tied
into the workflow to be used by storage providers. I have added the logic in a similar fashion
to takeSnapshot(), backupSnapshot() and deleteSnapshot().
> 
> I have also added a 'Revert to Snapshot' action to the volume snapshots list in the UI.
> 
> 
> Diffs
> -----
> 
>   api/src/org/apache/cloudstack/api/ApiConstants.java f85784b 
>   api/src/org/apache/cloudstack/api/response/SnapshotResponse.java e9cb109 
>   engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/PrimaryDataStoreDriver.java
b124d83 
>   engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/SnapshotInfo.java
8d6b760 
>   engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/SnapshotStrategy.java
e4cecb6 
>   engine/api/src/org/apache/cloudstack/engine/subsystem/api/storage/StrategyPriority.java
8bd04f5 
>   engine/api/test/org/apache/cloudstack/engine/subsystem/api/storage/StrategyPriorityTest.java
329b99f 
>   engine/storage/integration-test/test/org/apache/cloudstack/storage/test/FakePrimaryDataStoreDriver.java
810afd1 
>   engine/storage/integration-test/test/org/apache/cloudstack/storage/test/SnapshotTest.java
81f77d6 
>   engine/storage/snapshot/src/org/apache/cloudstack/storage/snapshot/SnapshotObject.java
3f35e1d 
>   engine/storage/snapshot/src/org/apache/cloudstack/storage/snapshot/XenserverSnapshotStrategy.java
6a874d6 
>   plugins/storage/volume/default/src/org/apache/cloudstack/storage/datastore/driver/CloudStackPrimaryDataStoreDriverImpl.java
3eaeb1f 
>   plugins/storage/volume/sample/src/org/apache/cloudstack/storage/datastore/driver/SamplePrimaryDataStoreDriverImpl.java
ece7b26 
>   plugins/storage/volume/solidfire/src/org/apache/cloudstack/storage/datastore/driver/SolidfirePrimaryDataStoreDriver.java
8046b6c 
>   server/src/com/cloud/api/ApiResponseHelper.java f4ca112 
>   server/src/com/cloud/storage/snapshot/SnapshotManagerImpl.java dade983 
>   ui/scripts/storage.js b16f4d4 
> 
> Diff: https://reviews.apache.org/r/14522/diff/
> 
> 
> Testing
> -------
> 
> I have tested all of this locally with a custom storage provider.
> 
> Unfortunately, I'm still in the middle of figuring out how to properly unit test this
type of code. If anyone has any recommendations, please let me know.
> 
> 
> Thanks,
> 
> Chris Suich
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message