cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Burwell <>
Subject Re: [MERGE]object_store branch into master
Date Thu, 30 May 2013 14:43:04 GMT

Please see my comment in-line.


On May 29, 2013, at 5:15 PM, Edison Su <> wrote:

>> -----Original Message-----
>> From: Chip Childers []
>> Sent: Wednesday, May 29, 2013 1:41 PM
>> To:
>> Subject: Re: [MERGE]object_store branch into master
>> On Tue, May 28, 2013 at 11:42:49PM +0000, Edison Su wrote:
>>>> -----Original Message-----
>>>> From: John Burwell []
>>>> Sent: Tuesday, May 28, 2013 8:43 AM
>>>> To:
>>>> Subject: Re: [MERGE]object_store branch into master
>>>> All,
>>>> I have gone a through a large chunk of this patch, and published my
>> review
>>>> thus far (   TL;DR is that this patch
>> has a
>>>> number of significant issues which can be summarized as follows:
>>> Thanks for your time to review this patch.
>>>> 1. While it appeas that the requirement of NFS for secondary storage
>>>> has largely been removed, it has basically been if blocked out
>>>> instead of pushed down as an detail choice in the physical layer.
>>>> Rather than exploiting polymorpish to vary behavior through a set of
>>>> higher level abstracttions, the orchestration performs instanceof
>>>> NFSTO checks. The concern is that future evolution of secondary storage
>> layer will still have dependencies on NFS.
>>> As long as mgt server doesn't explicitly depend on nfs secondary storage
>> during the orchestration, and the nfs storage can be used based on different
>> configurations, then the resource side "NFSTO checks" is not much concern
>> to me at the current stage. We can add a " polymorpish high level
>> abstraction"  at the resource side, as you suggested before, in the future.
>> The current code change is already so huge now, add a new level a
>> abstraction will make the change even much bigger. But adding this kind of
>> "high level abstraction" is not hard any more, as all the storage related
>> commands, are only dealing with DataTO/DataStoreTO. Right now, there is
>> no read/write data operations on this *TO, one possible way to add a "high
>> level abstraction" would be add one read/write data operations at each
>> hypervisor resource based on the DataTO and DataStoreTO.
>>> My point is that refactor mgt server code is much much harder than the
>> resource code. If we can get the mgt server right, then we can move on to
>> refactor resource code, in the future.

Due to their impact on dependent subsystems and lower layers, these types of design questions
need to be addressed before a merge to master. 

>>>> 2. In some sceanrios, NFS is still a requirement for secondary
>>>> storage.  In particular, Xen users will still have to maintain an
>>>> "NFS cache".  Given the degradation of capability, I think it would
>>>> be helpful to put a few more eyeballs on the Xen implementation to
>>>> determine if we could avoid using an intermediate file system.
>>> Nfs will not disappear soon, for some hypervisors, for some type of
>> storages. But as long as the mgt server doesn't have this requirement, then
>> people can add new type of hypervisors(e.g. hyperv), new type of
>> storages(ceph, solidfire etc) which don't need NFS. The optimization for a
>> particular hypervisor can be done by other people, who may be familiar with
>> some type of storage, or a certain hypervisor.

It feels like we have jumped to a solution without completely understanding the scope of the
problem and the associated assumptions.  We have a community of hypervisor experts who we
should consult to ensure we have the best solution.  As such, I recommend mailing the list
with the specific hypervisors and functions that you have been unable to interface to storage
that does not present a filesystem.  I do not recall seeing such a discussion on the list

>>>> 3. I have the following concerns regarding potential race conditions
>>>> and resource exhaustion in the NFS cache implementation.
>>>>    - The current random allocator algorithm does not account for
>>>> the amount space that will be required for the operation (i.e.
>>>> checking to ensure that the cache it picks has enough space to
>>>> transfer to hold the object being
>>>> downloaded) nor does it reserve the space.Given the long (in compute
>>>> time) running nature of these of processes, the available space in a
>>>> cache could be exhausted by a number of concurrently transfering
>>>> templates and/or snapshots.  By reserving space before the transfer,
>>>> the allocator would be able to account for both pending operations
>>>> and the current contents of the cache.
>>> 1. It's the current behavior of 4.1 and master secondary storage code. We
>> didn't introduce new issue here.
>>> 2. Add a new cache storage allocator is much easier than adding it in master
>> branch.
>>> 3. I treat it as a bug, and plan to fix it after the merge.
>> We should not be merging in major changes with *known* bugs like this.
> first and foremost, it's a well known issue in master branch, don't matter merge the
branch or not.
> If we want to fix the issue, either fix it on object_store branch or on master branch,
but both will need time. 
> As the 4.2 merge window is approaching, I don't think we can fix it in these two or three
> If we missed the merge window, then who will fix it on master?

As I understand the goals of this enhancement, we will support additional secondary storage
types and removing the assumption that secondary storage will always be NFS or have a filesystem.
 As such, when a non-NFS type of secondary storage is employed, NFS is no longer the repository
of record for this data.  We can always exceed available space in the repository of record,
and the failure scenarios are relatively well understood (4.1.0) -- operations will fail quickly
and obviously.  However, as a transitory staging storage mechanism (4.2.0), the expectation
of the user is the NFS storage will not be as reliable or large.  If the only solution we
can provide for this problem is to recommend an NFS "cache" that is equal to the size of the
object store itself then we have little to no progress addressing our user's needs.  Fundamentally,
the role of the NFS is different in 4.2.0 than 4.1.0.  Therefore, I disagree with the assertion
that issue is present in 4.1.0.

An additional risk in the object_store implementation is that we lead a user to believe their
data has been stored in reliable storage (e.g. S3, Riak CS, etc) when it may not.  I saw no
provision in the object_store to retry transfers if the object_store transfer fails or becomes
unavailable.  In 4.0.0/4.1.0, if we can't connect to S3 or Swift, a background process continuously
retries the upload until successful. 

Finally, I see this issue as a design issue than a bug.  I don't think we should be merging
branches that we know have a defect let alone a design issue.  A proper implementation will
require refining/expanding the StorageAllocator interface, and introducing a mechanism to
track reservations (including proper TTL support).  IMHO, this these type of issues must resolved
before a feature can be merged into master.

>>>>    - There appears no mechanism to age out contents of the cache.
>>>> Therefore, as implemented, it will grow unbounded.  The only
>>>> workaround for this problem would be to have an NFS cache whose size
>>>> equals that of the object store.
>>> Admin can multiple nfs cache storages into one zone, and it's the current
>> behavior on master branch. Again, we can add aging policy later, as cache
>> storage has its own pom project, and it's own implementation.
>> That doesn't make much sense to me.  A cache is expected to be, by it's
>> nature, bounded by time / size / something...  We need to implement some
>> sort of bounds here.
>> It also shouldn't be a failure point if you have a cache miss or cache failure.
>> What behaviour occurs if there is only 1 and it's full?
> As I said before, it's the current behavior of master branch. On master branch, there
is no code the check the capacity of nfs secondary storage.
> I do have plan to fix it on object_store branch, but the code change is so huge, take
a lot of time to refactor other stuff, so my plan is to fix it later.

Given the different use of NFS in the object_store branch vs. current, I don't see the comparison
in this case.  In the current implementation, when we exhaust space, we are truly out of resource.
 However, in the object_store branch, we have no provision to remove stale data and we may
report no space available when there is plenty of space available in the underlying object
store.  In this scenario, the NFS "cache" becomes an artificial limiter on the capacity of
the system.  I do not understand how we have this problem in current since the object store
is only a backup of secondary store -- not secondary storage itself.

>>>>    - The mechanism lacks robust error handling or retry logic.  In
>>>> particular, the behavior if/when a cache exhausts available space appears
>> non-deterministic.
>>> Again, master branch will have the same issue, fix the issue will be easier
>> than fixing it on master branch.
>> Why not fix it now, since the cache mechanism is being touched?
> 1. Do need time to fix it. At least one week.
> 2. What's difference between, fix it on object_store branch, instead of after merge?
As this issue is not introduced by object_store branch.

It is my estimate robust error handling will require design changes (e.g. introduction of
a resource reservation mechanism, introduction of addition exception classes, enhancement
of interfaces to provide more context regarding client intentions, etc) yielding significant
code impact.  These changes need to undertaken in a holistic manner with minimum risk to master.
  Fundamentally, we should not be merging code to master with known significant issues.  When
it goes to master, we should be saying, "To the best of my knowledge and developer testing,
there are no blocker or critical issues."  In my opinion, omission of robust error handling
does not meet that standard. 

>>>>    - Generally, I see little consideration for
>>>> alternative/exception flows.  For example, what happens if I attempt
>>>> to use a template/iso/snapshot in transit to the object store
>>>> to/from the cache?  Since these files can be very large (multiple GBs), we
>> have assume some transfer latency in all interactions.
>>> Management server mainly maintains the state of each
>> object(volume/snapshot/template), we add state machine for all the objects,
>> and for all the objects on each storages, which is unavailable on the master
>> branch. Take copy template from object store to cache as an example, the
>> state of template entry on object storage and cache storage will be changed
>> to "Copying", if copying failed or time out, then the state of template entry
>> on object storage will be changed to "Ready", while the state of template
>> entry on cache storage will be changed to "Failed".
>>> As long as mgt server maintains the state correctly, user/admin can issue
>> the same operation through UI/API in case one operation is failed. That' s
>> one of reason, there is no retry logic in the storage subsystem code. Another
>> reason is that adding retry logic on top of async api will a little bit difficult.
>>>> 4. The Image Cache abstraction is too tightly coupled to the core
>>>> orchestration mechanism.  I would recommend implementing it as a
>>> The only place is coupled with image cache abstraction is at
>>> datamotionstrategy, in particular, AncientDataMotionStrategy -
>>> needCacheStorage and AncientDataMotionStrategy will call
>> StorageCacheManager-> createCacheObject if cache storage is needed.
>>>> DataStore proxy that is applied as necessary.  For example, the
>>>> current Xen hypervisor implementation must interact with a file
>>>> system.  It seems most appropriate that the Xen hypervisor
>>>> implementation would apply the proxy to its secondary storage
>>>> DataStore instances.  Therefore, the storage engine
>>> The issue that struggles me for a while is that, how and where to configure
>> cache storage. Take S3 as an example, the implementation of S3 provider
>> should not need to care about cache storage at all, S3 implementation should
>> only care about how to deal with data stored in S3. While S3 is a general
>> purpose storage, the same implementation can be used by different
>> hypervisors, and in different situations. For example,  If both Ceph and S3
>> used by KVM, then definitely, nfs cache is not needed, while for other
>> cluster wide primary storages supported by KVM, nfs cache is needed.
>>> How and where to present this kind of configuration, and also both primary
>> storage provider and object storage provider implementation don't need to
>> aware this kind of configuration? My current solution is that, write the
>> decision in the code, DataMotionStrategy is the place to decide to use cache
>> storage or not. In the default implementation, all the copy operations from
>> s3 to primary storage, cache storage is needed, while if people adding a new
>> storage, and want to remove cache storage for certain reason, then add a
>> new DataMotionStrategy.
>>> I am not sure how the datastore proxy will solve the issue I mentioned
>> above, I am glad to know your solution.
>> This looks like a good opportunity for some collaboration on the design!
> I am glad to know other solutions, or other people write code on the branch. 
> Code is always welcome.

Fundamentally, there should be no dependency from Storage to Hypervisor.  The scenario presented
here makes we wonder whether or not data motion belongs in the Storage or Hypervisor layers.
 Storage doesn't care if the I/O is for data motion.  It's only concern is the movement of
data from one location to another.  I am beginning to feel that DataMotion belongs in the
Hypervisor layer and the Storage layer should expose a more basic set of primitives that are
composed by the service to orchestrate the required I/O operations.  Can you shed more light
on the reasoning for putting this service in Storage rather Hypervisor?

To my mind, the storage services should present options that allow clients declare the requirements.
 For example, if the Xen hypervisor implement must perform an operation through a file system
interface, it should express that need to the storage subsystem.  In turn, the DataStore returned
will fulfill that contract.  From an implementation perspective, I suggest the cache mechanism
be implemented as proxy of the underlying DataStore implementation.  There are likely other
solutions -- I was merely trying to suggest approaches that would reduce the coupling for
the purposes of discussion.

>>>> should only provide the facility not attempt to determine when it is
>> needed.
>>>> Additionally, a proxy model would greatly reduce the size and
>>>> complexity of the various classes (e.g. AncientDataMotionStrategy).
>>> The current AncientDataMotionStrategy is only about 400 LOC now, I
>> removed some of the duplicated code. But still want to know the detail of
>> your solution.
>>>> 5. While I have only reviewed roughly 50-60% of the patch thus far,
>>>> I see little to no additional unit tests for the new code added.
>>>> Integration tests have
>>> If the code is changing rapidly, then writing/maintaining unit test cases will
>> take huge effort, we focused on the integration test and marvin test in this
>> stage.
>> We shouldn't be merging code that is "still changing rapidly".  What do you
>> mean by this comment?
> Oh, should be "was changing rapidly" in the last month, in May, we moved on to integration
test and marvin test and a lot of bug fixes.
>>>> been expanded, but many of the test simply execute the service
>>>> methods and do not verify the form of the operation results. There
>>>> are also a tremendous number of ignored exceptions that should likely
>> fail these tests.
>>>> Therefore, we could have tests passing due to an ignored exception
>>>> that should be failing.
>>> 1. Thanks, you take a look at the integration test, which is very convenient
>> for us during the implementation. It saves us a lot time during the test.
>>> 2. Yes, the integration test needed to be improved as your suggested.
>> Let's do that please!

Without asserts, the tests are simply exercising the code to ensure that no exceptions are
being thrown.  They are not verifying that the test actually succeeded.  Until all methods
properly verify the results of the call, it is difficult to objectively gage the quality of
the branch.  Additionally, the integration tests verify very few alternative and exception
flows.  I would like to see more of those scenarios exercised.

>>>> While I recognize the tremendous effort that has been expended to
>>>> implement this capability and value of the feature, I see
>>>> significant design and operational issues that must be addressed
>>>> before it can be accepted into
>>> I would say the object _store branch is just better than the master branch,
>> almost all the issues(related to cache storages) you pointed out, are existing
>> on the master branch. While fixing these issues on master branch is harder
>> than on object_store branch. Frankly speaking, I don't want to
>> read/understand/guess how it works on the master branch, it's a lot of hack
>> all around the place.
>>> Of course, these issues are still issues, we need to fix them, but when to fix
>> them is arguable. I prefer fixing them after the merge. Here is the reason:
>>> If we agreed on the issues you pointed out are existing on the master also,
>> then do we want to fix them on the 4.2 or not? If the answer is yes, then
>> who will fix them, and how long it will take to fix them on master? If I
>> propose an offer, that I'll fix these issues after merging object_store into
>> master, will it be acceptable? If the answer is no,  then there is no technical
>> reason to not merge object_store into master, as these issues are not
>> introduced by object_store branch.
>>>> master.  In my opinion, the object_store represents a good first
>>>> step, but it needs a few more review/refinement iterations before it
>>>> will be ready for a master merge.

I agree with Chip.  In addition to the issues I have mentioned, this is a massive change late
in the cycle for little overall gain.  We have not yet achieved the key goal our users want
-- reliable, horizontally scalable secondary storage on commodity hardware (i.e. cheaper per
GB with the same reliability). As such, I will take proven/released, "ugly" code over unproven/unreleased,
"pretty" code every time.  

>>>> Thanks,
>>>> -John

View raw message