cloudstack-dev mailing list archives

From John Burwell <>
Subject Re: [DISCUSS] NFS cache storage issue on object_store
Date Wed, 05 Jun 2013 20:04:29 GMT

You have provided some great information below, which helps greatly to understand the role
of the "NFS cache" mechanism.  To summarize, this mechanism is currently required only for
Xen snapshot operations driven by Xen's coalescing operations.  Is my understanding correct?
 Just out of curiosity, is there a Xen expert on the list who can provide a high-level description
of the coalescing operation -- in particular, the way it interacts with storage?  I have Googled
a bit and found very little information about it.  Has the object_store branch been tested
with VMware and KVM?  If so, what operations on these hypervisors have been tested?

In reading through the description below, my operational concerns remain regarding potential
race conditions and resource exhaustion.  Also, in reading through the description, I think
we should find a new name for this mechanism.  As Chip has previously mentioned, a cache implies
the following characteristics:

    1. Optional: Systems can operate without caches, just more slowly.  However, without this
mechanism, snapshots on Xen will not function.
    2. Volatility: Caches are backed by durable, non-volatile storage.  Therefore, if the
cache's data is lost, it can be rebuilt from the backing store and no data will be permanently
lost from the system.  However, this mechanism contains snapshots in transit to an object
store.  If the data contained in this "cache" were lost before its transfer to the object
store completed, the snapshot data would be lost.

In order to set expectations with users and better frame our design conversation, I think
it would be appropriate to name this mechanism a staging, scratch, or temporary area.  I also recommend
removing the notion of NFS from its name, as NFS is only the initial implementation of this mechanism.  In
the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.

In terms of solving the potential race conditions and resource exhaustion issues, I don't
think an LRU approach will be sufficient because the least recently used resource may
still be in use by the system.  I think we should look to a reservation model with reference counting
where files are deleted once no processes are accessing them.  The following is a (handwave-handwave)
overview of the process I think would meet these requirements:

	1. Request a reservation for the maximum size of the file(s) that will be processed in the
staging area.
		- If the file is already in the staging area, increase its reference count
		- If the reservation can not be fulfilled, we can either drop the process in a retry queue
or reject it.  
	2. Perform work and transfer file(s) to/from the object store
	3. Release the file(s) -- decrementing the reference count.  When the reference count is
<= 0, delete the file(s) from the staging area
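
A minimal sketch of the three steps above, in Java. The class name StagingArea and its
methods (acquire, release, availableBytes) are hypothetical and not part of CloudStack;
capacity is tracked purely in memory, and the actual file deletion is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical staging area that bounds space with reservations and reference counts. */
final class StagingArea {
    private static final class Entry {
        final long reservedBytes;
        int refCount = 1;

        Entry(long reservedBytes) {
            this.reservedBytes = reservedBytes;
        }
    }

    private final long capacityBytes;
    private long reservedBytes;
    private final Map<String, Entry> entries = new HashMap<>();

    StagingArea(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /**
     * Step 1: reserve maxBytes for a file, or share the copy already staged.
     * Returns false when the reservation cannot be fulfilled; the caller then
     * decides whether to queue a retry or reject the work.
     */
    synchronized boolean acquire(String fileKey, long maxBytes) {
        Entry existing = entries.get(fileKey);
        if (existing != null) {
            existing.refCount++;
            return true;
        }
        if (maxBytes < 0 || maxBytes > capacityBytes - reservedBytes) {
            return false;
        }
        entries.put(fileKey, new Entry(maxBytes));
        reservedBytes += maxBytes;
        return true;
    }

    /**
     * Step 3: drop one reference. Returns true when the last reference is gone,
     * which is the signal to delete the file from the staging area.
     */
    synchronized boolean release(String fileKey) {
        Entry entry = entries.get(fileKey);
        if (entry == null) {
            throw new IllegalStateException("not staged: " + fileKey);
        }
        if (--entry.refCount > 0) {
            return false;
        }
        entries.remove(fileKey);
        reservedBytes -= entry.reservedBytes;
        return true;
    }

    synchronized long availableBytes() {
        return capacityBytes - reservedBytes;
    }
}
```

Step 2 (the actual work and the transfer to or from the object store) happens between
acquire and release. Because space is reserved up front, a file that is still referenced
is never evicted to make room, which is the gap I see in a plain LRU policy.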

We would also likely want to consider a TTL to purge files after a configurable period of
inactivity as a backstop against crashed processes failing to properly decrement the reference
count.  In this model, we will either defer or reject work if resources are not available,
and we properly bound resources.  
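
The TTL backstop could look roughly like the following Java sketch. IdleFilePurger, touch,
and purgeExpired are hypothetical names; timestamps are passed in explicitly so the logic
is easy to test, and a real implementation would also delete the purged files and release
their reservations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker that purges staged files idle for at least ttlMillis. */
final class IdleFilePurger {
    private final long ttlMillis;
    private final Map<String, Long> lastActivityMillis = new HashMap<>();

    IdleFilePurger(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Record activity on a staged file (for example on each acquire or release). */
    synchronized void touch(String fileKey, long nowMillis) {
        lastActivityMillis.put(fileKey, nowMillis);
    }

    /** Forget every file whose last activity is at least ttlMillis old and return their keys. */
    synchronized List<String> purgeExpired(long nowMillis) {
        List<String> purged = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastActivityMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= ttlMillis) {
                purged.add(entry.getKey());
                it.remove();
            }
        }
        return purged;
    }
}
```

A periodic task would call purgeExpired with the current time. The TTL has to be longer
than the slowest legitimate operation, otherwise a live process could lose its file.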

Finally, in terms of decoupling the decision to use this mechanism by hypervisor plugins
from the storage subsystem, I think we should expose methods on the secondary storage services
that allow clients to explicitly request or create resources using files (i.e.
instead of streams (e.g. createXXX(File) or readXXXAsFile).  These interfaces would provide
the storage subsystem with the hint that the client requires file access to the requested resource.
  For object store plugins, this hint would be used to wrap the resource in an object that
would transfer it in and out of the staging area.
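
To illustrate the wrapping idea for the read direction, here is a rough Java sketch. All
names in it (FileResource, ObjectTransfer, StagedFileResource) are hypothetical and only
show the shape of such an interface; ObjectTransfer stands in for whatever S3 or Swift
client call would actually fetch the object.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical handle that gives a client file access to a stored resource. */
interface FileResource extends AutoCloseable {
    Path path();

    @Override
    void close() throws IOException;
}

/** Hypothetical stand-in for the object store download call. */
interface ObjectTransfer {
    void download(String key, Path target) throws IOException;
}

/** Stages an object into a local directory on open and removes it on close. */
final class StagedFileResource implements FileResource {
    private final Path stagedFile;

    StagedFileResource(ObjectTransfer transfer, String key, Path stagingDir) throws IOException {
        this.stagedFile = Files.createTempFile(stagingDir, "staged-", ".tmp");
        try {
            transfer.download(key, stagedFile);
        } catch (IOException | RuntimeException e) {
            // Do not leave a partial file behind if the transfer fails.
            Files.deleteIfExists(stagedFile);
            throw e;
        }
    }

    @Override
    public Path path() {
        return stagedFile;
    }

    @Override
    public void close() throws IOException {
        Files.deleteIfExists(stagedFile);
    }
}
```

A readXXXAsFile method on an object store plugin could return such a handle, and the
client would use it in a try-with-resources block. A file-based secondary store could
return a handle that points at the existing file and does nothing on close.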


On Jun 3, 2013, at 7:17 PM, Edison Su <> wrote:

> Let's start a new thread about NFS cache storage issues on object_store.
> First, I'll go through how NFS storage works on master branch, then how it works on object_store
branch, then let's talk about the "issues".
> 0.       Why do we need NFS secondary storage? NFS secondary storage is used as a place
to store templates, snapshots, etc. It is zone-wide, and it is supported by most hypervisors (except
Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage such as
S3/Swift, CloudStack added support for Swift in 3.x and for S3 in 4.0. You may wonder: if S3/Swift
is used as the place to store templates/snapshots, why do we still need NFS secondary storage?
> There are two reasons for that:
> a.       CloudStack storage code is tightly coupled with NFS secondary storage, so when
adding Swift/S3 support, the shortcut was to leave NFS secondary storage as it is.
> b.      Certain hypervisors, and certain storage-related operations, cannot directly
operate on object storage.
> Examples:
> b.1 Backing up a snapshot (taken on the XenServer hypervisor) from primary
storage to S3.
> If there are snapshot chains on the volume, and we want to coalesce the snapshot chains
into a new disk and then copy it to S3, we must coalesce the snapshot chains either on primary storage
or on an extra storage repository (SR) supported by XenServer.
> If we coalesce on primary storage, we may exhaust it, as the coalesced
new disk may need a lot of space (the new disk will contain all the content
from the leaf snapshot all the way up to the base template), and the primary storage is not planned
for this operation (the CloudStack management server is unaware of this operation, so it may think
the primary storage still has enough space to create volumes).
> XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to
use other storage supported by XenServer; that's why NFS storage is used during
snapshot backup. So we first call the XenServer API to coalesce the snapshot
to NFS storage, then copy the newly created file into S3. This is what we do on both the master
branch and the object_store branch.
> b.2 Creating a volume from a snapshot when the snapshot is
stored on S3.
> If the snapshot is a delta snapshot,
we need to coalesce it and its parents into a new volume. We can't coalesce snapshots directly on S3, AFAIK,
so we have to download the snapshot and its parents somewhere, then coalesce them with
XenServer's tools. Again, there are two options: download all the snapshots into
primary storage, or download them into NFS storage:
> If we download all the snapshots into
primary storage directly from S3, we first need to find a way to import snapshots from S3 into
primary storage (if primary storage is a block device, this needs extra care) and then coalesce
them. If we go this way, we need to find a primary storage with enough space, and even worse,
if the primary storage is not zone-wide, we may later need to copy the volume from
one primary storage to another, which is time-consuming.
> If we download all the snapshots into
NFS storage from S3, we coalesce them there and then copy the volume to primary storage. As
NFS storage is zone-wide, you can copy the volume into whatever primary storage you need without an
extra copy. This is what we do in the master branch and the object_store branch.
> b.3 Some hypervisors, or some storage systems, do not support directly
importing a template into primary storage from a URL. For example, if Ceph is used as primary storage,
importing a template into RBD requires transforming a Qcow2 image into a RAW disk, then into RBD
format 2. To transform a Qcow2 image into a RAW disk, you need an extra file
system, either a local file system (this is what other stacks do, which does not seem scalable to
me) or NFS storage (this is what can be done on both master and object_store). Or one can
modify the hypervisor or storage to support directly importing a template from S3 into RBD. Here is
the link(, that Wido
> Anyway, there are many combinations of hypervisors and storage: for
some hypervisors with zone-wide, file-system-based storage (e.g. KVM + Gluster/NFS as primary
storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which
can import a template from a URL regardless of which storage you are using, you don't need
extra NFS storage. But if you are using XenServer, you will need NFS storage in order to create a volume from a delta
snapshot, and if you are using KVM + Ceph, you may also need
NFS storage.
> For the above reasons, NFS cache storage is needed in certain cases if S3
is used as secondary storage. The combinations of hypervisors and storage are quite complicated, so
whether to use cache storage should be decided case by case. But as long as CloudStack provides a
framework that gives people the choice to enable or disable cache storage on their own, I think
the framework is good enough.
> 1.       Then let's talk about how NFS storage works on master branch, with or without
> If S3 is not used, here is how NFS storage is used:
> 1.1   Register a template/ISO: CloudStack downloads the template/ISO into NFS storage.
> 1.2   Backup snapshot: CloudStack sends a command to the XenServer hypervisor, which issues the vdi.copy
command to copy the snapshot to NFS; for KVM, it directly uses "cp" or "qemu-img convert" to copy
the snapshot into NFS storage.
> 1.3   Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain
on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img
convert" to copy the snapshot from NFS storage to primary storage.
> If S3 is used:
> 1.4   Register a template/ISO: download the template/ISO into NFS storage first; then
a background thread regularly uploads the template/ISO from NFS storage into S3.
The Ready state only means the template is stored on NFS storage; the admin
doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones,
CloudStack will copy the template from one zone-wide NFS storage into another NFS storage
in another zone, even though a region-wide S3 is already available. As the template is
not directly uploaded to S3 when registering it, it takes several copies
to spread the template region-wide.
> 1.5   Backup snapshot: CloudStack sends a command to the XenServer hypervisor to copy the snapshot
to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The Backedup state
means the snapshot is not only in NFS storage but also
stored on S3.
> 1.6   Create volume from snapshot: download the snapshot and its parent snapshots from
S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
> 2.       Then let's talk about how it works on object_store:
> If S3 is not used, there is ZERO change from the master branch. NFS secondary storage
works on object_store the same way it did before.
> If S3 is used, and NFS cache storage is also used (which is the default):
>   2.1 Register a template/ISO: the template/ISO is uploaded directly to S3; there is
no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is
stored on S3. This implies that the template is immediately available in the
region as soon as it's in the Ready state. The admin can also clearly see the status of the template
on S3: what percentage has been uploaded, and whether it failed or succeeded. Also, if registering the template
fails for some reason, the admin can issue the register template command again. I would say this
change in how templates are registered into S3 is far better than what we did on the master branch.
>   2.2 Backup snapshot: same as the master branch: send a command to the XenServer host to
copy the snapshot into NFS, then upload it to S3.
>   2.3 Create volume from snapshot: same as the master branch: download the snapshot
and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
> From the few typical use cases above, you can see how S3 and NFS cache storage are
used, and what the difference is between the object_store branch and the master branch: basically, we only
changed the way a template is registered, nothing else.
> If S3 is used, and no NFS cache storage is used (this is possible, depending on which data motion
strategy is used):
>    2.4 Register a template/ISO: same as 2.1
>    2.5 Backup snapshot: export the snapshot from primary storage into S3 directly
>    2.6 Create volume from snapshot: download the snapshots from S3 into primary storage directly,
then coalesce them and create the volume from the result.
> I hope the above explanation shows how the system really works on object_store
and clarifies the misconceptions/misunderstandings about the object_store branch. Even though the change
is huge, we still maintain backward compatibility. If you don't want to use S3 and only want
to use existing NFS storage, that's definitely OK; it works the same as before. If you want to use
S3, we provide a better S3 implementation when registering a template/ISO. If you want to use
S3 without NFS storage, that's also definitely OK; the framework is flexible enough to accommodate
different solutions.
> OK, let's talk about the NFS storage cache issues.
> The issue with NFS cache storage has been discussed in several threads, back and forth. All
in all, NFS cache storage is only one use case out of the three supported by the
object_store branch. It's not something whose issues stop everything else from working.
> Items 2.2 and 2.3 above show how NFS cache storage is involved in snapshot-related
operations. The complaints that there is no aging policy and no capacity planner for
NFS cache storage apply when downloading a snapshot from S3 into NFS, copying a snapshot
from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue:
the NFS cache storage can be used up if there is no capacity planner and no aging-out policy.
But can it be fixed? Is it a design issue?
> Let's talk about the code. Here is the code related to NFS cache storage; there is not much, and only one
class depends on NFS cache storage:;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
> Take copyVolumeFromSnapshot as an example, which is called when creating a volume from a
snapshot. It first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to
download the snapshot into NFS cache storage. StorageCacheManagerImpl-> createCacheObject
is the only place that creates objects on NFS cache storage; the code is at;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
> createCacheObject first finds a cache storage, in case there are multiple
cache storages available in a scope:
> DataStore cacheStore = this.getCacheStorage(scope);
> getCacheStorage calls StorageCacheAllocator to find a suitable NFS cache storage.
So StorageCacheAllocator is the place to choose NFS cache storage based on certain criteria.
The current implementation just chooses one of them at random; we can add a new allocator algorithm
based on capacity, etc.
> Regarding capacity reservation, there is already a table called op_host_capacity which
has an entry for NFS secondary storage. We can reuse this entry to store capacity information
about NFS cache storages (such as total size, available/used capacity, etc.). So on every
call to createCacheObject, we can call StorageCacheAllocator to find a suitable NFS storage
based on first-fit criteria, then increase the used capacity in the op_host_capacity table. If
creating the cache object fails, we return the capacity to op_host_capacity.
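
As a rough illustration of the first-fit idea in the paragraph above, here is a Java
sketch. CacheStoreCapacity and FirstFitCacheAllocator are hypothetical names, not the
object_store classes; the counters are kept in memory here, whereas the proposal would
persist them in op_host_capacity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory stand-in for one cache store's capacity record. */
final class CacheStoreCapacity {
    final String name;
    final long totalBytes;
    long usedBytes;

    CacheStoreCapacity(String name, long totalBytes) {
        this.name = name;
        this.totalBytes = totalBytes;
    }
}

/** Picks the first cache store with enough free space and books the space on it. */
final class FirstFitCacheAllocator {
    private final List<CacheStoreCapacity> stores;

    FirstFitCacheAllocator(List<CacheStoreCapacity> stores) {
        this.stores = new ArrayList<>(stores);
    }

    /** Returns the chosen store, or empty when no store can hold requestedBytes. */
    synchronized Optional<CacheStoreCapacity> allocate(long requestedBytes) {
        if (requestedBytes < 0) {
            return Optional.empty();
        }
        for (CacheStoreCapacity store : stores) {
            if (store.totalBytes - store.usedBytes >= requestedBytes) {
                store.usedBytes += requestedBytes;
                return Optional.of(store);
            }
        }
        return Optional.empty();
    }

    /** Gives capacity back, for example when creating the cache object fails. */
    synchronized void release(CacheStoreCapacity store, long bytes) {
        store.usedBytes = Math.max(0, store.usedBytes - bytes);
    }
}
```

The check and the increment happen under one lock so that two concurrent callers cannot
both claim the same free space; a database-backed version would need an equivalent
transactional update.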
> Regarding the aging-out policy, we can start a background thread on the management server which
scans all the objects created on NFS cache storage (the tables are snapshot_store_ref,
template_store_ref, and volume_store_ref). Each entry in these tables has a column called "updated";
every time the object's state changes, the "updated" column is updated as well.
When does the object's state change? Every time the object is used in some context (such
as copying the snapshot on NFS cache storage to somewhere else), the object's state changes
 accordingly, for example to "Copying", which means the object is being copied to some place. This is
exactly the information we need to implement an LRU algorithm.
> What do you guys think about the fix? If you have a better solution, please let me know.
