Subject: Re: [DISCUSS] NFS cache storage issue on object_store
From: John Burwell <jburwell@basho.com>
Date: Fri, 7 Jun 2013 17:38:56 -0400
To: dev@cloudstack.apache.org
Message-Id: <8D81A089-C160-48B5-BC44-358867847574@basho.com>
In-Reply-To: <77B337AF224FD84CBF8401947098DD87038B85@SJCPEX01CL01.citrite.net>

Edison,

Please see my comments in-line below.
Thanks,
-John

On Jun 6, 2013, at 6:43 PM, Edison Su wrote:

>> -----Original Message-----
>> From: John Burwell [mailto:jburwell@basho.com]
>> Sent: Thursday, June 06, 2013 7:47 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>>
>> Edison,
>>
>> Please see my comments in-line below.
>>
>> Thanks,
>> -John
>>
>> On Jun 5, 2013, at 6:55 PM, Edison Su wrote:
>>
>>>> -----Original Message-----
>>>> From: John Burwell [mailto:jburwell@basho.com]
>>>> Sent: Wednesday, June 05, 2013 1:04 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>>>>
>>>> Edison,
>>>>
>>>> You have provided some great information below which helps greatly to understand the role of the "NFS cache" mechanism. To summarize, this mechanism is only currently required for Xen snapshot operations driven by Xen's coalescing operations. Is my understanding correct? Just out of
>>>
>>> I think Ceph may still need the "NFS cache", for example, during delta snapshot backup: http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
>>> You need to create a delta snapshot into a file, then upload the file into S3.
>>>
>>> For KVM, if the snapshot is taken on qcow2, then we need to copy the snapshot into a file system, then back it up to S3.
>>>
>>> Another usage case for the "NFS cache" is to cache templates stored on S3 if there is no zone-wide primary storage. We need to download the template from S3 into every primary storage; if there is no cache, each download will take a while. Comparing downloading the template directly from S3 (if the S3 is region wide) with downloading from a zone-wide "cache" storage, I would say the download from zone-wide cache storage should be faster than from region-wide S3. If there is no zone-wide primary storage, then we will download the template from S3 several times, which is quite time consuming.
>>>
>>> There may be other places to use the "NFS cache", but the point is: as long as the mgt server can be decoupled from this "cache" storage, then we can decide when/how to use cache storage based on different kinds of hypervisor/storage combinations in the future.
>>
>> I think we would do well to re-orient the way we think about roles and requirements. Ceph doesn't need a file system to perform a delta snapshot operation. Xen, KVM, and/or VMWare need access to a file system to
>
> For the Ceph delta snapshot case, it's Ceph that has the requirement of a file system to perform the delta snapshot (http://ceph.com/docs/next/man/8/rbd/):
>
> export-diff [image-name] [dest-path] [--from-snap snapname]
> Exports an incremental diff for an image to dest path (use - for stdout). If an initial snapshot is specified, only changes since that snapshot are included; otherwise, any regions of the image that contain data are included. The end snapshot is specified using the standard --snap option or @snap syntax (see below). The image diff format includes metadata about image size changes, and the start and end snapshots. It efficiently represents discarded or 'zero' regions of the image.
>
> The dest-path is either a file or stdout; if using stdout, then it needs a lot of memory. If using the hypervisor's local file system, then the local file system may not have enough space to store the delta diff.
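Just to make the streaming idea concrete before my comments below, here is a rough sketch (the ObjectStoreClient interface and all names are invented for illustration, not code from the branch) of how the export-diff output could be pumped from stdout to an object store in bounded chunks instead of landing on a file system:

    import java.io.IOException;
    import java.io.InputStream;

    public final class RbdDiffStreamer {

        /** Hypothetical object store client -- stands in for whatever S3/Swift client is used. */
        public interface ObjectStoreClient {
            void putObject(String bucket, String key, InputStream data) throws IOException;
        }

        /**
         * Runs "rbd export-diff --from-snap <fromSnap> <image> -" and streams stdout
         * straight to the object store, so only a small fixed-size buffer is held in
         * memory instead of materializing the diff on a local file system.
         */
        public static void backupDelta(ObjectStoreClient store, String bucket, String key,
                                       String image, String fromSnap)
                throws IOException, InterruptedException {
            Process rbd = new ProcessBuilder("rbd", "export-diff", "--from-snap", fromSnap, image, "-")
                    .start();
            try (InputStream diff = rbd.getInputStream()) {
                // The client reads the stream in bounded chunks (e.g. a multipart upload),
                // so memory use stays constant regardless of the diff size.
                store.putObject(bucket, key, diff);
            }
            if (rbd.waitFor() != 0) {
                throw new IOException("rbd export-diff failed with exit code " + rbd.exitValue());
            }
        }
    }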
I apologize for failing to read more closely -- I mistakenly assumed you were referring to hypervisor snapshots. To my mind, if a local file system is needed by a storage driver to perform an operation, then it should be encapsulated within the driver's scope. The storage layer should provide a suitable interface for the driver to acquire/release a reservation to the staging/temporary area if it needs it.

For Ceph specifically, stdout can be pushed through a BufferedOutputStream and written straight to the object store -- skipping the file system. With this approach, we should be able to keep the memory required a fixed size and "pump" it out to the object store. Ideally, we would define the interfaces to provide InputStreams and OutputStreams -- creating the potential for the copy operation to be implemented in the orchestration code.

>
>> perform these operations. The hypervisor plugin should request a reservation of x size as a file handle from the Storage subsystem. The Ceph driver implements this request by using a staging area + transfer operation. This approach encapsulates the operation/rules around the staging area from clients, protects against concurrent requests flooding a resource, and allows hypervisor-specific behavior/rules to be encapsulated in the appropriate plugin.
>>
>>>> curiosity, is there a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage? I have Googled a bit, and found very little information about it.
>>>> Has the object_store branch been tested with VMWare and KVM? If so, what operations on these hypervisors have been tested?
>>>
>>> Both VMware and KVM are tested, but without S3 support. Haven't had time to take a look at how to use S3 with either hypervisor yet.
>>> For example, we should take a look at how to import a template from a URL into a VMware data store; that way we can eliminate the "NFS cache" during template import.
>>
>> Given the release extension and the impact of these tests on the implementation, we need to test S3 with VMWare and KVM pre-merge.
>
> I would like to hand over the implementation of S3 (directly using S3 without the NFS staging area) on both VMware and KVM to the community, or do it in the next release, or after the merge.
> The reason is simple: we need to get the mgt server part of the refactor done first; the hypervisor-side implementation or optimization can be done after the mgt server side refactor. I think what we are doing in the mgt server side refactor paves the way for this kind of optimization on the hypervisor side.

This begs a larger question for me -- why is the implementation hypervisor specific? Naively, it seems that fitting the current hypervisors for the new storage architecture would bring along this feature for nearly free. I remain concerned that we have not adequately decoupled the Hypervisor and Storage layers. As I have stated a few (thousand) times now, I am focused on breaking the circular dependency between the Hypervisor and Storage layers to avoid this type of feature stratification.

>
>>>> In reading through the description below, my operational concerns remain regarding potential race conditions and resource exhaustion. Also, in reading through the description, I think we should find a new name for this mechanism.
>>>> As Chip has previously mentioned, a cache implies the following characteristics:
>>>>
>>>> 1. Optional: Systems can operate without caches, just more slowly. However, with this mechanism, snapshots on Xen will not function.
>>>
>>> I agree on this one.
>>>
>>>> 2. Volatility: Caches are backed by durable, non-volatile storage. Therefore, if the cache's data is lost, it can be rebuilt from the backing store and no data will be permanently lost from the system. However, this mechanism contains snapshots in-transit to an object store. If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.
>>>
>>> It's the same thing for the file cache on a Linux file system. If the file cache has not been flushed to disk when the machine loses power, then the data in the file cache is lost.
>>> When we back up the snapshot from primary storage to S3, the snapshot is copied to the "NFS cache", then immediately copied from the "NFS cache" into S3. If the snapshot on the "NFS cache" is lost, then the snapshot backup fails. The user can issue another backup snapshot command in this case.
>>> So I don't think it's an issue.
>>
>> The window of opportunity for data loss from a file system sync is much narrower for the Linux filesystem than for this staging area. Furthermore, that risk can be largely (if not completely) mitigated with battery-backed hardware and/or conservative NFS settings.
>>
>> For this staging area, the object store may be unreachable for an extended period of time (minutes, hours). There are no cache flush settings or hardware solutions when it becomes unavailable. If the data is lost from the staging area, it will be gone. I think it is one of the largest issues with this approach, and we must be careful to ensure that data cannot be lost before it is transferred out.
>
> I agree. It's not that I want to use a staging area; it's the limitation of the hypervisor or storage, which can't directly transfer data in/out of S3 for some operations.
> I think we agree on the limitations and issues with the staging area, but that's the current reality.
> If we want to remove the staging area totally, we need more resources to look at what we can do for each hypervisor and each storage. We can't finish all of that in just one month.
> If other people are willing to help us in this area, I'll appreciate it.

I apologize if I haven't clearly expressed my recognition that we currently can't avoid the staging area in some circumstances. I want to ensure that we implement it in a robust manner that avoids introducing instability into implementations using object storage.

>
>>>> In order to set expectations with users and better frame our design conversation, I think it would be appropriate to describe this mechanism as a staging,
>>>
>>> Ok, it seems "cache" is confusing people; we can use another term, or document clearly what the role of this storage is.
>>> Yes, it's just a temporary file system, which can be used to store some temporary files.
>>>
>>>> scratch, or temporary area. I also recommend removing the notion of NFS from its name, as NFS is the initial implementation of this mechanism. In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.
>>>
>>> Agree, any storage can be used as "cache" storage.
>>> If you take a look at StorageManagerImpl->createCacheStore, it's nothing related to NFS.
>>>
>>>> In terms of solving the potential race conditions and resource exhaustion issues, I don't think an LRU approach will be sufficient because the least recently used resource may still be in use by the system. I think we should look to a reservation model with reference counting where files are deleted once no processes are accessing them. The following is a (handwave-handwave) overview of the process I think would meet these requirements:
>>>>
>>>> 1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
>>>>    - If the file is already in the staging area, increase its reference count
>>>>    - If the reservation can not be fulfilled, we can either drop the process in a retry queue or reject it.
>>>> 2. Perform work and transfer file(s) to/from the object store
>>>> 3. Release the file(s) -- decrementing the reference count. When the reference count is <= 0, delete the file(s) from the staging area
>>>
>>> I assume the reference count is stored in memory and inside the SSVM?
>>> The reference count may not work properly in the case of multiple secondary storage VMs and multiple mgt servers. And there may be a lot of places other than the SSVM that can directly use the cached object.
>>> If we store the reference count on a file system, then we need to take a lock (such as an NFS lock, or a lock file) to update it, and the lock can fail to be released due to all kinds of reasons (such as the network).
>>
>> We could implement reference counting in a number of ways. The first would be to increment a value in the database before command submission to the SSVM, and decrement it as part of answer processing. We could evaluate
>
> I agree, we can add a ref count column in template/volume/snapshot_store_ref, which can track how many readers of the cached object there are.
>
>> using a distributed framework such as Hazelcast (http://www.hazelcast.com) which provides a distributed countdown latch (http://www.hazelcast.com/docs/1.9.4/javadoc/com/hazelcast/core/ICountDownLatch.html) across the SSVMs. We need to avoid POSIX-style file
>
> Good to know.
>
>> system locks because they are not consistently implemented/available (e.g. OCFS2).
>>
>> My first brush thoughts on it would be to use a database table in 4.2, and evaluate adopting something like Hazelcast in 4.3. Personally, I would like to see us move away from relying on relational database semantics to implement distributed data structures (counters, locks, etc). However, given the time pressures, I don't think we have the time to properly evaluate the impact of adopting a more general purpose distributed framework in 4.2.
>
> I agree.
>
>> From a code perspective, I think it would behove us to implement a more functional approach to command execution in order to ensure reference counting, error handling, and resource management are handled in a consistent manner. I implemented such an approach in com.cloud.utils.db.GlobalLock#executeWithLock where locking around a particular operation is managed separately from the actual operation being performed.
>
> I'll take a look at your implementation.
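To make that shape concrete, here is a rough sketch (the ReservationManager interface and method names are invented for illustration; this is not the existing GlobalLock API) of a functional wrapper that acquires a staging reservation, runs the operation, and always releases the reference in a finally block:

    import java.util.concurrent.Callable;

    public final class StagingArea {

        /** Hypothetical reservation service -- could be backed by a DB column in 4.2
         *  or a distributed structure such as Hazelcast later. */
        public interface ReservationManager {
            /** Reserve up to maxSizeBytes for the object, or bump its ref count if it is
             *  already staged. Throws if capacity cannot be reserved. */
            void acquire(String objectUuid, long maxSizeBytes) throws InsufficientCapacityException;

            /** Decrement the ref count; the implementation deletes the staged file once it reaches 0. */
            void release(String objectUuid);
        }

        public static class InsufficientCapacityException extends Exception {
            public InsufficientCapacityException(String msg) { super(msg); }
        }

        private final ReservationManager reservations;

        public StagingArea(ReservationManager reservations) {
            this.reservations = reservations;
        }

        /** Mirrors the GlobalLock#executeWithLock idea: the bookkeeping around the
         *  operation is managed separately from the operation itself. */
        public <T> T executeWithReservation(String objectUuid, long maxSizeBytes, Callable<T> operation)
                throws Exception {
            reservations.acquire(objectUuid, maxSizeBytes);
            try {
                return operation.call();
            } finally {
                reservations.release(objectUuid);
            }
        }
    }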
>>>
>>> I thought about it yesterday, about how to implement LRU. Originally, I thought we could eliminate the race condition and track who is using objects stored on cache storage by using a state machine. For example, whenever the mgt server wants to use the cached object, the mgt server can change the state of the cached object to "Copying" (there is a DB entry for each cached object); after the copy is finished, change the state to "Ready", and also update the "updated" column. This eliminates the race condition, as only one thread can access the cached object and change its state. But the problem with this approach is that there are cases where multiple reader threads may want to read the cached object at the same time: e.g. copying the same cached template to multiple primary storages at the same time.
>>>
>>> In order to accommodate multiple readers, I am trying to add a new db table to track the users of the cached object.
>>> The flow will be like the following:
>>> 1. The mgt server wants to use the cached object: first, it needs to check the state of the cached object -- the state must be Ready.
>>> 2. The mgt server writes a db entry into the DB; the entry will contain the id of the cached object, the id of the cache storage, and the issued time. The db entry will also contain a state: the state can be initial/processing/finished/failed. The mgt server needs to set the state to "processing".
>>> 3. The mgt server finishes the operation related to the cached object, then marks the state of the above db entry as "finished", and also updates the time column of the above entry.
>>> 4. The above db entries will be removed if the state has not been "processing" for a while (let's say one week?), or if the entry has been in the "processing" state for a while (let's say one day). In this way, the mgt server can easily know which cached objects have or have not been used recently by looking at this db table.
>>> 5. If the mgt server finds a cached object has not been used (there is no db entry in the above table) for a while (let's say one week), then it changes the state of the cached object to "destroying", then sends a command to the ssvm to destroy the object.
>>> 6. There is a small window where the mgt server is changing the state of a cached object to "destroying" (there is no db entry in the "processing" state in the above table), while another thread is trying to copy it (as the cached object's state is still Ready); both DB operations will succeed. We can hold a DB lock on the cached object entry before both DB operations.
>>>
>>> What do you think?
>>
>> The issue remains that the least recently used (really accessed) object can still be in use by a running process. One example that pops to mind is a popular, large template that has a set of longish running processes creating from it. As I described above, I think you can change issued time to a reference count, and add logic to step 3 to decrement/check the count. With the proper transaction semantics, we provide sufficient consistency guarantees around a reference count.
>
> Agree. I only need to track how many readers are currently using the cached object. So a ref cnt is enough; I don't even need to create a new db table to track the ref cnt -- adding a new refcnt column on template/snapshot/volume_store_ref is good enough. Every time the ref cnt is updated, the "updated" column gets updated also, so that based on the ref cnt column and the updated column, the mgt server will know whether any other users are using the cached object and when the cached object was last used, and can then implement an LRU reclaim algorithm.
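As a sketch of what that could look like at the SQL level (the table and column details here are illustrative; the actual store_ref schema may differ), the ref cnt bump and the "updated" timestamp can be changed in one atomic statement, so readers need no separate lock:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public final class CacheRefCountDao {

        /** Atomically take a read reference on a cached template entry.
         *  The UPDATE bumps ref_cnt and the updated timestamp in one statement,
         *  so concurrent readers never lose an increment. */
        public static boolean incrementRef(Connection conn, long storeId, long templateId) throws SQLException {
            String sql = "UPDATE template_store_ref "
                       + "SET ref_cnt = ref_cnt + 1, updated = NOW() "
                       + "WHERE store_id = ? AND template_id = ? AND state = 'Ready'";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, storeId);
                ps.setLong(2, templateId);
                return ps.executeUpdate() == 1;   // false: entry missing or not Ready
            }
        }

        /** Release a read reference; the aging-out reaper only considers entries with ref_cnt = 0. */
        public static void decrementRef(Connection conn, long storeId, long templateId) throws SQLException {
            String sql = "UPDATE template_store_ref "
                       + "SET ref_cnt = GREATEST(ref_cnt - 1, 0), updated = NOW() "
                       + "WHERE store_id = ? AND template_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, storeId);
                ps.setLong(2, templateId);
                ps.executeUpdate();
            }
        }
    }

Because the Ready check and the increment happen in the same statement (and the reaper would only mark an entry destroying while ref_cnt = 0), the small window described in step 6 above should be closed without an explicit DB lock.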
I think the safest approach for now is to simply delete the file from the staging area when the reference count drops to 0. This approach will likely incur some additional transfer, but it is the simplest path to ensure the least amount of resource consumption. If we see performance issues, we can evaluate adding an LRU algorithm to hold objects in the staging area longer.

>
>> The other part that we must accommodate is resource reservation. Clients need to declare the anticipated size of their use before starting an operation. The storage subsystem needs to track the amount of space committed vs. used, and fail fast when it is clear that the system will not have the resources available to fulfill a request. For 4.2, I don't think we have the time to implement a robust queueing/best-efforts facility. For 4.2, I think a checked exception indicating temporary resource unavailability will be sufficient for clients to determine the best course of recovery action (i.e. error out or retry).
>
> The resource reservation is something that hasn't been done well in CloudStack for a long time. There is no proper resource reservation for all the storage related operations; it's likely the storage will get used up if there are concurrent volume creation operations, as there is no lock at the mgt server to check/update storage capacity.
> What I am trying to implement for resource reservation is:
> 1. Each storage (primary/secondary, or staging area) has a db entry in op_host_capacity, which contains the used/allocated/total size of each storage.
> 2. Each allocation operation (there is a common entry point: datastore->create/delete) needs to update the above db entry in an atomic way: either hold a DB row lock, then update, or implement a CompareAndSet method, so that in case of concurrent storage create/delete operations, the capacity is updated properly.
> 3. Before each capacity update, if used/total is beyond a certain threshold, then fail.

Hopefully, this work will lead to a more generic resource reservation system within CloudStack. I think a resource_reservation table with a foreign key to the storage entity, a size, a creation timestamp, a last accessed timestamp, and an id (UUID) will suffice. We will also need a reservation_resource_lock table with a row per DataStore. The reservation process would perform the following steps (see the sketch below):

1. Acquire a row-level lock from the reservation_resource_lock table for the DataStore
2. Sum the reservations for the device and determine if enough space exists
3. If enough space exists, insert a row in resource_reservation with the size, resource id, and UUID of the reservation
4. Release the row-level lock on the reservation_resource_lock table for the DataStore

Reservation release would follow a similar approach without the summation -- just a delete of the reservation by UUID. As a backstop, we also need a reaper thread to kill reservations based on a TTL from the last accessed timestamp.
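A rough sketch of those four steps (illustrative only -- the tables and columns follow the proposal above and do not exist in the schema today), using a SELECT ... FOR UPDATE on the per-DataStore lock row inside one transaction:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.UUID;

    public final class ReservationDao {

        /** Returns the reservation UUID, or null if the DataStore cannot hold sizeBytes more. */
        public static String reserve(Connection conn, long dataStoreId, long capacityBytes, long sizeBytes)
                throws SQLException {
            boolean oldAutoCommit = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try {
                // Step 1: row-level lock on the per-DataStore lock row.
                try (PreparedStatement lock = conn.prepareStatement(
                        "SELECT id FROM reservation_resource_lock WHERE data_store_id = ? FOR UPDATE")) {
                    lock.setLong(1, dataStoreId);
                    lock.executeQuery();
                }
                // Step 2: sum outstanding reservations and check the remaining capacity.
                long reserved = 0;
                try (PreparedStatement sum = conn.prepareStatement(
                        "SELECT COALESCE(SUM(size), 0) FROM resource_reservation WHERE data_store_id = ?")) {
                    sum.setLong(1, dataStoreId);
                    try (ResultSet rs = sum.executeQuery()) {
                        if (rs.next()) {
                            reserved = rs.getLong(1);
                        }
                    }
                }
                if (reserved + sizeBytes > capacityBytes) {
                    conn.rollback();        // ending the transaction releases the row lock
                    return null;
                }
                // Step 3: record the reservation.
                String uuid = UUID.randomUUID().toString();
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO resource_reservation (uuid, data_store_id, size, created, last_accessed) "
                        + "VALUES (?, ?, ?, NOW(), NOW())")) {
                    ins.setString(1, uuid);
                    ins.setLong(2, dataStoreId);
                    ins.setLong(3, sizeBytes);
                    ins.executeUpdate();
                }
                conn.commit();              // Step 4: the commit releases the row lock
                return uuid;
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            } finally {
                conn.setAutoCommit(oldAutoCommit);
            }
        }

        /** Release by UUID -- no summation needed. */
        public static void release(Connection conn, String reservationUuid) throws SQLException {
            try (PreparedStatement del = conn.prepareStatement(
                    "DELETE FROM resource_reservation WHERE uuid = ?")) {
                del.setString(1, reservationUuid);
                del.executeUpdate();
            }
        }
    }

Holding the lock only for the summation and the insert keeps the critical section short; the commit (or rollback) is what releases the per-DataStore row lock.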
>
> There are some known issues with the resource reservation:
> 1. The size of certain objects is unknown at the time of the resource reservation, such as the template size (we may need to call an httpclient on the mgt server to get the size of the template, as the template has not yet been downloaded into secondary storage, in the register-template case), or the snapshot size (in the case of copying a snapshot from primary storage to the NFS staging area, the mgt server doesn't know the size of the snapshot before issuing the copy command, so it doesn't know how to make the resource reservation).

For templates, we will need to know the size in order to transfer to the object store. For a snapshot, we can start with a reservation for the total size of the Volume being snapshotted. The reservation does not need to be precise. It must be large enough to fit the results of the operation. Therefore, if a Volume is defined to be 10GB in size, but the snapshot only occupies 500MB of space, then we reserve 10GB. We are assured that the snapshot operation will not fail due to a lack of disk space. On the downside, we may crowd out other operations, but I would rather block other operations than have a race to fill the disk.

> 2. Due to issue 1 above, the capacity db table can be out-of-sync with the actual storage usage. No matter how carefully coded at the mgt server, capacity info in the DB can be out-of-sync with the actual physical capacity. We need to sync with the info returned by GetStorageStatsCommand.
> 3. Storage over-provisioning: currently only NFS storage can do over-provisioning, but I think it should be decided by each storage provider.

Agreed. The DataStore should be queried for available free space, which in turn should be implemented by the driver. Thinking through it, the result should be a Long where a null value means, essentially, infinite space available, since most object stores don't really have the notion of free space ...

>
> I'll implement a simple resource reservation at first.
>
>>>> We would also likely want to consider a TTL to purge files after a configurable period of inactivity as a backstop against crashed processes failing to properly decrement the reference count. In this model, we will either defer or reject work if resources are not available, and we properly bound resources.
>>>
>>> Yes, it should be taken into consideration for all the time consuming operations.
>>>
>>>> Finally, in terms of decoupling the decision to use this mechanism by hypervisor plugins from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile). These interfaces would provide the storage subsystem with the hint that the client requires file access to the requested resource. For object store plugins, this hint would be used to wrap the resource in an object that would transfer in and out of the staging area.
>>>>
>>>> Thoughts?
>>>> -John
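For what it's worth, a minimal sketch of that kind of interface (the method and type names are invented for illustration, not the actual object_store API). The File-based overloads are the hint that the caller needs a real file, so an object store driver can route them through the staging area, while the stream-based overloads can go straight to the backing store:

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    /** Illustrative secondary storage service interface -- not the actual CloudStack API. */
    public interface SecondaryStorageService {

        /** Stream-based write: an object store driver can upload this directly,
         *  with no staging area involved. */
        void createTemplate(String templateUuid, InputStream data, long length) throws IOException;

        /** File-based write: the File argument is the hint that the caller works on a
         *  file system, so an object store driver may stage the file and then transfer it. */
        void createTemplate(String templateUuid, File data) throws IOException;

        /** Stream-based read: pump the object out of the backing store. */
        InputStream readTemplate(String templateUuid) throws IOException;

        /** File-based read: the driver materializes the object in the staging area
         *  (taking a reservation/reference) and hands back a local file. */
        File readTemplateAsFile(String templateUuid) throws IOException;
    }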
>>>>
>>>> On Jun 3, 2013, at 7:17 PM, Edison Su wrote:
>>>>
>>>>> Let's start a new thread about NFS cache storage issues on object_store.
>>>>> First, I'll go through how NFS storage works on the master branch, then how it works on the object_store branch, then let's talk about the "issues".
>>>>>
>>>>> 0. Why do we need NFS secondary storage?
>>>>>
>>>>> NFS secondary storage is used as a place to store templates/snapshots etc.; it's zone wide, and it's widely supported by most hypervisors (except Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage, like S3/Swift, CloudStack added support for Swift in 3.x, and S3 in 4.0. You may wonder: if S3/Swift is used as the place to store templates/snapshots, then why do we still need NFS secondary storage?
>>>>>
>>>>> There are two reasons for that:
>>>>>
>>>>> a. CloudStack storage code is tightly coupled with NFS secondary storage, so when adding Swift/S3 support, it was easier to take a shortcut and leave NFS secondary storage as it is.
>>>>>
>>>>> b. Certain hypervisors, and certain storage related operations, can not directly operate on object storage.
>>>>> Examples:
>>>>>
>>>>> b.1 When backing up a snapshot (a snapshot taken on the XenServer hypervisor) from primary storage to S3:
>>>>>
>>>>> If there are snapshot chains on the volume, and we want to coalesce the snapshot chains into a new disk and then copy it to S3, we either coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) supported by XenServer.
>>>>>
>>>>> If we coalesce it on primary storage, we may blow up the primary storage, as the coalesced new disk may need a lot of space (consider that the new disk will contain all the content from the leaf snapshot all the way up to the base template), but the primary storage was not planned for this operation (the CloudStack mgt server is unaware of this operation; the mgt server may think the primary storage still has enough space to create volumes).
>>>>>
>>>>> XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to use another storage supported by XenServer; that's why NFS storage is used during snapshot backup. So what we do is first call the XenServer API to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we do on both the master branch and the object_store branch.
>>>>>
>>>>> b.2 When creating a volume from a snapshot, if the snapshot is stored on S3:
>>>>>
>>>>> If the snapshot is a delta snapshot, we need to coalesce the chain into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents somewhere, then coalesce them with XenServer's tools. Again, there are two options: download all the snapshots into primary storage, or download them into NFS storage.
>>>>>
>>>>> If we download all the snapshots into primary storage directly from S3, then first we need to find a way to import the snapshot from S3 into primary storage (if the primary storage is a block device, extra care is needed) and then coalesce them. If we go this way, we need to find a primary storage with enough space, and even worse, if the primary storage is not zone-wide, then later on we may need to copy the volume from one primary storage to another, which is time consuming.
>>>>>
>>>>> If we download all the snapshots into NFS storage from S3, then coalesce them, and then copy the volume to primary storage: as the NFS storage is zone wide, you can copy the volume into whatever primary storage you like, without an extra copy.
>>>>> This is what we do on both the master branch and the object_store branch.
>>>>>
>>>>> b.3 Some hypervisors, or some storages, do not support directly importing a template into primary storage from a URL. For example, if Ceph is used as primary storage, when importing a template into RBD, we need to transform a Qcow2 image into a RAW disk, then into RBD format 2. In order to transform a Qcow2 image into a RAW disk, you need an extra file system: either a local file system (this is what other stacks do, which is not scalable to me), or NFS storage (this is what can be done on both master and object_store). Or one can modify the hypervisor or storage to support directly importing a template from S3 into RBD. Here is the link (http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html) that Wido posted.
>>>>>
>>>>> Anyway, there are so many combinations of hypervisors and storages: for some hypervisors with zone-wide file-system-based storage (e.g. KVM + gluster/NFS as primary storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which can import a template from a URL regardless of which storage you are using, then you don't need extra NFS storage. But if you are using XenServer, in order to create a volume from a delta snapshot, you will need NFS storage, or if you are using KVM + Ceph, you may also need NFS storage.
>>>>>
>>>>> Due to the above reasons, NFS cache storage is needed in certain cases if S3 is used as secondary storage. The combinations of hypervisors and storages are quite complicated; whether to use cache storage or not should be decided case by case. But as long as CloudStack provides a framework that gives people the choice to enable/disable cache storage on their own, then I think the framework is good enough.
>>>>>
>>>>> 1. Then let's talk about how NFS storage works on the master branch, with or without S3.
>>>>>
>>>>> If S3 is not used, here is how NFS storage is used:
>>>>>
>>>>> 1.1 Register a template/ISO: CloudStack downloads the template/ISO into NFS storage.
>>>>>
>>>>> 1.2 Backup snapshot: CloudStack sends a command to the XenServer hypervisor and issues a vdi.copy command to copy the snapshot to NFS; for KVM, it directly uses "cp" or "qemu-img convert" to copy the snapshot into NFS storage.
>>>>>
>>>>> 1.3 Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.
>>>>>
>>>>> If S3 is used:
>>>>>
>>>>> 1.4 Register a template/ISO: download the template/ISO into NFS storage first; then a background thread uploads the template/ISO from NFS storage into S3 periodically. The template being in the Ready state only means the template is stored on NFS storage; the admin doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones, CloudStack will copy the template from one zone-wide NFS storage into another NFS storage in another zone, even though there is already a region-wide S3 available. As the template is not directly uploaded to S3 when registering a template, it takes several copies to spread the template region wide.
>>>>>
>>>>> 1.5 Backup snapshot: CloudStack sends a command to the XenServer hypervisor to copy the snapshot to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The snapshot being in the BackedUp state not only means the snapshot is in NFS storage, but also means it's stored on S3.
>>>>>
>>>>> 1.6 Create volume from snapshot: download the snapshot and its parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
>>>>>
>>>>> 2. Then let's talk about how it works on object_store:
>>>>>
>>>>> If S3 is not used, there is ZERO change from the master branch. How NFS secondary storage worked before is exactly how it works on object_store.
>>>>>
>>>>> If S3 is used, and NFS cache storage is also used (which is the default):
>>>>>
>>>>> 2.1 Register a template/ISO: the template/ISO is directly uploaded to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. It implies that the template is immediately available in the region as soon as it's in the Ready state. And the admin clearly knows the status of the template on S3: what percentage of the upload is done, did it fail or succeed? Also, if registering the template failed for some reason, the admin can issue the register template command again. I would say the change in how a template is registered into S3 is far better than what we did on the master branch.
>>>>>
>>>>> 2.2 Backup snapshot: it's the same as the master branch -- send a command to the XenServer host, copy the snapshot into NFS, then upload to S3.
>>>>>
>>>>> 2.3 Create volume from snapshot: it's the same as the master branch -- download the snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
>>>>>
>>>>> From the above few typical usage cases, you may understand how S3 and NFS cache storage are used, and what the difference between the object_store branch and the master branch is: basically, we only change the way a template is registered, nothing else.
>>>>>
>>>>> If S3 is used, and no NFS cache storage is used (it's possible, depending on which datamotion strategy is used):
>>>>>
>>>>> 2.4 Register a template/ISO: it's the same as 2.1
>>>>> 2.5 Backup snapshot: export the snapshot from primary storage into S3 directly
>>>>> 2.6 Create volume from snapshot: download snapshots from S3 into primary storage directly, then coalesce and create the volume from them.
>>>>>
>>>>> Hopefully the above explanation tells the truth about how the system works on object_store, and clarifies the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want the existing NFS storage, that's definitely OK; it works the same as before. If you want to use S3, we provide a better S3 implementation when registering a template/ISO. If you want to use S3 without NFS storage, that's also definitely OK; the framework is quite flexible and can accommodate different solutions.
>>>>>
>>>>> OK, let's talk about the NFS cache storage issues.
>>>>> The issues with NFS cache storage have been discussed in several threads, back and forth. All in all, the NFS cache storage is only one usage case out of three supported by the object_store branch. It's not something where, if it has an issue, then nothing works.
>>>>> In 2.2 and 2.3 above, you can see how the NFS cache storage is involved during snapshot related operations. The complaints that there is no aging policy and no capacity planner for NFS cache storage apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can be used up if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
>>>>>
>>>>> Let's talk about the code. Here is the code related to NFS cache storage -- not much, only one class depends on NFS cache storage:
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>>>>> Take copyVolumeFromSnapshot as an example, which will be called when creating a volume from a snapshot: it first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage.
>>>>> StorageCacheManagerImpl->createCacheObject is the only place that creates objects on NFS cache storage; the code is at
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>>>>> In createCacheObject, it first finds a cache storage, in case there are multiple cache storages available in a scope:
>>>>> DataStore cacheStore = this.getCacheStorage(scope);
>>>>> getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place to choose NFS cache storage based on certain criteria; the current implementation only randomly chooses one of them. We can add a new allocator algorithm based on capacity, etc.
>>>>>
>>>>> Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this entry to store capacity information about NFS cache storages (such as total size, available/used capacity, etc.). So on every call to createCacheObject, we can call StorageCacheAllocator to find a proper NFS storage based on first-fit criteria, then increase the used capacity in the op_host_capacity table. If creating the cache object fails, return the capacity to op_host_capacity.
>>>>>
>>>>> Regarding the aging-out policy, we can start a background thread on the mgt server, which will scan all the objects created on NFS cache storage (the tables are snapshot_store_ref, template_store_ref, volume_store_ref). Each entry in these tables has a column called "updated"; every time the object's state is changed, the "updated" column gets updated as well.
>>>>> When is the object's state changed?
>>>>> Every time the object is used in some context (such as copying the snapshot on NFS cache storage somewhere else), the object's state is changed accordingly, for example to "Copying", meaning the object is being copied to some place -- which is exactly the information we need to implement an LRU algorithm.
>>>>>
>>>>> What do you guys think about the fix? If you have a better solution, please let me know.
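To illustrate the aging-out thread Edison describes above (the table, column, and state names here are only illustrative and follow the discussion in this thread; they are not necessarily the real schema), a background reaper could look roughly like this:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative aging-out reaper for staged objects on cache storage. */
    public class StagingReaper implements Runnable {

        private final Connection conn;
        private final int idleDays;

        public StagingReaper(Connection conn, int idleDays) {
            this.conn = conn;
            this.idleDays = idleDays;
        }

        @Override
        public void run() {
            try {
                for (String table : new String[] {"template_store_ref", "snapshot_store_ref", "volume_store_ref"}) {
                    // Candidates: unreferenced entries that have not been touched recently.
                    String select = "SELECT id FROM " + table
                            + " WHERE ref_cnt = 0 AND state = 'Ready'"
                            + " AND updated < DATE_SUB(NOW(), INTERVAL ? DAY)";
                    List<Long> ids = new ArrayList<>();
                    try (PreparedStatement ps = conn.prepareStatement(select)) {
                        ps.setInt(1, idleDays);
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                ids.add(rs.getLong(1));
                            }
                        }
                    }
                    for (long id : ids) {
                        // Mark Destroying only if still unreferenced -- closes the race with a new reader.
                        String update = "UPDATE " + table
                                + " SET state = 'Destroying' WHERE id = ? AND ref_cnt = 0 AND state = 'Ready'";
                        try (PreparedStatement ps = conn.prepareStatement(update)) {
                            ps.setLong(1, id);
                            if (ps.executeUpdate() == 1) {
                                // Here the mgt server would send the delete command to the SSVM.
                            }
                        }
                    }
                }
            } catch (SQLException e) {
                // Log and retry on the next scheduled run.
            }
        }
    }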