Subject: Re: [DISCUSS] NFS cache storage issue on object_store
From: John Burwell <jburwell@basho.com>
Date: Fri, 7 Jun 2013 17:38:56 -0400
To: dev@cloudstack.apache.org
Message-Id: <8D81A089-C160-48B5-BC44-358867847574@basho.com>
In-Reply-To: <77B337AF224FD84CBF8401947098DD87038B85@SJCPEX01CL01.citrite.net>

Edison,

Please see my comments in-line below.
Thanks,
-John

On Jun 6, 2013, at 6:43 PM, Edison Su wrote:

>> -----Original Message-----
>> From: John Burwell [mailto:jburwell@basho.com]
>> Sent: Thursday, June 06, 2013 7:47 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>>
>> Edison,
>>
>> Please see my comments in-line below.
>>
>> Thanks,
>> -John
>>
>> On Jun 5, 2013, at 6:55 PM, Edison Su wrote:
>>
>>>> -----Original Message-----
>>>> From: John Burwell [mailto:jburwell@basho.com]
>>>> Sent: Wednesday, June 05, 2013 1:04 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>>>>
>>>> Edison,
>>>>
>>>> You have provided some great information below which helps greatly to understand the role of the "NFS cache" mechanism. To summarize, this mechanism is only currently required for Xen snapshot operations driven by Xen's coalescing operations. Is my understanding correct? Just out of
>>>
>>> I think Ceph may still need the "NFS cache", for example, during delta snapshot backup: http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
>>> You need to create a delta snapshot into a file, then upload the file into S3.
>>>
>>> For KVM, if the snapshot is taken on qcow2, then we need to copy the snapshot into a file system, then back it up to S3.
>>>
>>> Another usage case for the "NFS cache" is to cache templates stored on S3 if there is no zone-wide primary storage. We need to download the template from S3 into every primary storage; if there is no cache, each download will take a while. Comparing downloading the template directly from S3 (if the S3 is region wide) with downloading from a zone-wide "cache" storage, I would say the download from zone-wide cache storage should be faster than from region-wide S3. If there is no zone-wide primary storage, then we will download the template from S3 several times, which is quite time consuming.
>>>
>>> There may be other places to use the "NFS cache", but the point is: as long as the mgt server can be decoupled from this "cache" storage, then we can decide when/how to use cache storage based on different kinds of hypervisor/storage combinations in the future.
>>
>> I think we would do well to re-orient the way we think about roles and requirements. Ceph doesn't need a file system to perform a delta snapshot operation. Xen, KVM, and/or VMWare need access to a file system to
>
> For the Ceph delta snapshot case, it's Ceph that has the requirement of a file system to perform the delta snapshot (http://ceph.com/docs/next/man/8/rbd/):
>
> export-diff [image-name] [dest-path] [--from-snap snapname]
> Exports an incremental diff for an image to dest path (use - for stdout). If an initial snapshot is specified, only changes since that snapshot are included; otherwise, any regions of the image that contain data are included. The end snapshot is specified using the standard --snap option or @snap syntax (see below). The image diff format includes metadata about image size changes, and the start and end snapshots. It efficiently represents discarded or 'zero' regions of the image.
>
> The dest-path is either a file or stdout; if using stdout, then it needs a lot of memory. If using the hypervisor's local file system, then the local file system may not have enough space to store the delta diff.
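Just to make the streaming idea concrete before my comments below, here is a rough sketch (the ObjectStoreClient interface and all names are invented for illustration, not code from the branch) of how the export-diff output could be pumped from stdout to an object store in bounded chunks instead of landing on a file system:

    import java.io.IOException;
    import java.io.InputStream;

    public final class RbdDiffStreamer {

        /** Hypothetical object store client -- stands in for whatever S3/Swift client is used. */
        public interface ObjectStoreClient {
            void putObject(String bucket, String key, InputStream data) throws IOException;
        }

        /**
         * Runs "rbd export-diff --from-snap <fromSnap> <image> -" and streams stdout
         * straight to the object store, so only a small fixed-size buffer is held in
         * memory instead of materializing the diff on a local file system.
         */
        public static void backupDelta(ObjectStoreClient store, String bucket, String key,
                                       String image, String fromSnap)
                throws IOException, InterruptedException {
            Process rbd = new ProcessBuilder("rbd", "export-diff", "--from-snap", fromSnap, image, "-")
                    .start();
            try (InputStream diff = rbd.getInputStream()) {
                // The client reads the stream in bounded chunks (e.g. a multipart upload),
                // so memory use stays constant regardless of the diff size.
                store.putObject(bucket, key, diff);
            }
            if (rbd.waitFor() != 0) {
                throw new IOException("rbd export-diff failed with exit code " + rbd.exitValue());
            }
        }
    }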
I apologize for failing to read more closely -- I mistakenly assumed you were referring to hypervisor snapshots. To my mind, if a local file system is needed by a storage driver to perform an operation, then it should be encapsulated within the driver's scope. The storage layer should provide a suitable interface for the driver to acquire/release a reservation to the staging/temporary area if it needs it.

For Ceph specifically, stdout can be pushed through a BufferedOutputStream and written straight to the object store -- skipping the file system. With this approach, we should be able to keep the memory required a fixed size and "pump" it out to the object store. Ideally, we would define the interfaces to provide InputStreams and OutputStreams -- creating the potential for the copy operation to be implemented in the orchestration code.

>
>> perform these operations. The hypervisor plugin should request a reservation of x size as a file handle from the Storage subsystem. The Ceph driver implements this request by using a staging area + transfer operation. This approach encapsulates the operation/rules around the staging area from clients, protects against concurrent requests flooding a resource, and allows hypervisor-specific behavior/rules to be encapsulated in the appropriate plugin.
>>
>>>> curiosity, is there a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage? I have Googled a bit, and found very little information about it.
>>>> Has the object_store branch been tested with VMWare and KVM? If so, what operations on these hypervisors have been tested?
>>>
>>> Both VMware and KVM are tested, but without S3 support. Haven't had time to take a look at how to use S3 with either hypervisor yet.
>>> For example, we should take a look at how to import a template from a URL into a VMware data store; that way we can eliminate the "NFS cache" during template import.
>>
>> Given the release extension and the impact of these tests on the implementation, we need to test S3 with VMWare and KVM pre-merge.
>
> I would like to hand over the implementation of S3 (directly using S3 without the NFS staging area) on both VMware and KVM to the community, or do it in the next release, or after the merge.
> The reason is simple: we need to get the mgt server part of the refactor done first; the hypervisor-side implementation or optimization can be done after the mgt server side refactor. I think what we are doing in the mgt server side refactor paves the way for this kind of optimization on the hypervisor side.

This begs a larger question for me -- why is the implementation hypervisor specific? Naively, it seems that fitting the current hypervisors for the new storage architecture would bring along this feature for nearly free. I remain concerned that we have not adequately decoupled the Hypervisor and Storage layers. As I have stated a few (thousand) times now, I am focused on breaking the circular dependency between the Hypervisor and Storage layers to avoid this type of feature stratification.

>
>>>> In reading through the description below, my operational concerns remain regarding potential race conditions and resource exhaustion. Also, in reading through the description, I think we should find a new name for this mechanism.
>>>> As Chip has previously mentioned, a cache implies the following characteristics:
>>>>
>>>> 1. Optional: Systems can operate without caches, just more slowly. However, with this mechanism, snapshots on Xen will not function.
>>>
>>> I agree on this one.
>>>
>>>> 2. Volatility: Caches are backed by durable, non-volatile storage. Therefore, if the cache's data is lost, it can be rebuilt from the backing store and no data will be permanently lost from the system. However, this mechanism contains snapshots in-transit to an object store. If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.
>>>
>>> It's the same thing for the file cache on a Linux file system. If the file cache has not been flushed to disk when the machine loses power, then the data in the file cache is lost.
>>> When we back up the snapshot from primary storage to S3, the snapshot is copied to the "NFS cache", then immediately copied from the "NFS cache" into S3. If the snapshot on the "NFS cache" is lost, then the snapshot backup fails. The user can issue another backup snapshot command in this case.
>>> So I don't think it's an issue.
>>
>> The window of opportunity for data loss from a file system sync is much narrower for the Linux filesystem than for this staging area. Furthermore, that risk can be largely (if not completely) mitigated with battery-backed hardware and/or conservative NFS settings.
>>
>> For this staging area, the object store may be unreachable for an extended period of time (minutes, hours). There are no cache flush settings or hardware solutions when it becomes unavailable. If the data is lost from the staging area, it will be gone. I think it is one of the largest issues with this approach, and we must be careful to ensure that data cannot be lost before it is transferred out.
>
> I agree. It's not that I want to use a staging area; it's the limitation of the hypervisor or storage, which can't directly transfer data in/out of S3 for some operations.
> I think we agree on the limitations and issues with the staging area, but that's the current reality.
> If we want to remove the staging area totally, we need more resources to look at what we can do for each hypervisor and each storage. We can't finish all of that in just one month.
> If other people are willing to help us in this area, I'll appreciate it.

I apologize if I haven't clearly expressed my recognition that we currently can't avoid the staging area in some circumstances. I want to ensure that we implement it in a robust manner that avoids introducing instability into implementations using object storage.

>
>>>> In order to set expectations with users and better frame our design conversation, I think it would be appropriate to describe this mechanism as a staging,
>>>
>>> Ok, it seems "cache" is confusing people; we can use another term, or document clearly what the role of this storage is.
>>> Yes, it's just a temporary file system, which can be used to store some temporary files.
>>>
>>>> scratch, or temporary area. I also recommend removing the notion of NFS from its name, as NFS is the initial implementation of this mechanism. In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.
>>>
>>> Agree, any storage can be used as "cache" storage.
>>> If you take a look at StorageManagerImpl->createCacheStore, it's nothing related to NFS.
>>>
>>>> In terms of solving the potential race conditions and resource exhaustion issues, I don't think an LRU approach will be sufficient because the least recently used resource may still be in use by the system. I think we should look to a reservation model with reference counting where files are deleted once no processes are accessing them. The following is a (handwave-handwave) overview of the process I think would meet these requirements:
>>>>
>>>> 1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
>>>>    - If the file is already in the staging area, increase its reference count
>>>>    - If the reservation can not be fulfilled, we can either drop the process in a retry queue or reject it.
>>>> 2. Perform work and transfer file(s) to/from the object store
>>>> 3. Release the file(s) -- decrementing the reference count. When the reference count is <= 0, delete the file(s) from the staging area
>>>
>>> I assume the reference count is stored in memory and inside the SSVM?
>>> The reference count may not work properly in the case of multiple secondary storage VMs and multiple mgt servers. And there may be a lot of places other than the SSVM that can directly use the cached object.
>>> If we store the reference count on a file system, then we need to take a lock (such as an NFS lock, or a lock file) to update it, and the lock can fail to be released due to all kinds of reasons (such as the network).
>>
>> We could implement reference counting in a number of ways. The first would be to increment a value in the database before command submission to the SSVM, and decrement it as part of answer processing. We could evaluate
>
> I agree, we can add a ref count column in template/volume/snapshot_store_ref, which can track how many readers of the cached object there are.
>
>> using a distributed framework such as Hazelcast (http://www.hazelcast.com) which provides a distributed countdown latch (http://www.hazelcast.com/docs/1.9.4/javadoc/com/hazelcast/core/ICountDownLatch.html) across the SSVMs. We need to avoid POSIX-style file
>
> Good to know.
>
>> system locks because they are not consistently implemented/available (e.g. OCFS2).
>>
>> My first brush thoughts on it would be to use a database table in 4.2, and evaluate adopting something like Hazelcast in 4.3. Personally, I would like to see us move away from relying on relational database semantics to implement distributed data structures (counters, locks, etc). However, given the time pressures, I don't think we have the time to properly evaluate the impact of adopting a more general purpose distributed framework in 4.2.
>
> I agree.
>
>> From a code perspective, I think it would behove us to implement a more functional approach to command execution in order to ensure reference counting, error handling, and resource management are handled in a consistent manner. I implemented such an approach in com.cloud.utils.db.GlobalLock#executeWithLock where locking around a particular operation is managed separately from the actual operation being performed.
>
> I'll take a look at your implementation.
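To make that shape concrete, here is a rough sketch (the ReservationManager interface and method names are invented for illustration; this is not the existing GlobalLock API) of a functional wrapper that acquires a staging reservation, runs the operation, and always releases the reference in a finally block:

    import java.util.concurrent.Callable;

    public final class StagingArea {

        /** Hypothetical reservation service -- could be backed by a DB column in 4.2
         *  or a distributed structure such as Hazelcast later. */
        public interface ReservationManager {
            /** Reserve up to maxSizeBytes for the object, or bump its ref count if it is
             *  already staged. Throws if capacity cannot be reserved. */
            void acquire(String objectUuid, long maxSizeBytes) throws InsufficientCapacityException;

            /** Decrement the ref count; the implementation deletes the staged file once it reaches 0. */
            void release(String objectUuid);
        }

        public static class InsufficientCapacityException extends Exception {
            public InsufficientCapacityException(String msg) { super(msg); }
        }

        private final ReservationManager reservations;

        public StagingArea(ReservationManager reservations) {
            this.reservations = reservations;
        }

        /** Mirrors the GlobalLock#executeWithLock idea: the bookkeeping around the
         *  operation is managed separately from the operation itself. */
        public <T> T executeWithReservation(String objectUuid, long maxSizeBytes, Callable<T> operation)
                throws Exception {
            reservations.acquire(objectUuid, maxSizeBytes);
            try {
                return operation.call();
            } finally {
                reservations.release(objectUuid);
            }
        }
    }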
>>>
>>> I thought about it yesterday, about how to implement LRU. Originally, I thought we could eliminate the race condition and track who is using objects stored on cache storage by using a state machine. For example, whenever the mgt server wants to use the cached object, the mgt server can change the state of the cached object to "Copying" (there is a DB entry for each cached object); after the copy is finished, change the state to "Ready", and also update the "updated" column. This eliminates the race condition, as only one thread can access the cached object and change its state. But the problem with this approach is that there are cases where multiple reader threads may want to read the cached object at the same time: e.g. copying the same cached template to multiple primary storages at the same time.
>>>
>>> In order to accommodate multiple readers, I am trying to add a new db table to track the users of the cached object.
>>> The flow will be like the following:
>>> 1. The mgt server wants to use the cached object: first, it needs to check the state of the cached object -- the state must be Ready.
>>> 2. The mgt server writes a db entry into the DB; the entry will contain the id of the cached object, the id of the cache storage, and the issued time. The db entry will also contain a state: the state can be initial/processing/finished/failed. The mgt server needs to set the state to "processing".
>>> 3. The mgt server finishes the operation related to the cached object, then marks the state of the above db entry as "finished", and also updates the time column of the above entry.
>>> 4. The above db entries will be removed if the state has not been "processing" for a while (let's say one week?), or if the entry has been in the "processing" state for a while (let's say one day). In this way, the mgt server can easily know which cached objects have or have not been used recently by looking at this db table.
>>> 5. If the mgt server finds a cached object has not been used (there is no db entry in the above table) for a while (let's say one week), then it changes the state of the cached object to "destroying", then sends a command to the ssvm to destroy the object.
>>> 6. There is a small window where the mgt server is changing the state of a cached object to "destroying" (there is no db entry in the "processing" state in the above table), while another thread is trying to copy it (as the cached object's state is still Ready); both DB operations will succeed. We can hold a DB lock on the cached object entry before both DB operations.
>>>
>>> What do you think?
>>
>> The issue remains that the least recently used (really accessed) object can still be in use by a running process. One example that pops to mind is a popular, large template that has a set of longish running processes creating from it. As I described above, I think you can change issued time to a reference count, and add logic to step 3 to decrement/check the count. With the proper transaction semantics, we provide sufficient consistency guarantees around a reference count.
>
> Agree. I only need to track how many readers are currently using the cached object. So a ref cnt is enough; I don't even need to create a new db table to track the ref cnt -- adding a new refcnt column on template/snapshot/volume_store_ref is good enough. Every time the ref cnt is updated, the "updated" column gets updated also, so that based on the ref cnt column and the updated column, the mgt server will know whether any other users are using the cached object and when the cached object was last used, and can then implement an LRU reclaim algorithm.
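As a sketch of what that could look like at the SQL level (the table and column details here are illustrative; the actual store_ref schema may differ), the ref cnt bump and the "updated" timestamp can be changed in one atomic statement, so readers need no separate lock:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public final class CacheRefCountDao {

        /** Atomically take a read reference on a cached template entry.
         *  The UPDATE bumps ref_cnt and the updated timestamp in one statement,
         *  so concurrent readers never lose an increment. */
        public static boolean incrementRef(Connection conn, long storeId, long templateId) throws SQLException {
            String sql = "UPDATE template_store_ref "
                       + "SET ref_cnt = ref_cnt + 1, updated = NOW() "
                       + "WHERE store_id = ? AND template_id = ? AND state = 'Ready'";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, storeId);
                ps.setLong(2, templateId);
                return ps.executeUpdate() == 1;   // false: entry missing or not Ready
            }
        }

        /** Release a read reference; the aging-out reaper only considers entries with ref_cnt = 0. */
        public static void decrementRef(Connection conn, long storeId, long templateId) throws SQLException {
            String sql = "UPDATE template_store_ref "
                       + "SET ref_cnt = GREATEST(ref_cnt - 1, 0), updated = NOW() "
                       + "WHERE store_id = ? AND template_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, storeId);
                ps.setLong(2, templateId);
                ps.executeUpdate();
            }
        }
    }

Because the Ready check and the increment happen in the same statement (and the reaper would only mark an entry destroying while ref_cnt = 0), the small window described in step 6 above should be closed without an explicit DB lock.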
I think the safest approach for now is to simply delete the file from the staging area when the reference count drops to 0. This approach will likely incur some additional transfer, but it is the simplest path to ensure the least amount of resource consumption. If we see performance issues, we can evaluate adding an LRU algorithm to hold objects in the staging area longer.

>
>> The other part that we must accommodate is resource reservation. Clients need to declare the anticipated size of their use before starting an operation. The storage subsystem needs to track the amount of space committed vs. used, and fail fast when it is clear that the system will not have the resources available to fulfill a request. For 4.2, I don't think we have the time to implement a robust queueing/best-efforts facility. For 4.2, I think a checked exception indicating temporary resource unavailability will be sufficient for clients to determine the best course of recovery action (i.e. error out or retry).
>
> The resource reservation is something that hasn't been done well in CloudStack for a long time. There is no proper resource reservation for all the storage related operations; it's likely the storage will get used up if there are concurrent volume creation operations, as there is no lock at the mgt server to check/update storage capacity.
> What I am trying to implement for resource reservation is:
> 1. Each storage (primary/secondary, or staging area) has a db entry in op_host_capacity, which contains the used/allocated/total size of each storage.
> 2. Each allocation operation (there is a common entry point: datastore->create/delete) needs to update the above db entry in an atomic way: either hold a DB row lock, then update, or implement a CompareAndSet method, so that in case of concurrent storage create/delete operations, the capacity is updated properly.
> 3. Before each capacity update, if used/total is beyond a certain threshold, then fail.

Hopefully, this work will lead to a more generic resource reservation system within CloudStack. I think a resource_reservation table with a foreign key to the storage entity, a size, a creation timestamp, a last accessed timestamp, and an id (UUID) will suffice. We will also need a reservation_resource_lock table with a row per DataStore. The reservation process would perform the following steps (see the sketch below):

1. Acquire a row-level lock from the reservation_resource_lock table for the DataStore
2. Sum the reservations for the device and determine if enough space exists
3. If enough space exists, insert a row in resource_reservation with the size, resource id, and UUID of the reservation
4. Release the row-level lock on the reservation_resource_lock table for the DataStore

Reservation release would follow a similar approach without the summation -- just a delete of the reservation by UUID. As a backstop, we also need a reaper thread to kill reservations based on a TTL from the last accessed timestamp.
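A rough sketch of those four steps (illustrative only -- the tables and columns follow the proposal above and do not exist in the schema today), using a SELECT ... FOR UPDATE on the per-DataStore lock row inside one transaction:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.UUID;

    public final class ReservationDao {

        /** Returns the reservation UUID, or null if the DataStore cannot hold sizeBytes more. */
        public static String reserve(Connection conn, long dataStoreId, long capacityBytes, long sizeBytes)
                throws SQLException {
            boolean oldAutoCommit = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try {
                // Step 1: row-level lock on the per-DataStore lock row.
                try (PreparedStatement lock = conn.prepareStatement(
                        "SELECT id FROM reservation_resource_lock WHERE data_store_id = ? FOR UPDATE")) {
                    lock.setLong(1, dataStoreId);
                    lock.executeQuery();
                }
                // Step 2: sum outstanding reservations and check the remaining capacity.
                long reserved = 0;
                try (PreparedStatement sum = conn.prepareStatement(
                        "SELECT COALESCE(SUM(size), 0) FROM resource_reservation WHERE data_store_id = ?")) {
                    sum.setLong(1, dataStoreId);
                    try (ResultSet rs = sum.executeQuery()) {
                        if (rs.next()) {
                            reserved = rs.getLong(1);
                        }
                    }
                }
                if (reserved + sizeBytes > capacityBytes) {
                    conn.rollback();        // ending the transaction releases the row lock
                    return null;
                }
                // Step 3: record the reservation.
                String uuid = UUID.randomUUID().toString();
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO resource_reservation (uuid, data_store_id, size, created, last_accessed) "
                        + "VALUES (?, ?, ?, NOW(), NOW())")) {
                    ins.setString(1, uuid);
                    ins.setLong(2, dataStoreId);
                    ins.setLong(3, sizeBytes);
                    ins.executeUpdate();
                }
                conn.commit();              // Step 4: the commit releases the row lock
                return uuid;
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            } finally {
                conn.setAutoCommit(oldAutoCommit);
            }
        }

        /** Release by UUID -- no summation needed. */
        public static void release(Connection conn, String reservationUuid) throws SQLException {
            try (PreparedStatement del = conn.prepareStatement(
                    "DELETE FROM resource_reservation WHERE uuid = ?")) {
                del.setString(1, reservationUuid);
                del.executeUpdate();
            }
        }
    }

Holding the lock only for the summation and the insert keeps the critical section short; the commit (or rollback) is what releases the per-DataStore row lock.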
>
> There are some known issues with the resource reservation:
> 1. The size of certain objects is unknown at the time of the resource reservation, such as the template size (we may need to call an httpclient on the mgt server to get the size of the template, as the template has not yet been downloaded into secondary storage, in the register-template case), or the snapshot size (in the case of copying a snapshot from primary storage to the NFS staging area, the mgt server doesn't know the size of the snapshot before issuing the copy command, so it doesn't know how to make the resource reservation).

For templates, we will need to know the size in order to transfer to the object store. For a snapshot, we can start with a reservation for the total size of the Volume being snapshotted. The reservation does not need to be precise. It must be large enough to fit the results of the operation. Therefore, if a Volume is defined to be 10GB in size, but the snapshot only occupies 500MB of space, then we reserve 10GB. We are assured that the snapshot operation will not fail due to a lack of disk space. On the downside, we may crowd out other operations, but I would rather block other operations than have a race to fill the disk.

> 2. Due to issue 1 above, the capacity db table can be out-of-sync with the actual storage usage. No matter how carefully coded at the mgt server, capacity info in the DB can be out-of-sync with the actual physical capacity. We need to sync with the info returned by GetStorageStatsCommand.
> 3. Storage over-provisioning: currently only NFS storage can do over-provisioning, but I think it should be decided by each storage provider.

Agreed. The DataStore should be queried for available free space, which in turn should be implemented by the driver. Thinking through it, the result should be a Long where a null value means, essentially, infinite space available, since most object stores don't really have the notion of free space ...

>
> I'll implement a simple resource reservation at first.
>
>>>> We would also likely want to consider a TTL to purge files after a configurable period of inactivity as a backstop against crashed processes failing to properly decrement the reference count. In this model, we will either defer or reject work if resources are not available, and we properly bound resources.
>>>
>>> Yes, it should be taken into consideration for all the time consuming operations.
>>>
>>>> Finally, in terms of decoupling the decision to use this mechanism by hypervisor plugins from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile). These interfaces would provide the storage subsystem with the hint that the client requires file access to the requested resource. For object store plugins, this hint would be used to wrap the resource in an object that would transfer in and out of the staging area.
>>>>
>>>> Thoughts?
>>>> -John
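For what it's worth, a minimal sketch of that kind of interface (the method and type names are invented for illustration, not the actual object_store API). The File-based overloads are the hint that the caller needs a real file, so an object store driver can route them through the staging area, while the stream-based overloads can go straight to the backing store:

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    /** Illustrative secondary storage service interface -- not the actual CloudStack API. */
    public interface SecondaryStorageService {

        /** Stream-based write: an object store driver can upload this directly,
         *  with no staging area involved. */
        void createTemplate(String templateUuid, InputStream data, long length) throws IOException;

        /** File-based write: the File argument is the hint that the caller works on a
         *  file system, so an object store driver may stage the file and then transfer it. */
        void createTemplate(String templateUuid, File data) throws IOException;

        /** Stream-based read: pump the object out of the backing store. */
        InputStream readTemplate(String templateUuid) throws IOException;

        /** File-based read: the driver materializes the object in the staging area
         *  (taking a reservation/reference) and hands back a local file. */
        File readTemplateAsFile(String templateUuid) throws IOException;
    }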
>>>>
>>>> On Jun 3, 2013, at 7:17 PM, Edison Su wrote:
>>>>
>>>>> Let's start a new thread about NFS cache storage issues on object_store.
>>>>> First, I'll go through how NFS storage works on the master branch, then how it works on the object_store branch, then let's talk about the "issues".
>>>>>
>>>>> 0. Why do we need NFS secondary storage?
>>>>>
>>>>> NFS secondary storage is used as a place to store templates/snapshots etc.; it's zone wide, and it's widely supported by most hypervisors (except Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage, like S3/Swift, CloudStack added support for Swift in 3.x, and S3 in 4.0. You may wonder: if S3/Swift is used as the place to store templates/snapshots, then why do we still need NFS secondary storage?
>>>>>
>>>>> There are two reasons for that:
>>>>>
>>>>> a. CloudStack storage code is tightly coupled with NFS secondary storage, so when adding Swift/S3 support, it was easier to take a shortcut and leave NFS secondary storage as it is.
>>>>>
>>>>> b. Certain hypervisors, and certain storage related operations, can not directly operate on object storage.
>>>>> Examples:
>>>>>
>>>>> b.1 When backing up a snapshot (a snapshot taken on the XenServer hypervisor) from primary storage to S3:
>>>>>
>>>>> If there are snapshot chains on the volume, and we want to coalesce the snapshot chains into a new disk and then copy it to S3, we either coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) supported by XenServer.
>>>>>
>>>>> If we coalesce it on primary storage, we may blow up the primary storage, as the coalesced new disk may need a lot of space (consider that the new disk will contain all the content from the leaf snapshot all the way up to the base template), but the primary storage was not planned for this operation (the CloudStack mgt server is unaware of this operation; the mgt server may think the primary storage still has enough space to create volumes).
>>>>>
>>>>> XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to use another storage supported by XenServer; that's why NFS storage is used during snapshot backup. So what we do is first call the XenServer API to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we do on both the master branch and the object_store branch.
>>>>>
>>>>> b.2 When creating a volume from a snapshot, if the snapshot is stored on S3:
>>>>>
>>>>> If the snapshot is a delta snapshot, we need to coalesce the chain into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents somewhere, then coalesce them with XenServer's tools. Again, there are two options: download all the snapshots into primary storage, or download them into NFS storage.
>>>>>
>>>>> If we download all the snapshots into primary storage directly from S3, then first we need to find a way to import the snapshot from S3 into primary storage (if the primary storage is a block device, extra care is needed) and then coalesce them. If we go this way, we need to find a primary storage with enough space, and even worse, if the primary storage is not zone-wide, then later on we may need to copy the volume from one primary storage to another, which is time consuming.
>>>>>
>>>>> If we download all the snapshots into NFS storage from S3, then coalesce them, and then copy the volume to primary storage: as the NFS storage is zone wide, you can copy the volume into whatever primary storage you like, without an extra copy.
>>>>> This is what we do on both the master branch and the object_store branch.
>>>>>
>>>>> b.3 Some hypervisors, or some storages, do not support directly importing a template into primary storage from a URL. For example, if Ceph is used as primary storage, when importing a template into RBD, we need to transform a Qcow2 image into a RAW disk, then into RBD format 2. In order to transform a Qcow2 image into a RAW disk, you need an extra file system: either a local file system (this is what other stacks do, which is not scalable to me), or NFS storage (this is what can be done on both master and object_store). Or one can modify the hypervisor or storage to support directly importing a template from S3 into RBD. Here is the link (http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html) that Wido posted.
>>>>>
>>>>> Anyway, there are so many combinations of hypervisors and storages: for some hypervisors with zone-wide file-system-based storage (e.g. KVM + gluster/NFS as primary storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which can import a template from a URL regardless of which storage you are using, then you don't need extra NFS storage. But if you are using XenServer, in order to create a volume from a delta snapshot, you will need NFS storage, or if you are using KVM + Ceph, you may also need NFS storage.
>>>>>
>>>>> Due to the above reasons, NFS cache storage is needed in certain cases if S3 is used as secondary storage. The combinations of hypervisors and storages are quite complicated; whether to use cache storage or not should be decided case by case. But as long as CloudStack provides a framework that gives people the choice to enable/disable cache storage on their own, then I think the framework is good enough.
>>>>>
>>>>> 1. Then let's talk about how NFS storage works on the master branch, with or without S3.
>>>>>
>>>>> If S3 is not used, here is how NFS storage is used:
>>>>>
>>>>> 1.1 Register a template/ISO: CloudStack downloads the template/ISO into NFS storage.
>>>>>
>>>>> 1.2 Backup snapshot: CloudStack sends a command to the XenServer hypervisor and issues a vdi.copy command to copy the snapshot to NFS; for KVM, it directly uses "cp" or "qemu-img convert" to copy the snapshot into NFS storage.
>>>>>
>>>>> 1.3 Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.
>>>>>
>>>>> If S3 is used:
>>>>>
>>>>> 1.4 Register a template/ISO: download the template/ISO into NFS storage first; then a background thread uploads the template/ISO from NFS storage into S3 periodically. The template being in the Ready state only means the template is stored on NFS storage; the admin doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones, CloudStack will copy the template from one zone-wide NFS storage into another NFS storage in another zone, even though there is already a region-wide S3 available. As the template is not directly uploaded to S3 when registering a template, it takes several copies to spread the template region wide.
>>>>>
>>>>> 1.5 Backup snapshot: CloudStack sends a command to the XenServer hypervisor to copy the snapshot to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The snapshot being in the BackedUp state not only means the snapshot is in NFS storage, but also means it's stored on S3.
>>>>>
>>>>> 1.6 Create volume from snapshot: download the snapshot and its parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
>>>>>
>>>>> 2. Then let's talk about how it works on object_store:
>>>>>
>>>>> If S3 is not used, there is ZERO change from the master branch. How NFS secondary storage worked before is exactly how it works on object_store.
>>>>>
>>>>> If S3 is used, and NFS cache storage is also used (which is the default):
>>>>>
>>>>> 2.1 Register a template/ISO: the template/ISO is directly uploaded to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. It implies that the template is immediately available in the region as soon as it's in the Ready state. And the admin clearly knows the status of the template on S3: what percentage of the upload is done, did it fail or succeed? Also, if registering the template failed for some reason, the admin can issue the register template command again. I would say the change in how a template is registered into S3 is far better than what we did on the master branch.
>>>>>
>>>>> 2.2 Backup snapshot: it's the same as the master branch -- send a command to the XenServer host, copy the snapshot into NFS, then upload to S3.
>>>>>
>>>>> 2.3 Create volume from snapshot: it's the same as the master branch -- download the snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
>>>>>
>>>>> From the above few typical usage cases, you may understand how S3 and NFS cache storage are used, and what the difference between the object_store branch and the master branch is: basically, we only change the way a template is registered, nothing else.
>>>>>
>>>>> If S3 is used, and no NFS cache storage is used (it's possible, depending on which datamotion strategy is used):
>>>>>
>>>>> 2.4 Register a template/ISO: it's the same as 2.1
>>>>> 2.5 Backup snapshot: export the snapshot from primary storage into S3 directly
>>>>> 2.6 Create volume from snapshot: download snapshots from S3 into primary storage directly, then coalesce and create the volume from them.
>>>>>
>>>>> Hopefully the above explanation tells the truth about how the system works on object_store, and clarifies the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want the existing NFS storage, that's definitely OK; it works the same as before. If you want to use S3, we provide a better S3 implementation when registering a template/ISO. If you want to use S3 without NFS storage, that's also definitely OK; the framework is quite flexible and can accommodate different solutions.
>>>>>
>>>>> OK, let's talk about the NFS cache storage issues.
>>>>> The issues with NFS cache storage have been discussed in several threads, back and forth. All in all, the NFS cache storage is only one usage case out of three supported by the object_store branch. It's not something where, if it has an issue, then nothing works.
>>>>> In 2.2 and 2.3 above, you can see how the NFS cache storage is involved during snapshot related operations. The complaints that there is no aging policy and no capacity planner for NFS cache storage apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can be used up if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
>>>>>
>>>>> Let's talk about the code. Here is the code related to NFS cache storage -- not much, only one class depends on NFS cache storage:
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>>>>> Take copyVolumeFromSnapshot as an example, which will be called when creating a volume from a snapshot: it first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage.
>>>>> StorageCacheManagerImpl->createCacheObject is the only place that creates objects on NFS cache storage; the code is at
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>>>>> In createCacheObject, it first finds a cache storage, in case there are multiple cache storages available in a scope:
>>>>> DataStore cacheStore = this.getCacheStorage(scope);
>>>>> getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place to choose NFS cache storage based on certain criteria; the current implementation only randomly chooses one of them. We can add a new allocator algorithm based on capacity, etc.
>>>>>
>>>>> Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this entry to store capacity information about NFS cache storages (such as total size, available/used capacity, etc.). So on every call to createCacheObject, we can call StorageCacheAllocator to find a proper NFS storage based on first-fit criteria, then increase the used capacity in the op_host_capacity table. If creating the cache object fails, return the capacity to op_host_capacity.
>>>>>
>>>>> Regarding the aging-out policy, we can start a background thread on the mgt server, which will scan all the objects created on NFS cache storage (the tables are snapshot_store_ref, template_store_ref, volume_store_ref). Each entry in these tables has a column called "updated"; every time the object's state is changed, the "updated" column gets updated as well.
>>>>> When is the object's state changed?
>>>>> Every time the object is used in some context (such as copying the snapshot on NFS cache storage somewhere else), the object's state is changed accordingly, for example to "Copying", meaning the object is being copied to some place -- which is exactly the information we need to implement an LRU algorithm.
>>>>>
>>>>> What do you guys think about the fix? If you have a better solution, please let me know.
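To illustrate the aging-out thread Edison describes above (the table, column, and state names here are only illustrative and follow the discussion in this thread; they are not necessarily the real schema), a background reaper could look roughly like this:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative aging-out reaper for staged objects on cache storage. */
    public class StagingReaper implements Runnable {

        private final Connection conn;
        private final int idleDays;

        public StagingReaper(Connection conn, int idleDays) {
            this.conn = conn;
            this.idleDays = idleDays;
        }

        @Override
        public void run() {
            try {
                for (String table : new String[] {"template_store_ref", "snapshot_store_ref", "volume_store_ref"}) {
                    // Candidates: unreferenced entries that have not been touched recently.
                    String select = "SELECT id FROM " + table
                            + " WHERE ref_cnt = 0 AND state = 'Ready'"
                            + " AND updated < DATE_SUB(NOW(), INTERVAL ? DAY)";
                    List<Long> ids = new ArrayList<>();
                    try (PreparedStatement ps = conn.prepareStatement(select)) {
                        ps.setInt(1, idleDays);
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                ids.add(rs.getLong(1));
                            }
                        }
                    }
                    for (long id : ids) {
                        // Mark Destroying only if still unreferenced -- closes the race with a new reader.
                        String update = "UPDATE " + table
                                + " SET state = 'Destroying' WHERE id = ? AND ref_cnt = 0 AND state = 'Ready'";
                        try (PreparedStatement ps = conn.prepareStatement(update)) {
                            ps.setLong(1, id);
                            if (ps.executeUpdate() == 1) {
                                // Here the mgt server would send the delete command to the SSVM.
                            }
                        }
                    }
                }
            } catch (SQLException e) {
                // Log and retry on the next scheduled run.
            }
        }
    }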