cloudstack-dev mailing list archives

From: Min Chen <min.c...@citrix.com>
Subject: [DISCUSS] OBJECT_STORE branch design: Error handling in case of S3 as native secondary storage
Date: Mon, 03 Jun 2013 18:07:20 GMT
Hi there,
This thread is to address John's comments about missing error handling when S3 is used as secondary
storage in the object_store branch implementation. From the previous merge email thread, I realize
that we may not have explained clearly in the FS how S3 should work in the new object_store branch,
which has caused several confusions. Let's make it clear here.

1. The goal of the object_store branch is to make S3 serve as NATIVE secondary storage, not just
a backup device behind NFS secondary storage as in the master branch. We want users to be able to
trust that their data (templates, snapshots, volumes) is stored in the S3 object store if they
choose S3 as their CloudStack secondary storage. When users register a template to S3, we issue
S3 API calls to download the template directly into the S3 object store, instead of downloading it
to NFS secondary storage and then syncing it to S3 on a schedule, as the master branch does. When we
tell users that their data is READY on their S3 secondary storage, it really means that it is
ready to use from S3. Master offers no such guarantee: with S3 as a backup device, a snapshot
may be ready only on NFS secondary storage and not in S3 (due to a network connection issue, for
example), yet we still mislead users into thinking their snapshot is ready on S3.

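To make the intent concrete, here is a minimal sketch of what "directly issuing S3 API calls"
means, assuming the AWS SDK for Java; the class and method names (DirectTemplateRegistration,
registerTemplate) are made up for illustration and are not the actual object_store branch code:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Hypothetical illustration: stream a registered template straight into the
    // S3 object store, with no NFS staging copy and no scheduled background sync.
    public class DirectTemplateRegistration {

        public static void registerTemplate(String templateUrl, String bucket, String key,
                String accessKey, String secretKey) throws Exception {
            AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));

            URL url = new URL(templateUrl);
            URLConnection conn = url.openConnection();
            ObjectMetadata metadata = new ObjectMetadata();
            long length = conn.getContentLengthLong();
            if (length > 0) {
                metadata.setContentLength(length); // avoid the SDK buffering the whole stream
            }

            try (InputStream in = conn.getInputStream()) {
                // Only after this call returns successfully would the template be marked READY,
                // so READY always means the data is actually present in S3.
                s3.putObject(bucket, key, in, metadata);
            }
        }
    }
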
2. The NFS cache only comes into the picture when users choose S3 as their native secondary storage.
The data stored in the NFS cache is temporary and serves as an intermediate transfer stage for
CloudStack to manipulate data stored in S3; our design does not require that this intermediate
data persist in the NFS cache forever for CloudStack to remain functional. This is quite different
from the role of NFS secondary storage for S3 in the master branch, where we have to keep data on
NFS secondary storage because we cannot guarantee that the data is READY on S3, due to the
background sync issue I will mention in a minute. Theoretically, we should be able to implement a
simple LRU or FIFO cache algorithm (assuming the 4.2 feature freeze extension vote is approved) to
age out old cache data without impacting any CloudStack functionality that uses S3. I am not sure
the same is true for the NFS secondary storage data for S3 in the master branch; based on my
reading of the code it is not, but maybe I am just too new to that part of the master code.

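As an illustration of the aging idea above, here is a minimal LRU sketch; the names NfsCacheIndex
and deleteFromNfsCache are hypothetical, and the point is only that evicting a staged file from the
NFS cache is safe because the authoritative copy lives in S3 and can be re-fetched on demand:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical illustration: an LRU index of files staged in the NFS cache.
    // The eldest entries can simply be evicted; the S3 object store stays the
    // source of truth for templates, snapshots and volumes.
    public class NfsCacheIndex extends LinkedHashMap<String, String> {

        private final int maxEntries;

        public NfsCacheIndex(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true gives LRU ordering
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            if (size() > maxEntries) {
                deleteFromNfsCache(eldest.getValue()); // remove the staged file from NFS
                return true;                           // and drop its index entry
            }
            return false;
        }

        private void deleteFromNfsCache(String nfsPath) {
            // placeholder: unlink the cached file; the S3 copy remains available
            new java.io.File(nfsPath).delete();
        }
    }
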
3. We have to admit that in the current object_store implementation, we only try the S3 operations
(put, get, etc.) once; if an operation fails, we just report the error and the user has to retry
manually. On this point we can definitely do better by adding a retry mechanism based on a globally
configured retry parameter. However, in my past experience, infinite retries against these external
devices are always a bad idea. Also, we disagree with John's comment that dropping the previous
background sync process is "a step back from the current Swift and S3 implementations present in
4.1.0". We agree that the current master background sync process relieves the admin of manual
retries for some S3 errors (by the way, some errors will never recover even with a background
process, for example when capacity is full), but it also causes another severe drawback: it gives
users the misconception that their data is READY in S3 when it actually is not. Here is a simple
example: users take a snapshot in one zone and back it up to S3; given S3's region-wide nature, it
is very natural for them to think that they can immediately restore this snapshot in another zone.
However, with the current master implementation this may fail. Due to an S3 network connection
issue at backup time, the snapshot may not be ready on S3 and only stored in zone-wide NFS
secondary storage, and the background sync process has not kicked in yet. If users now try to
restore, the operation is doomed to fail because the proper snapshot cannot be found. In our
opinion, enhancing the current object_store implementation with configured retry logic is a good
compromise.

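Here is a minimal sketch of the bounded retry idea; the helper name (S3RetryUtil.withRetries) is
made up, and the actual retry count would come from a global configuration parameter rather than
being passed in directly:

    import java.util.concurrent.Callable;

    // Hypothetical illustration of bounded retries: transient errors (e.g. a network
    // hiccup) may recover on retry, while some errors (e.g. capacity full) never will,
    // so we retry a configured number of times and then report the error.
    public class S3RetryUtil {

        public static <T> T withRetries(Callable<T> s3Operation, int maxRetries) throws Exception {
            Exception lastError = null;
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try {
                    return s3Operation.call();
                } catch (Exception e) {
                    lastError = e;
                    if (attempt < maxRetries) {
                        Thread.sleep(1000L * attempt); // simple linear back-off between attempts
                    }
                }
            }
            if (lastError != null) {
                throw lastError; // surface the error to the caller/admin after the last attempt
            }
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
    }

Each S3 put/get/delete issued by the secondary storage code would be wrapped in such a helper,
with the retry count read from the global configuration, so failures are retried a bounded number
of times before the error is reported to the user.
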
Thanks.
-min

