cloudstack-dev mailing list archives

From Thomas O'Dowd <tpod...@cloudian.com>
Subject Re: Object based Secondary storage.
Date Fri, 07 Jun 2013 01:10:49 GMT
Hi guys,

The ETAG is an interesting subject. AWS currently maintains 2 different
types of ETAGS for objects that I know of.

  a) PUT OBJECT - the assigned ETAG is calculated from the MD5 checksum
of the data content that you are uploading. When uploading you should
also always set the Content-MD5 header so that AWS (or another S3 store)
can verify your MD5 checksum against what it receives. On AWS the ETAG
for such objects is the MD5 checksum of the content, but I guess it
doesn't have to be for other S3 stores. What's important is that AWS
will reject your upload if the MD5 checksum it calculates is not the
same as your Content-MD5 header.
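
As a rough illustration of the Content-MD5 point above, here is a minimal
Python sketch. The function names are illustrative only and do not come from
any SDK. The header carries the base64 encoding of the raw 16-byte MD5
digest, while the ETAG AWS returns for a plain PUT is the same digest written
in hex, so the two strings look different even when they agree.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Content-MD5 header value: base64 of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def md5_hex(data: bytes) -> str:
    """Hex MD5 of the data, the form AWS uses as the ETAG of a plain PUT."""
    return hashlib.md5(data).hexdigest()


def etag_matches(data: bytes, etag: str) -> bool:
    """Compare a plain-PUT ETAG (possibly wrapped in quotes) with our own MD5."""
    return etag.strip('"').lower() == md5_hex(data)
```

The comparison in etag_matches only makes sense for plain PUT objects on a
store that uses the MD5 as the ETAG; it is not meaningful for multipart
objects.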

  b) MULTIPART OBJECTS - A multipart object is an object which is
uploaded using multiple PUT requests, each of which uploads one part.
Parts can be uploaded out of order and in parallel, so AWS cannot
calculate the MD5 checksum for the entire object without waiting until
all parts have been uploaded and then reprocessing all the data. This
would be very heavy for various reasons so they don't do this. The ETAG
therefore cannot be the MD5 checksum of the content either. I don't know
exactly how AWS calculates their ETAG for multipart objects but the ETAG
will always take the form of XXXXXXXX-YYY where the X part looks like a
regular MD5 checksum of sorts and the Y part is the number of parts that
made up the upload. Therefore you can always tell that an object was
uploaded using a multipart upload by checking whether its ETAG ends with
-YYY. This however may be only true for AWS - other S3 stores may do it
differently. You should really just treat the ETAG as opaque.
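
To make the XXXXXXXX-YYY form concrete, here is a small Python sketch that
checks for it. The helper name and the regular expression are assumptions
made for illustration; they encode only the AWS-style shape described above
(32 hex characters, a dash, then a part count), so any other ETAG simply
comes back as "not recognised" and should be treated as opaque.

```python
import re
from typing import Optional

# AWS-style multipart ETAG: 32 hex characters, a dash, then the part count.
# The surrounding double quotes are optional because some tools strip them.
_MULTIPART_ETAG = re.compile(r'^"?([0-9a-fA-F]{32})-(\d+)"?$')


def multipart_part_count(etag: str) -> Optional[int]:
    """Return the part count if the ETAG has the AWS multipart shape, else None."""
    match = _MULTIPART_ETAG.match(etag.strip())
    return int(match.group(2)) if match else None
```

A client that verifies downloads by comparing its own MD5 with the ETAG
could skip that comparison whenever this returns a number, since the ETAG is
then not the MD5 of the content.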

Some more best practices for multipart uploads:
1. Always calculate the MD5 checksum of each part and send the
Content-MD5 header. This way AWS can verify the content of each part as
you upload it.
2. Always retain the ETAG for each part as returned by the response of
each part upload. You should have an etag for each part you uploaded.
3. Refrain from asking the server for a list of parts in order to create
the final Multipart Upload complete request. Always use your list of
parts and your list of ETAGS (from point 2). The exception is when you
are doing recovery after some client crash.

The main reason for this is that AWS and most other S3 stores are based
on eventual consistency, so the server usually, but not always, gives
you a correct list of parts. The Multipart Upload complete request
also lets you leave parts out, so if you ask the server for a list of
parts and it temporarily misses one, you may end up with an object that
is missing that part.
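
The three practices above can be sketched in Python roughly as follows. This
is an illustration under stated assumptions, not a reference implementation:
the function name upload_multipart is made up, the client is passed in and is
assumed to expose boto3-style S3 methods (create_multipart_upload,
upload_part, complete_multipart_upload, abort_multipart_upload), parts are
held in memory and sent one after another, and aborting on failure is an
extra choice that the points above do not cover.

```python
import base64
import hashlib
from typing import Iterable


def upload_multipart(client, bucket: str, key: str, parts: Iterable[bytes]):
    """Upload parts in order, then complete using only locally retained ETAGs."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    retained = []  # our own record of part numbers and ETAGs (point 2)
    try:
        for number, body in enumerate(parts, start=1):
            # Point 1: send Content-MD5 so the store can verify each part.
            md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
            response = client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=number, Body=body, ContentMD5=md5,
            )
            retained.append({"PartNumber": number, "ETag": response["ETag"]})
        # Point 3: build the complete request from our list, not from the server.
        return client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": retained},
        )
    except Exception:
        # Assumption: discard the partial upload rather than attempt recovery.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

With boto3 installed, the client would typically be boto3.client("s3"). The
recovery case mentioned in point 3 (listing parts after a client crash) is
deliberately left out of this sketch.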

Btw, shameless plug but Cloudian has very good compatibility with AWS
and has a community edition version that is free for up to 100TB. I'll
test against it but you may also like to. You can run it on a single
node with not much fuss. Feel free to ask me about it offline.

Anyway hope that helps,

Tom.

On Thu, 2013-06-06 at 22:57 +0000, Edison Su wrote:
> The ETags created by Riak CS and Amazon S3 seem a little bit different in the case
> of multipart upload.
> 
> Here is the result I tested on both RIAK CS and Amazon S3, with s3cmd.
> Test environment:
> s3cmd version: 1.5.0-alpha1
> Riak cs:
> Name        : riak
> Arch        : x86_64
> Version     : 1.3.1
> Release     : 1.el6
> Size        : 40 M
> Repo        : installed
> From repo   : basho-products
> 
> The command I used to put:
> s3cmd put some-file s3://some-path --multipart-chunk-size-mb=100 -v -d
> 
> The etag created for the file, when using Riak CS is WxEUkiQzTWm_2C8A92fLQg==
> 
> DEBUG: Sending request method_string='POST', uri='http://imagestore.s3.amazonaws.com/tmpl/1/1/routing-1/test?uploadId=kfDkh7Q_QCWN7r0ZTqNq4Q==',
headers={'content-length': '309', 'Authorization': 'AWS OYAZXCAFUC1DAFOXNJWI:xlkHI9tUfUV/N+Ekqpi7Jz/pbOI=',
'x-amz-date': 'Thu, 06 Jun 2013 22:54:28 +0000'}, body=(309 bytes)
> DEBUG: Response: {'status': 200, 'headers': {'date': 'Thu, 06 Jun 2013 22:40:09 GMT',
'content-length': '326', 'content-type': 'application/xml', 'server': 'Riak CS'}, 'reason':
'OK', 'data': '<?xml version="1.0" encoding="UTF-8"?><CompleteMultipartUploadResult
xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://imagestore.s3.amazonaws.com/tmpl/1/1/routing-1/test</Location><Bucket>imagestore</Bucket><Key>tmpl/1/1/routing-1/test</Key><ETag>kfDkh7Q_QCWN7r0ZTqNq4Q==</ETag></CompleteMultipartUploadResult>'}
> 
> While the etag created by Amazon S3 is: &quot;70e1860be687d43c039873adef4280f2-3&quot;
> 
> DEBUG: Sending request method_string='POST', uri='/fixes/icecake/systdfdfdfemvm.iso1?uploadId=vdkPSAtaA7g.fdfdfdfdf..iaKRNW_8QGz.bXdfdfdfdfdfkFXwUwLzRcG5obVvJFDvnhYUFdT6fYr1rig--',

> DEBUG: Response: {'status': 200, 'headers': {, 'server': 'AmazonS3', 'transfer-encoding':
'chunked', 'connection': 'Keep-Alive', 'x-amz-request-id': '8DFF5D8025E58E99', 'cache-control':
'proxy-revalidate', 'date': 'Thu, 06 Jun 2013 22:39:47 GMT', 'content-type': 'application/xml'},
'reason': 'OK', 'data': '<?xml version="1.0" encoding="UTF-8"?>\n\n<CompleteMultipartUploadResult
xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://fdfdfdfdfdfdf</Location>Key>fixes/icecake/systemvm.iso1</Key><ETag>&quot;70e1860be687d43c039873adef4280f2-3&quot;</ETag></CompleteMultipartUploadResult>'}
> 
> So the etag created on Amazon S3 has "-" (dash) in it, but there is only "_" (underscore)
> on Riak CS.
>
> Do you know the reason? What do we need to do to make it compatible with the Amazon S3
> SDK?
> 
> > -----Original Message-----
> > From: John Burwell [mailto:jburwell@basho.com]
> > Sent: Thursday, June 06, 2013 2:03 PM
> > To: dev@cloudstack.apache.org
> > Subject: Re: Object based Secondary storage.
> > 
> > Min,
> > 
> > Are you calculating the MD5 or letting the Amazon client do it?
> > 
> > Thanks,
> > -John
> > 
> > On Jun 6, 2013, at 4:54 PM, Min Chen <min.chen@citrix.com> wrote:
> > 
> > > Thanks Tom. Indeed I have an S3 question that needs some advice from
> > > some S3 experts. To support uploading objects > 5G, I have used
> > > TransferManager.upload to upload the object to S3. The upload went fine
> > > and the object was successfully put to S3. However, later on when I use
> > > "s3cmd get <object key>" to retrieve this object, I always get this exception:
> > >
> > > "MD5 signatures do not match: computed=Y, received="X"
> > >
> > > It seems that Amazon S3 kept a different MD5 sum for the multi-part
> > > uploaded object. We have been using Riak CS for our S3 testing. If I
> > > switch to not using multi-part upload and directly invoke S3
> > > putObject, I do not run into this issue. Have you had such an experience
> > > before?
> > >
> > > -min
> > >
> > > On 6/6/13 1:56 AM, "Thomas O'Dowd" <tpodowd@cloudian.com> wrote:
> > >
> > >> Thanks Min. I've printed out the material and am reading new threads.
> > >> Can't comment much yet until I understand things a bit more.
> > >>
> > >> Meanwhile, feel free to hit me up with any S3 questions you have. I'm
> > >> looking forward to playing with the object_store branch and testing
> > >> it out.
> > >>
> > >> Tom.
> > >>
> > >> On Wed, 2013-06-05 at 16:14 +0000, Min Chen wrote:
> > >>> Welcome Tom. You can check out this FS
> > >>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Storage+Backup+Object+Store+Plugin+Framework
> > >>> for secondary storage architectural work done in
> > >>> object_store branch. You may also check out the following recent
> > >>> threads regarding 3 major technical questions raised by community as
> > >>> well as our answers and clarification.
> > >>>
> > >>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3C77B337AF224FD84CBF8401947098DD87036A76%40SJCPEX01CL01.citrite.net%3E
> > >>>
> > >>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3CCDD22955.3DDDC%25min.chen%40citrix.com%3E
> > >>>
> > >>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3CCDD2300D.3DE0C%25min.chen%40citrix.com%3E
> > >>>
> > >>>
> > >>> That branch is mainly worked on by Edison and me, and we are at PST
> > >>> timezone.
> > >>>
> > >>> Thanks
> > >>> -min
> > >> --
> > >> Cloudian KK - http://www.cloudian.com/get-started.html
> > >> Fancy 100TB of full featured S3 Storage?
> > >> Checkout the Cloudian(r) Community Edition!
> > >>
> > >
> 

-- 
Cloudian KK - http://www.cloudian.com/get-started.html
Fancy 100TB of full featured S3 Storage?
Checkout the Cloudian® Community Edition!

