couchdb-dev mailing list archives

From Brian Candler <B.Cand...@pobox.com>
Subject Re: standalone attachments and content-encoding header
Date Tue, 23 Mar 2010 13:04:31 GMT
On Tue, Mar 23, 2010 at 11:53:38AM +0000, Filipe David Manana wrote:
> If you upload attachment A in uncompressed form, the "length" field in the
> attachment stub will match the length of the attachment in uncompressed
> form. On the other hand, if you upload that same attachment in compressed
> form, that "length" field will match the length of the attachment in
> compressed form.

That would be OK if the first had no content_encoding field stored, and the
latter had content_encoding: "gzip".  In both cases the "length" field would
be an accurate reflection of what was stored, in the form it was stored.
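
To make the two cases concrete, here is a minimal Python sketch (not couchdb
code) of what the two stubs could look like if the encoding were recorded
alongside the length. The make_stub helper and the content_encoding field are
hypothetical; today's stub only carries "length".

```python
import gzip


def make_stub(body: bytes, content_encoding=None) -> dict:
    """Hypothetical stub: describes the bytes exactly as they were uploaded."""
    stub = {"length": len(body), "stub": True}
    if content_encoding is not None:
        # Only recorded when the client sent a Content-Encoding header.
        stub["content_encoding"] = content_encoding
    return stub


identity_body = b"some attachment text\n" * 1000
gzipped_body = gzip.compress(identity_body)

plain_stub = make_stub(identity_body)
gzip_stub = make_stub(gzipped_body, content_encoding="gzip")
```

With both fields present, a reader of either stub can tell which form the
length refers to, even though the two lengths differ.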

(It might be a bit odd that two clients could upload the 'same' document,
one encoded and one not, and you'd see them differently.  It depends whether
you think it's couchdb's job to normalise the documents, or just to store
what it's given.)

> Uncompressing the attachment just to calculate its identity length seems a
> bit heavy, no?

Well yes, but it's better than just being wrong(*) as couchdb is at the
moment.

A simpler solution would be for couchdb to reject, with a 415, any upload
that carries a Content-Encoding: header.  That's correct but brutal, and
asymmetrical with attachment downloading, where a request with
"Accept-Encoding: gzip" can return you a "Content-Encoding: gzip".
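
As a rough illustration of the strict option, the check amounts to something
like the Python sketch below. The function name and the dict-of-headers shape
are assumptions made for the example; this is not how couchdb's HTTP layer is
actually written.

```python
def status_for_attachment_put(headers: dict) -> int:
    """Pick the status a strict server would return for an attachment PUT.

    Any Content-Encoding header at all is refused with 415 Unsupported
    Media Type; everything else is accepted as 201 Created.
    """
    for name in headers:
        # Header names are case-insensitive in HTTP.
        if name.lower() == "content-encoding":
            return 415
    return 201
```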

If it's going to accept encodings on upload then it either needs to store
the content-encoding (and return it on a subsequent GET), or normalise to
identity form, which involves decompressing [if only to get the length] but
is otherwise simple and unsurprising.
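
Getting the identity length need not mean holding the whole decompressed
attachment in memory. A minimal Python sketch, assuming the upload arrives as
an iterable of byte chunks forming a single gzip member (identity_length is a
made-up name, and this is Python's zlib rather than anything in couchdb):

```python
import gzip
import zlib


def identity_length(chunks) -> int:
    """Count the decompressed bytes of a gzip stream, one chunk at a time."""
    # Adding 16 to the window size tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    total = 0
    for chunk in chunks:
        total += len(decomp.decompress(chunk))
    total += len(decomp.flush())
    return total


# Usage: a body the same size as the magic file in the transcript below.
original = b"x" * 544773
compressed = gzip.compress(original)
chunks = [compressed[i:i + 4096] for i in range(0, len(compressed), 4096)]
length = identity_length(chunks)
```

Only one chunk of output is alive at a time, so the cost is CPU rather than
memory.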

If it's going to store the content-encoding then it needs to be decided how
this interacts with the transparent gzipping to disk which (usually) occurs
when uploading attachments.

One option is to reveal it: that is, the user could upload a document and
then subsequently see
  content_encoding: "gzip"
  content_length: <some smaller value>
That's an API change and might be surprising.

But if not, there might be two different ways a gzipped doc could be stored
on disk; either explicitly (uploaded with Content-Encoding: gzip) or
implicitly (uploaded as identity, couchdb decided to gzip it). Normalising
avoids this issue.

Incidentally, I don't think couchdb *always* gzips attachments. I remember
reading a long discussion on a ticket about heuristics to avoid
re-compressing anything already compressed, or inherently incompressible.
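
I don't have the ticket to hand, so the following is only a guess at the
general shape of such a heuristic, in Python. The type patterns, the
magic-number check and the should_compress name are all assumptions for
illustration, not couchdb's actual rules.

```python
from fnmatch import fnmatchcase

# Assumed list of types worth compressing; purely illustrative.
COMPRESSIBLE_TYPES = ("text/*", "application/json", "application/xml")
GZIP_MAGIC = b"\x1f\x8b"


def should_compress(content_type: str, head: bytes) -> bool:
    """Decide whether to gzip an attachment before writing it to disk."""
    if head.startswith(GZIP_MAGIC):
        # Already looks like gzip data: compressing again gains nothing.
        return False
    # Drop parameters such as "; charset=utf-8" before matching.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return any(fnmatchcase(base_type, p) for p in COMPRESSIBLE_TYPES)
```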

Also: the same issue arises if we decide to store content_md5 (which would
be very useful in determining if an attachment has changed).
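
A small Python sketch of why the digest has the same problem as the length:
the MD5 of the gzipped bytes is unrelated to the MD5 of the identity bytes,
and two gzip encodings of the same content can differ from each other. The
hex format and the content_md5 helper are assumptions; the mtime argument to
gzip.compress needs Python 3.8 or later.

```python
import gzip
import hashlib


def content_md5(body: bytes) -> str:
    """Hex MD5 of exactly the bytes given (the format is an assumption)."""
    return hashlib.md5(body).hexdigest()


identity = b"the same attachment content\n" * 500

# The gzip header embeds a timestamp, so identical content compressed at
# different times yields different bytes and therefore different digests.
gz_first = gzip.compress(identity, mtime=0)
gz_second = gzip.compress(identity, mtime=1)

md5_identity = content_md5(identity)
md5_first = content_md5(gz_first)
md5_second = content_md5(gz_second)
md5_roundtrip = content_md5(gzip.decompress(gz_second))
```

Only the digest taken over the identity form (md5_identity, md5_roundtrip) is
stable enough to answer "has this attachment changed?".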

B.

(*) As shown by:

$ ls -l /usr/share/file/magic
-rw-r--r-- 1 root root 544773 2009-05-13 18:35 /usr/share/file/magic
$ cat /usr/share/file/magic | gzip -9 > magic.gz
$ ls -l magic.gz 
-rw-r--r-- 1 brian brian 151706 2010-03-23 12:35 magic.gz
$ curl -X PUT -d'{}' http://127.0.0.1:5984/testdb/foo
{"ok":true,"id":"foo","rev":"1-967a00dff5e02add41819138abb3284d"}
$ curl -v -X PUT --data-binary @magic.gz -H "Content-Encoding: gzip" http://127.0.0.1:5984/testdb/foo/att?rev=1-967a00dff5e02add41819138abb3284d
* About to connect() to 127.0.0.1 port 5984 (#0)
*   Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
> PUT /testdb/foo/att?rev=1-967a00dff5e02add41819138abb3284d HTTP/1.1
> User-Agent: curl/7.19.5 (x86_64-pc-linux-gnu) libcurl/7.19.5 OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.15
> Host: 127.0.0.1:5984
> Accept: */*
> Content-Encoding: gzip
> Content-Length: 151706
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
< HTTP/1.1 201 Created
< Server: CouchDB/0.11.0b912a3e00-git (Erlang OTP/R13B)
< Location: http://127.0.0.1:5984/testdb/foo/att
< Date: Tue, 23 Mar 2010 12:37:45 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 66
< Cache-Control: must-revalidate
< 
{"ok":true,"id":"foo","rev":"2-4a6c43c3c8b26c7fdbdc956c0d2477a7"}
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0
$ curl http://127.0.0.1:5984/testdb/foo
{"_id":"foo","_rev":"2-4a6c43c3c8b26c7fdbdc956c0d2477a7","_attachments":{"att":{"content_type":"application/x-www-form-urlencoded","revpos":2,"length":151706,"stub":true}}}
$ curl -v http://127.0.0.1:5984/testdb/foo/att | wc -c
* About to connect() to 127.0.0.1 port 5984 (#0)
*   Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
> GET /testdb/foo/att HTTP/1.1
> User-Agent: curl/7.19.5 (x86_64-pc-linux-gnu) libcurl/7.19.5 OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.15
> Host: 127.0.0.1:5984
> Accept: */*
> 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Server: CouchDB/0.11.0b912a3e00-git (Erlang OTP/R13B)
< ETag: "2-4a6c43c3c8b26c7fdbdc956c0d2477a7"
< Date: Tue, 23 Mar 2010 12:51:43 GMT
< Content-Type: application/x-www-form-urlencoded
< Cache-Control: must-revalidate
< 
{ [data not shown]
100  148k    0  148k    0     0  12.6M      0 --:--:-- --:--:-- --:--:-- 14.4M* Connection #0 to host 127.0.0.1 left intact

* Closing connection #0
151706
