incubator-couchdb-user mailing list archives

From "Eli Stevens (Gmail)" <wickedg...@gmail.com>
Subject Re: Upload speed for large attachments
Date Wed, 08 Jun 2011 20:16:37 GMT
Tilgovi on IRC asked me to open an issue:

https://issues.apache.org/jira/browse/COUCHDB-1192

Cheers,
Eli

On Wed, Jun 8, 2011 at 1:36 AM, Eli Stevens (Gmail)
<wickedgrey@gmail.com> wrote:
> Running the code below on a MacBook Pro with CouchDBX 1.0.2
> (everything local), we see the following output when attaching a file
> containing 10 MB of random data:
>
> Code: https://gist.github.com/bc0c36f36be0c85e2a36 (code included in full below)
> Output:
>
> Using curl: 0.168450117111
> Using put_attachment: 0.309157133102
> post time: 2.5557808876
> Using multipart: 2.61283898354
> Encoding base64: 0.0497629642487
> Updating: 5.0550069809
>
> Server log: https://gist.github.com/a80a495fd35049ff871f (there's a
> HEAD/DELETE/PUT/GET cycle that's just cleanup)
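>
> In throughput terms (assuming the full 10 MB goes over the wire each
> time), those times work out to roughly:
>
> for method, secs in [('curl', 0.168), ('put_attachment', 0.309),
>                      ('multipart', 2.613), ('base64 update', 5.055)]:
>     print '{}: {:.1f} MB/s'.format(method, 10.0 / secs)
>
> # curl: 59.5 MB/s
> # put_attachment: 32.4 MB/s
> # multipart: 3.8 MB/s
> # base64 update: 2.0 MB/s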
>
> The calls in question are:
>
> Using curl: 0.168450117111
> 1> [info] [<0.27828.7>] 127.0.0.1 - - 'PUT'
> /benchmark_entity/bigfile/bigfile/bigfile.gz?rev=78-db58ded2899c5546e349feb5a8c0eee4
> 201
>
> Using put_attachment: 0.309157133102
> 1> [info] [<0.27809.7>] 127.0.0.1 - - 'PUT'
> /benchmark_entity/bigfile/smallfile?rev=81-c538b38a8463952f0136143cfa49e9fa
> 201
>
> Using multipart: 2.61283898354 (post time: 2.5557808876)
> 1> [info] [<0.27809.7>] 127.0.0.1 - - 'POST' /benchmark_entity/bigfile 201
>
> Updating: 5.0550069809
> 1> [info] [<0.27809.7>] 127.0.0.1 - - 'POST' /benchmark_entity/_bulk_docs 201
>
> Profiling shows 1.5 sec of CPU usage in our client code (which includes
> setup/cleanup not counted in the times above) out of 11.8 sec of total
> run time, which roughly matches the PUT/POST times above.  So I'm
> fairly confident that the bulk of those times is spent not in our
> client code but in CouchDB's handling of the requests.
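>
> As a further check, a bare httplib PUT takes both python-couchdb and
> curl out of the picture; this is only a sketch (the helper name, path,
> and arguments are made up here), but it runs against the same local
> server and file as above:
>
> import httplib, os, time
>
> def time_raw_put(path, filename, content_type='application/gzip'):
>     # stream the file over a raw httplib connection and time the
>     # complete request/response round trip
>     conn = httplib.HTTPConnection('localhost', 5984)
>     try:
>         with open(filename, 'rb') as f:
>             headers = {'Content-Type': content_type,
>                        'Content-Length': str(os.path.getsize(filename))}
>             t0 = time.time()
>             conn.request('PUT', path, f, headers)
>             resp = conn.getresponse()
>             resp.read()
>             return resp.status, time.time() - t0
>     finally:
>         conn.close()
>
> # e.g. time_raw_put('/benchmark_entity/bigfile/raw.gz?rev=' + doc.rev, fn)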
>
> Why is the form/multipart handler so much slower than a bare PUT of
> the attachment?  And why is the base64 approach slower still?  Is it
> due to bandwidth issues, CouchDB CPU usage...?
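>
> One factor that's easy to quantify: base64 inflates the body by a
> third, so the _bulk_docs request carries roughly 13.3 MB of JSON for
> CouchDB to parse and decode, not 10 MB:
>
> import base64
> payload = '\x00' * (10 * 2**20)   # stand-in for the 10 MB file
> print len(base64.b64encode(payload)) / float(len(payload))   # ~1.33
>
> That alone can't explain the whole gap, though.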
>
> Thanks for any help,
> Eli
>
> Full code from: https://gist.github.com/bc0c36f36be0c85e2a36
>
> import base64
> import contextlib
> import cStringIO
> import subprocess
> import time
>
> import couchdb
> import couchdb.json
> import couchdb.multipart
>
> @contextlib.contextmanager
> def stopwatch(m=''):
>    t0=time.time()
>    yield
>    tdiff=time.time() - t0
>    if m:
>        print '{}: {}'.format(m, tdiff)
>    else:
>        print tdiff
>
> def reset(d):
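>    # delete and recreate the 'bigfile' doc so each timing starts from
>    # a fresh revision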
>    try:
>        del d['bigfile']
>    except couchdb.http.ResourceNotFound:
>        pass
>    d['bigfile'] = {'foo': 'bar'}
>    return d['bigfile']
>
> s = couchdb.Server()
> d = s['benchmark_entity']
>
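> # note: the second assignment below overrides the first, so every
> # timing reads /tmp/smallfile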
> fn = '/tmp/bigfile.gz'
> fn = '/tmp/smallfile'
>
> doc = reset(d)
> with stopwatch('Using curl'):
>    p = subprocess.Popen([
>        'curl',
>        '-X', 'PUT',
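>        # doc.id is 'bigfile', so this URL repeats it; hence the
>        # doubled '/bigfile/bigfile/bigfile.gz' path in the log above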
>        'http://localhost:5984/benchmark_entity/{}/bigfile/bigfile.gz?rev={}'.format(doc.id, doc.rev),
>        '-d', '@{}'.format(fn),
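>        # nb: curl's -d strips CR/LF from @-file data; --data-binary
>        # would send the bytes verbatim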
>        '-H', 'Content-Type: application/gzip'
>        ])
>    p.wait()
>
> doc = reset(d)
> with open(fn, 'rb') as f:
>    with stopwatch('Using put_attachment'):
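>        # python-couchdb names the attachment after f.name, hence
>        # 'smallfile' in the log above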
>        d.put_attachment(doc, f)
>
> doc = reset(d)
> with open(fn, 'rb') as f:
>    content_name = 'bigfile.gz'
>    content = f.read()
>    content_type = 'application/gzip'
>    with stopwatch('Using multipart'):
>        fileobj = cStringIO.StringIO()
>
>        with couchdb.multipart.MultipartWriter(fileobj, headers=None, subtype='form-data') as mpw:
>            mime_headers = {'Content-Disposition': '''form-data; name="_doc"'''}
>            mpw.add('application/json', couchdb.json.encode(doc), mime_headers)
>
>            mime_headers = {'Content-Disposition': '''form-data; name="_attachments"; filename="{}"'''.format(content_name)}
>            mpw.add(content_type, content, mime_headers)
>
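>        # MultipartWriter writes its own Content-Type line (with the
>        # boundary) first; split it off to send as the real HTTP header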
>        header_str, blank_str, body = fileobj.getvalue().split('\r\n', 2)
>
>        http_headers = {'Referer': d.resource.url, 'Content-Type': header_str[len('Content-Type: '):]}
>        params = {}
>        t0 = time.time()
>        status, msg, data = d.resource.post(doc['_id'], body, http_headers, **params)
>        print 'post time: {}'.format(time.time() - t0)
>
> doc = reset(d)
> with open(fn, 'rb') as f:
>    content_name = 'bigfile.gz'
>    content = f.read()
>    content_type = 'application/gzip'
>    with stopwatch('Encoding base64'):
>        doc['_attachments'] = {content_name: {'content_type': content_type, 'data': base64.b64encode(content)}}
>    with stopwatch('Updating'):
>        d.update([doc])
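>
> (The test files themselves aren't shown above; 10 MB of random data
> can be generated with something like the following.)
>
> import os
> with open('/tmp/smallfile', 'wb') as f:
>     f.write(os.urandom(10 * 2**20))   # 10 MB of random bytes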
>
