couchdb-dev mailing list archives

From Robert Newson <rnew...@apache.org>
Subject Re: Chunked input for multipart/related requests
Date Wed, 04 Dec 2013 11:41:10 GMT
I will try to review this sometime this week if I can. I've worked in the
attachment code area before. The idea behind the speedup makes sense
to me but given the potential for data loss / corruption it'll need
careful review before it can go in. One thing to note is that the
replicator will verify the md5 of attachments as they're transferred,
if that helps with your testing.

B.


On 4 December 2013 10:06, Nick North <north.n@gmail.com> wrote:
> As mentioned the other day, I'm hoping to add CouchDb support for chunked
> HTTP requests that contain a document and attachments as a single
> multipart/related MIME request, and I'm hoping the group can advise me on
> the best coding direction. Apologies in advance for the length and detail
> of the email, but there doesn't seem to be a shorter way to ask the
> question with a sensible amount of background.
>
> Parsing multipart requests happens
> in couch_httpd:parse_multipart_request/3. This function scans the request
> for the MIME boundary string, reading 4KB blocks of data as needed. The
> pieces of data between boundary strings are passed to callback functions
> for further processing. The function to read the next block of data is an
> argument to parse_multipart_request called DataFun; it returns the data
> block plus the function to be used as the next DataFun. I think of this as
> a pull-based approach: data is pulled from the request as needed, with the
> pull returning some data and a new pull function.
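
The pull-based pattern described above can be sketched as follows. This is an illustrative Python sketch, not the Erlang code in couch_httpd: `make_data_fun` and `read_until` are hypothetical names, and only the 4KB block size and the "return data plus the next pull function" shape come from the description.

```python
import io

BLOCK_SIZE = 4096  # the 4KB block size mentioned in the email


def make_data_fun(stream, block_size=BLOCK_SIZE):
    """Return a pull function: calling it yields (block, next_data_fun)."""
    def data_fun():
        block = stream.read(block_size)
        # Hand back the data plus the function to use for the next pull.
        return block, make_data_fun(stream, block_size)
    return data_fun


def read_until(boundary, data_fun):
    """Pull blocks until `boundary` is seen.

    Returns (bytes_before_boundary, bytes_after_boundary, next_data_fun).
    """
    buffer = b""
    while boundary not in buffer:
        block, data_fun = data_fun()
        if not block:
            raise ValueError("boundary not found before end of input")
        # Accumulating blocks also handles a boundary split across two reads.
        buffer += block
    before, _, after = buffer.partition(boundary)
    return before, after, data_fun
```

The caller decides when to read: nothing is consumed from the stream until `data_fun` is invoked.
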
>
> The natural extension to handle chunked requests would be to provide an
> improved DataFun that can grab the next 4KB block from either a chunked or
> an unchunked request. So I looked for existing support for chunked requests
> that could be reused. The chunked equivalent of the couch_httpd:recv/2
> function that's used to pull 4KB blocks is the couch_httpd:recv_chunked/4
> function. This calls the Mochiweb stream_body/3 function which, it
> transpires, was created for use in CouchDb. However, this differs in
> philosophy from the recv function: while recv just hands back a block of
> data, stream_body reads the whole of the request and calls a ChunkFun
> parameter on each block of data that it reads. I think of this as a
> push-based approach: the entire stream is read and pushed into a callback
> function, one block at a time.
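
For contrast, the push-based pattern might look like the sketch below. The function is named after stream_body for readability only; its signature and the `collect` callback are assumptions made for illustration and are not MochiWeb's actual API.

```python
import io


def stream_body(stream, max_chunk_size, chunk_fun, state):
    """Read the whole stream, pushing each block into chunk_fun.

    chunk_fun(block, state) returns the new state. An empty block is
    pushed last to signal the end of the body.
    """
    while True:
        block = stream.read(max_chunk_size)
        state = chunk_fun(block, state)
        if not block:
            return state


def collect(block, blocks):
    # Example callback: accumulate the non-empty blocks in a list.
    return blocks + [block] if block else blocks
```

Here the reader owns the loop: once `stream_body` is called, the callback has no way to pause between blocks, which is the mismatch with the pull-based parser.
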
>
> I can think of three ways to fix the mismatch between the pull and
> push-based approaches and provide chunked multipart support:
>
>    1. Rework parse_multipart_request to be push-based. This would allow
>    reuse of stream_body, but at the cost of turning existing code inside out
>    to fit with its push approach.
>    2. Create a pull-based version of stream_body and probably try to get it
>    incorporated into Mochiweb. But having two similar versions of the same
>    code like this doesn't feel right.
>    3. Convert stream_body from push-based to pull-based by spawning it in a
>    new process that sends each block of data back to the
>    parse_multipart_request DataFun and then blocks until the message is
>    acknowledged. The DataFun receives the data when it needs to fetch the next
>    block, and then sends an acknowledgement.
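
The third option can be sketched roughly as below, using a Python thread and two queues in place of an Erlang process and its mailbox. All names here (`push_reader`, `make_pull_fun`) are hypothetical. Python threads share memory, so the sketch shows only the send/acknowledge handshake and says nothing about the cost of copying data between Erlang processes.

```python
import io
import queue
import threading


def push_reader(stream, block_size, chunk_fun):
    """Push-based reader: calls chunk_fun on every block, then on b'' at the end."""
    while True:
        block = stream.read(block_size)
        chunk_fun(block)
        if not block:
            return


def make_pull_fun(stream, block_size=4096):
    """Run push_reader in a worker thread and expose a pull-style function."""
    blocks = queue.Queue()
    acks = queue.Queue()

    def on_block(block):
        blocks.put(block)  # send the block to the consumer
        acks.get()         # wait until the consumer acknowledges it

    worker = threading.Thread(
        target=push_reader, args=(stream, block_size, on_block), daemon=True
    )
    worker.start()

    def data_fun():
        block = blocks.get()  # wait for the next pushed block
        acks.put(True)        # acknowledge so the reader can continue
        # An empty block marks the end; callers must not pull again after it.
        return block, data_fun

    return data_fun
```

Because the reader waits for an acknowledgement after every block, at most one block is in flight at a time, so the consumer still controls the pace of reading.
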
>
> The third option feels neatest and is my preferred route. But my ignorance
> of Erlang means that I don't know whether this is potentially expensive.
> While a new process is very cheap, it would mean that all the request data
> is copied from that process to parse_multipart_request, and I don't know if
> that is very costly. That sort of copying already goes on
> in couch_doc:doc_from_multi_part_stream where the parser is spawned off and
> copies each document and attachment back to the parent process, but I don't
> know if that means the copying is cheap, or if it's an unavoidable evil
> that shouldn't be reproduced elsewhere.
>
> I'd really appreciate any advice that the group can give me on the best
> option to follow, and why, or suggestions for options that I've missed
> altogether. Thanks in advance for your help,
>
> Nick
