couchdb-dev mailing list archives

From Robert Newson <rnew...@apache.org>
Subject Re: Chunked input for multipart/related requests
Date Wed, 04 Dec 2013 11:41:47 GMT
Specifically, your "speedup" patch. Sorry for jumping threads.

On 4 December 2013 11:41, Robert Newson <rnewson@apache.org> wrote:
> I will try to review this patch this week if I can. I've worked in the
> attachment code area before. The idea behind the speedup makes sense to
> me, but given the potential for data loss or corruption it'll need
> careful review before it can go in. One thing to note: the replicator
> verifies the md5 of attachments as they're transferred, if that helps
> with your testing.
>
> B.
>
>
> On 4 December 2013 10:06, Nick North <north.n@gmail.com> wrote:
>> As mentioned the other day, I'm hoping to add CouchDb support for chunked
>> HTTP requests that contain a document and attachments as a single
>> multipart/related MIME request, and I'm hoping the group can advise me on
>> the best coding direction. Apologies in advance for the length and detail
>> of the email, but there doesn't seem to be a shorter way to ask the
>> question with a sensible amount of background.
>>
>> Parsing multipart requests happens
>> in couch_httpd:parse_multipart_request/3. This function scans the request
>> for the MIME boundary string, reading 4KB blocks of data as needed. The
>> pieces of data between boundary strings are passed to callback functions
>> for further processing. The function to read the next block of data is an
>> argument to parse_multipart_request called DataFun; it returns the data
>> block plus the function to be used as the next DataFun. I think of this as
>> a pull-based approach: data is pulled from the request as needed, with the
>> pull returning some data and a new pull function.
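>>
>> In sketch form, I read the pull contract like this (the helper name is
>> mine, not anything in CouchDb):
>>
>>     %% A DataFun takes no arguments and returns the next block of data
>>     %% together with the DataFun to call for the block after that.
>>     make_data_fun(Req) ->
>>         fun() ->
>>             Data = couch_httpd:recv(Req, 4096),  %% pull the next 4KB block
>>             {Data, make_data_fun(Req)}
>>         end.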
>>
>> The natural extension to handle chunked requests would be to provide an
>> improved DataFun that can grab the next 4KB block from either a chunked or
>> an unchunked request. So I looked for existing support for chunked requests
>> that could be reused. The chunked equivalent of the couch_httpd:recv/2
>> function that's used to pull 4KB blocks is the couch_httpd:recv_chunked/4
>> function. This calls the Mochiweb stream_body/3 function which, it
>> transpires, was created for use in CouchDb. However, this differs in
>> philosophy from the recv function: while recv just hands back a block of
>> data, stream_body reads the whole of the request and calls a ChunkFun
>> parameter on each block of data that it reads. I think of this as a
>> push-based approach: the entire stream is read and pushed into a callback
>> function, one block at a time.
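>>
>> For contrast, the push shape is roughly this (a deliberate simplification
>> with an invented read_next_chunk helper, not the real Mochiweb code):
>>
>>     %% The caller hands over ChunkFun and the reader drives the loop,
>>     %% pushing every chunk into the callback until the body is exhausted.
>>     stream_all(Req, ChunkFun, Acc) ->
>>         case read_next_chunk(Req) of
>>             {0, _Footers} ->          %% a zero-length chunk ends the body
>>                 Acc;
>>             {Len, Chunk} when Len > 0 ->
>>                 stream_all(Req, ChunkFun, ChunkFun(Chunk, Acc))
>>         end.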
>>
>> I can think of three ways to fix the mismatch between the pull-based and
>> push-based approaches and provide chunked multipart support:
>>
>>    1. Rework parse_multipart_request to be push-based. This would allow
>>    reuse of stream_body, but at the cost of turning existing code inside out
>>    to fit with its push approach.
>>    2. Create a pull-based version of stream_body and probably try to get it
>>    incorporated into Mochiweb. But having two similar versions of the same
>>    code like this doesn't feel right.
>>    3. Convert stream_body from push-based to pull-based by spawning it in a
>>    new process that sends each block of data back to the
>>    parse_multipart_request DataFun and then blocks until the message is
>>    acknowledged. The DataFun receives the data when it needs to fetch the
>>    next block, and then sends an acknowledgement. There's a rough sketch
>>    of this below.
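>>
>> Concretely, the option 3 adapter might look something like this (all the
>> names here are mine, and I may have the exact shape of the recv_chunked
>> callback wrong):
>>
>>     %% Run the push-based reader in its own process: it pushes each
>>     %% chunk to the parent and then blocks until the parent acknowledges.
>>     spawn_streamer(Req) ->
>>         Parent = self(),
>>         spawn_link(fun() ->
>>             couch_httpd:recv_chunked(Req, 4096,
>>                 fun({0, _Footers}, _) ->
>>                         Parent ! {chunk_done, self()};
>>                    ({_Len, Chunk}, _) ->
>>                         Parent ! {chunk, self(), Chunk},
>>                         receive {ack, Parent} -> ok end
>>                 end, ok)
>>         end).
>>
>>     %% The pull side: a DataFun that receives one chunk, acknowledges it,
>>     %% and hands back the DataFun to use for the chunk after that.
>>     data_fun(Streamer) ->
>>         fun() ->
>>             receive
>>                 {chunk, Streamer, Chunk} ->
>>                     Streamer ! {ack, self()},
>>                     {Chunk, data_fun(Streamer)};
>>                 {chunk_done, Streamer} ->
>>                     {<<>>, data_fun(Streamer)}  %% empty block as end marker
>>             end
>>         end.
>>
>> The receive after each send is what gives back-pressure: only one block is
>> ever in flight between the two processes at a time.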
>>
>> The third option feels neatest and is my preferred route. But my
>> ignorance of Erlang means I don't know whether it is potentially
>> expensive. While a new process is very cheap, it would mean that all the
>> request data is copied from that process to parse_multipart_request, and
>> I don't know how costly that is. That sort of copying already goes on in
>> couch_doc:doc_from_multi_part_stream, where the parser is spawned off and
>> copies each document and attachment back to the parent process, but I
>> don't know whether that means the copying is cheap, or whether it's an
>> unavoidable evil that shouldn't be reproduced elsewhere.
>>
>> I'd really appreciate any advice that the group can give me on the best
>> option to follow, and why, or suggestions for options that I've missed
>> altogether. Thanks in advance for your help,
>>
>> Nick
