couchdb-dev mailing list archives

From Antony Blakey <antony.bla...@gmail.com>
Subject Re: Attachment Replication Problem
Date Sat, 16 May 2009 00:16:10 GMT

On 15/05/2009, at 2:44 PM, Antony Blakey wrote:

> I have a 3.5G Couchdb database, consisting of 1000 small documents,  
> each with many attachments (0-30 per document), each attachment  
> varying wildly in size (1K..10M).
>
> To test replication I am running a server on my MBPro and another  
> under Ubuntu in VMWare on the same machine. I'm testing using a pure  
> trunk.
>
> Doing a pull-replicate from OSX to Linux fails to complete. The  
> point at which it fails is constant. I've added some debug logs into  
> couch_rep/attachment_loop like this: http://gist.github.com/112070  
> and made the suggested "couch_util:should_flush(1000)" mod to try  
> and guarantee progress (but to no avail). The debug output shows  
> this: http://gist.github.com/112069 and the document it seems to  
> fail on is this: http://gist.github.com/112074 . I'm only just  
> starting to look at this - any pointers would be appreciated.

I put some more logging in attachment_loop, specifically this:

         {ibrowse_async_response, ReqId, Data} ->
             ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data A ~p", [Url]),
             receive {From, gimme_data} -> From ! {self(), Data} end,
             ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data B ~p", [Url]),
             attachment_loop(ReqId);

The result of this is to see an enormous number of 'Data A' logs  
without a corresponding 'Data B'. This happens because  
make_attachment_stub_receiver uses a promise to read the data, created  
like this:

         ResponseCode >= 200, ResponseCode < 300 ->
             % the normal case
             Pid ! {self(), continue},
             %% this function goes into the streaming attachment code.
             %% It gets executed by the replication gen_server, so it
             %% can't be the one to actually receive the ibrowse data.
             {ok, fun() ->
                 Pid ! {self(), gimme_data},
                 receive {Pid, Data} -> Data end
             end};

It seems that the promise is forced (i.e. the data is read) only when the
documents are checkpointed. If, as in my case, you have lots of small
documents with many attachments, this results in massive numbers of
open connections to download the attachments, each blocked reading the
first bit of data from the first chunk, because the checkpointing
occurs by default after 10MB of document data has been read, excluding
attachments. In any case, purely using size as a trigger won't work if
you have lots of small documents with lots of small attachments. It
would seem that the checkpointing, and hence the forcing of the
HTTP-reading promises, also needs to account for the number of promises.
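
As a rough sketch of what I mean by accounting for the number of promises: the flush test could take a count of unforced attachment promises as well as the byte total. Everything here is hypothetical (the module name, the three-argument should_flush, and both limits); it is not the existing couch_util:should_flush, just an illustration of the idea.

```erlang
-module(flush_sketch).
-export([should_flush/3]).

%% Hypothetical helper for illustration only; this is NOT CouchDB's
%% couch_util:should_flush. Both limits are assumed, tunable values.
%%
%% BytesRead:   document bytes buffered since the last checkpoint
%% PendingAtts: attachment promises created but not yet forced
should_flush(BytesRead, _PendingAtts, {MaxBytes, _MaxPending})
  when BytesRead >= MaxBytes ->
    true;   % size trigger, as in the current size-only behaviour
should_flush(_BytesRead, PendingAtts, {_MaxBytes, MaxPending})
  when PendingAtts >= MaxPending ->
    true;   % too many open, blocked attachment connections
should_flush(_BytesRead, _PendingAtts, _Limits) ->
    false.
```

With limits of, say, {10485760, 50}, a run of tiny documents that together carry 50 attachments would trigger a checkpoint long before 10MB of document data had been read, which bounds the number of connections left blocked on their first chunk.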

To overcome this problem I used couch_util:should_flush(1) to ensure  
that each document would be checkpointed, but that simply demonstrated  
that this isn't the cause of the 100% repeatable replication hang that  
I have. Now I get a log trace like this: http://gist.github.com/112512  
(ignoring the crap at the end of each log statement, which is my  
incomplete attempt to link each log to the associated URL).

Anyone with any thoughts?

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

What can be done with fewer [assumptions] is done in vain with more
   -- William of Ockham (ca. 1285-1349)



