couchdb-dev mailing list archives

From Damien Katz <dam...@apache.org>
Subject Re: progress on streaming attachments during replication
Date Fri, 13 Feb 2009 00:24:05 GMT

On Feb 12, 2009, at 7:01 PM, Adam Kocoloski wrote:

> Hi devs, I spent a good bit of time over the last two days on  
> attachment replication.  I started with pull replication since I had  
> a pretty clear idea of what I wanted to do there:
>
> a) stop inlining attachments in the JSON document body
>
> b) map over the attachment stubs in the source document, submitting  
> an async HTTP request for each,
>
> c) replace the stub with a function that's streaming-API-compatible  
> (meaning it looks like F() -> binary() and can be called repeatedly  
> until all data has been returned).  In this case the function is  
> just a wrapper around a receive statement.  Damien's streaming  
> attachment API takes it from there.
>
> Well, I got all that written and working, but I ran into some  
> trouble with ibrowse's async HTTP requests:
>
> * it doesn't look like ibrowse supports any flow control.  You tell  
> it to stream a response to a process, and it just opens up the  
> firehose and sends messages to that process until the response is  
> complete.
>
> * ibrowse sends a message for each received packet.  I tried this  
> code out with a 32MB attachment and got 20k messages in my mailbox.   
> Combine that with the lack of flow control and the writer mailbox  
> blows up pretty quickly.
>
> * Less important, but ibrowse sends the data as lists of bytes  
> rather than binaries.  Seems like a lot of unnecessary copying to me.
>
> Anyone know of a mailing list for ibrowse, or do we just email  
> Chandrashekhar directly?  It'd be good to get some confirmation from  
> him on this.
>
> I also took a look at inets' async support and found that it worked  
> quite a bit better -- it has flow control with the {self,once}  
> option, it sends 1 message / chunk (CouchDB default chunk size =  
> 1MB), and it sends that message as a binary (so no copying).
>
> However, inets also had some problems.  I saw that the VM memory  
> usage still climbed pretty quickly when replicating a big  
> attachment, and etop told me that it was all in binaries.  I tried  
> process_info(Pid, binary) and found that the httpc_handler process  
> spawned for that attachment request was keeping a reference to each  
> binary chunk.  At least, that's what it looked like to me -- I  
> didn't find any documentation on the BinInfo tuples returned by  
> process_info() so I took a guess that they were {UniqueID, Size,  
> NRefs}.
>
> I was able to replicate GB-sized attachments with the inets async  
> code.  Unfortunately, the Erlang VM took all my free memory and had  
> a VSIZE of ~500 MB when it finished.  I tried tossing  
> garbage_collect() in the couch_db, couch_stream, and couch_file  
> processes, but it seems the problem is really in the inets  
> httpc_handler.  Nothing else was keeping a reference to the old  
> binaries.  Anybody know of additional tricks for debugging Erlang  
> memory utilization in general and binary reference counting in  
> particular?
>
> Sorry for the long post.  Best, Adam

Woot! This is awesome, Adam. Sorry, I don't have any answers on the
HTTP client stuff. Maybe we should check on the Erlang list for
available options.

-Damien
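
A minimal sketch of the pull-style streaming function described in point (c) of the quoted message, combined with the one-message-per-request flow control that the quoted message attributes to inets. It is written in Python rather than Erlang purely for illustration. `OnceSource`, `stream_next`, `make_reader`, and the convention that empty bytes mean "no more data" are all hypothetical names and assumptions, not part of ibrowse, inets, or CouchDB.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Optional


class OnceSource:
    """Hypothetical stand-in for an HTTP client that delivers exactly one
    message each time the consumer asks for more data."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks: Iterator[bytes] = iter(chunks)
        # Models the receiving process's mailbox.
        self.mailbox: deque = deque()
        # High-water mark of the mailbox, kept only for inspection.
        self.max_mailbox = 0

    def stream_next(self) -> None:
        """Deliver one more message; None marks the end of the stream."""
        chunk: Optional[bytes] = next(self._chunks, None)
        self.mailbox.append(chunk)
        self.max_mailbox = max(self.max_mailbox, len(self.mailbox))


def make_reader(source: OnceSource) -> Callable[[], bytes]:
    """Return a function F() -> bytes that can be called repeatedly.

    Assumption: F returns b"" once all data has been returned, so the
    source must not contain empty chunks.
    """
    done = False

    def read() -> bytes:
        nonlocal done
        if done:
            return b""
        # Ask for one chunk only when the consumer is ready for it.
        source.stream_next()
        # Plays the role of the receive statement.
        chunk = source.mailbox.popleft()
        if chunk is None:
            done = True
            return b""
        return chunk

    return read


# Usage: drain the reader the way a streaming writer might.
source = OnceSource([b"aaaa", b"bbbb", b"cc"])
read = make_reader(source)
received = []
while True:
    data = read()
    if not data:
        break
    received.append(data)
```

Because the reader requests a chunk only when it is called, the modeled mailbox never holds more than one message at a time, in contrast to a sender that pushes every packet as soon as it arrives.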
