couchdb-dev mailing list archives

From Adam Kocoloski
Subject progress on streaming attachments during replication
Date Fri, 13 Feb 2009 00:01:58 GMT
Hi devs, I spent a good bit of time over the last two days on  
attachment replication.  I started with pull replication since I had a  
pretty clear idea of what I wanted to do there:

a) stop inlining attachments in the JSON document body,

b) map over the attachment stubs in the source document, submitting an  
async HTTP request for each,

c) replace the stub with a function that's streaming-API-compatible  
(meaning it looks like F() -> binary() and can be called repeatedly  
until all data has been returned).  In this case the function is just  
a wrapper around a receive statement.  Damien's streaming attachment  
API takes it from there.
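
A minimal sketch of the kind of wrapper described in (c), not the actual replicator code. The module name, the {chunk, ReqId, Bin} message shape, the 30 second timeout, and the idea of reading until a known length has arrived are all assumptions made for illustration:

```erlang
-module(att_stream_sketch).
-export([make_reader/1, read_all/3, demo/0]).

%% Returns a fun of arity 0 that blocks until the next body chunk for
%% ReqId shows up in the calling process's mailbox and returns it.
%% The {chunk, ReqId, Bin} message shape is hypothetical.
make_reader(ReqId) ->
    fun() ->
        receive
            {chunk, ReqId, Bin} when is_binary(Bin) -> Bin
        after 30000 ->
            exit({attachment_timeout, ReqId})
        end
    end.

%% Calls the reader repeatedly until Remaining bytes have been returned.
%% Assumes the total length is known up front (e.g. from the stub).
read_all(_Read, 0, Acc) ->
    Acc;
read_all(Read, Remaining, Acc) when Remaining > 0 ->
    Bin = Read(),
    read_all(Read, Remaining - byte_size(Bin), <<Acc/binary, Bin/binary>>).

%% Simulates two chunks arriving, then drains them through the reader.
demo() ->
    ReqId = make_ref(),
    self() ! {chunk, ReqId, <<"hello ">>},
    self() ! {chunk, ReqId, <<"world">>},
    read_all(make_reader(ReqId), 11, <<>>).
```

The caller never sees the mailbox; it only sees a fun it can call again and again, which is what lets a receive loop sit behind a streaming-style interface.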

Well, I got all that written and working, but I ran into some trouble  
with ibrowse's async HTTP requests:

* it doesn't look like ibrowse supports any flow control.  You tell it  
to stream a response to a process, and it just opens up the firehose  
and sends messages to that process until the response is complete.

* ibrowse sends a message for each received packet.  I tried this code  
out with a 32MB attachment and got 20k messages in my mailbox.   
Combine that with the lack of flow control and the writer mailbox  
blows up pretty quickly.

* Less important, but ibrowse sends the data as lists of bytes rather  
than binaries.  Seems like a lot of unnecessary copying to me.
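
To make the last two points concrete, here is a rough sketch (not ibrowse code) that floods an idle process the way an unthrottled sender would and then looks at its mailbox. The packet size, message shape, and the message count of 1000 (kept small to keep memory use modest) are made up for the demo:

```erlang
-module(mailbox_sketch).
-export([demo/0]).

%% Sends N list-of-bytes "packets" to a process that never reads them,
%% then reports the mailbox length and the per-packet memory cost of a
%% list versus a binary.
demo() ->
    N = 1000,
    Writer = spawn(fun idle/0),
    Packet = lists:duplicate(1460, 0),          % one packet as a list of bytes
    lists:foreach(fun(_) -> Writer ! {packet, Packet} end, lists:seq(1, N)),
    {message_queue_len, Len} = process_info(Writer, message_queue_len),
    %% Each list element occupies a two-word cons cell.
    ListBytes = length(Packet) * 2 * erlang:system_info(wordsize),
    BinBytes = byte_size(list_to_binary(Packet)),
    exit(Writer, kill),
    {Len, ListBytes, BinBytes}.

%% Stands in for a writer that cannot keep up: it never matches a packet.
idle() ->
    receive stop -> ok end.
```

With nothing to slow the sender down, every packet just sits in the queue, and each one is carried as a list that is many times larger than the bytes it represents.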

Anyone know of a mailing list for ibrowse, or do we just email  
Chandrashekhar directly?  It'd be good to get some confirmation from  
him on this.

I also took a look at inets' async support and found that it worked  
quite a bit better -- it has flow control with the {self,once} option,  
it sends one message per chunk (CouchDB default chunk size = 1MB), and it  
sends that message as a binary (so no copying).
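
For reference, a sketch of what the {self,once} flow looks like, written against the httpc module as it exists in current OTP releases (in early 2009 the client module was called http). The module name, the timeout, and the decision to only count bytes instead of writing them anywhere are assumptions for the example:

```erlang
-module(inets_stream_sketch).
-export([fetch/1]).

%% Streams the body at Url one chunk at a time and returns {ok, TotalBytes}.
%% Url is a string; any URL used with this is the caller's own.
fetch(Url) ->
    {ok, _} = application:ensure_all_started(inets),
    {ok, ReqId} = httpc:request(get, {Url, []}, [],
                                [{sync, false}, {stream, {self, once}}]),
    receive
        {http, {ReqId, stream_start, _Headers, HandlerPid}} ->
            %% Nothing more is sent until we ask for it.
            ok = httpc:stream_next(HandlerPid),
            loop(ReqId, HandlerPid, 0);
        {http, {ReqId, {error, Reason}}} ->
            {error, Reason};
        {http, {ReqId, {{_Vsn, Status, _Phrase}, _Hdrs, _Body}}} ->
            %% Responses that are not streamed arrive as one message.
            {error, {status, Status}}
    after 30000 ->
        {error, timeout}
    end.

loop(ReqId, HandlerPid, Total) ->
    receive
        {http, {ReqId, stream, Bin}} ->
            %% A real consumer would write Bin out here, and only then
            %% request the next chunk. That ordering is the flow control.
            ok = httpc:stream_next(HandlerPid),
            loop(ReqId, HandlerPid, Total + byte_size(Bin));
        {http, {ReqId, stream_end, _Headers}} ->
            {ok, Total};
        {http, {ReqId, {error, Reason}}} ->
            {error, Reason}
    after 30000 ->
        {error, timeout}
    end.
```

Because the consumer calls stream_next/1 only after it has dealt with a chunk, at most one chunk per request should be waiting in its mailbox at a time.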

However, inets also had some problems.  I saw that the VM memory usage  
still climbed pretty quickly when replicating a big attachment, and  
etop told me that it was all in binaries.  I tried process_info(Pid,  
binary) and found that the httpc_handler process spawned for that  
attachment request was keeping a reference to each binary chunk.  At  
least, that's what it looked like to me -- I didn't find any  
documentation on the BinInfo tuples returned by process_info() so I  
took a guess that they were {UniqueID, Size, NRefs}.
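
A small sketch of that check. It leans on the same guess about the tuple layout, {UniqueID, Size, NRefs}, so treat the numbers it produces as indicative only. The module and function names are made up:

```erlang
-module(bin_refs_sketch).
-export([held/1, demo/0]).

%% Returns {Count, TotalBytes} for the binaries process_info/2 reports
%% for Pid. ASSUMPTION: each entry is {Id, Size, RefCount}; that layout
%% is a guess, not something taken from documentation.
held(Pid) ->
    case process_info(Pid, binary) of
        {binary, Infos} ->
            Sizes = [Size || {_Id, Size, _Refs} <- Infos],
            {length(Sizes), lists:sum(Sizes)};
        undefined ->
            dead
    end.

%% Spawns a process that hangs on to a 1 MB binary, then inspects it.
demo() ->
    Parent = self(),
    Holder = spawn(fun() ->
        receive {keep, Bin} -> Parent ! holding, hold(Bin) end
    end),
    Holder ! {keep, binary:copy(<<0>>, 1024 * 1024)},
    receive holding -> ok end,
    Result = held(Holder),
    exit(Holder, kill),
    Result.

%% Keeps Bin live by using it after the receive.
hold(Bin) ->
    receive stop -> byte_size(Bin) end.
```

Running held/1 against a suspect pid at intervals shows whether the count and total keep growing while an attachment is being transferred.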

I was able to replicate GB-sized attachments with the inets async  
code.  Unfortunately, the Erlang VM took all my free memory and had a  
VSIZE of ~500 MB when it finished.  I tried tossing garbage_collect()  
in the couch_db, couch_stream, and couch_file processes, but it seems  
the problem is really in the inets httpc_handler.  Nothing else was  
keeping a reference to the old binaries.  Anybody know of additional  
tricks for debugging Erlang memory utilization in general and binary  
reference counting in particular?
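
For what it's worth, this is roughly the kind of probe I mean by tossing in garbage_collect(), sketched as a helper. It is only a rough measure, since other processes can allocate or free binaries between the two readings, and the module name is made up:

```erlang
-module(gc_probe_sketch).
-export([gc_delta/1]).

%% Forces a garbage collection of Pid and returns how many bytes the
%% VM-wide binary allocation dropped by, or dead if Pid is gone.
%% A result near zero suggests Pid was not what kept the binaries alive.
gc_delta(Pid) ->
    Before = erlang:memory(binary),
    case erlang:garbage_collect(Pid) of
        true -> Before - erlang:memory(binary);
        false -> dead
    end.
```

Calling gc_delta/1 on each process in the attachment path in turn gives a crude picture of which one, if any, was pinning the chunks.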

Sorry for the long post.  Best, Adam
