On Feb 12, 2009, at 7:01 PM, Adam Kocoloski wrote:
> Hi devs, I spent a good bit of time over the last two days on
> attachment replication. I started with pull replication since I had
> a pretty clear idea of what I wanted to do there:
>
> a) stop inlining attachments in the JSON document body
>
> b) map over the attachment stubs in the source document, submitting
> an async HTTP request for each,
>
> c) replace the stub with a function that's streaming-API-compatible
> (meaning it looks like F() -> binary() and can be called repeatedly
> until all data has been returned). In this case the function is
> just a wrapper around a receive statement. Damien's streaming
> attachment API takes it from there.
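
A minimal sketch of the kind of function step (c) describes: a zero-arity fun that wraps a receive and returns the next binary each time it is called. The message shapes ({Ref, chunk, Bin} and {Ref, done}), the empty-binary end marker, and the module and function names are assumptions made for illustration; they are not taken from the CouchDB source.

```erlang
-module(attachment_stream_sketch).
-export([make_reader/1, demo/0]).

%% Build a fun of the shape F() -> binary(). Each call blocks until the
%% next chunk tagged with Ref arrives in the caller's mailbox.
%% Assumption: the sender marks the end of the data with {Ref, done},
%% which this sketch reports to the caller as an empty binary.
make_reader(Ref) ->
    fun() ->
        receive
            {Ref, chunk, Bin} when is_binary(Bin) -> Bin;
            {Ref, done} -> <<>>
        after 5000 ->
            erlang:error(timeout)
        end
    end.

%% Stand-in for the HTTP client: a process that sends two chunks and
%% then the end marker. Returns the reassembled body.
demo() ->
    Ref = make_ref(),
    Parent = self(),
    spawn(fun() ->
        Parent ! {Ref, chunk, <<"hello ">>},
        Parent ! {Ref, chunk, <<"world">>},
        Parent ! {Ref, done}
    end),
    collect(make_reader(Ref), []).

%% Call the reader repeatedly until it signals the end of the data.
collect(Reader, Acc) ->
    case Reader() of
        <<>> -> iolist_to_binary(lists:reverse(Acc));
        Bin  -> collect(Reader, [Bin | Acc])
    end.
```

Note that the fun must be called from the process that receives the messages, since receive only reads the caller's own mailbox.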
>
> Well, I got all that written and working, but I ran into some
> trouble with ibrowse's async HTTP requests:
>
> * It doesn't look like ibrowse supports any flow control. You tell
> it to stream a response to a process, and it just opens up the
> firehose and sends messages to that process until the response is
> complete.
>
> * ibrowse sends a message for each received packet. I tried this
> code out with a 32MB attachment and got 20k messages in my mailbox.
> Combine that with the lack of flow control and the writer mailbox
> blows up pretty quickly.
>
> * Less important, but ibrowse sends the data as lists of bytes
> rather than binaries. Seems like a lot of unnecessary copying to me.
>
> Anyone know of a mailing list for ibrowse, or do we just email
> Chandrashekhar directly? It'd be good to get some confirmation from
> him on this.
>
> I also took a look at inets' async support and found that it worked
> quite a bit better -- it has flow control with the {self,once}
> option, it sends one message per chunk (CouchDB's default chunk size
> is 1MB), and it delivers each chunk as a binary (so no copying).
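
A rough sketch of the {self,once} flow control mentioned above, written against the httpc module as it exists in current OTP releases (the code discussed in this thread may have used the older http module name). The module name, the 30 second timeouts, and the choice to only count bytes are assumptions for illustration. Url is expected to be a string, and an https URL would also need the ssl application started.

```erlang
-module(inets_stream_sketch).
-export([fetch/1]).

%% Fetch Url asynchronously, pulling one chunk at a time.
%% Returns {ok, TotalBytes} or {error, Reason}.
fetch(Url) ->
    {ok, _} = application:ensure_all_started(inets),
    {ok, ReqId} = httpc:request(get, {Url, []}, [],
                                [{sync, false},
                                 {stream, {self, once}},
                                 {body_format, binary}]),
    receive
        %% In {self, once} mode the first message carries the handler
        %% pid, which is what stream_next/1 needs.
        {http, {ReqId, stream_start, _Headers, HandlerPid}} ->
            ok = httpc:stream_next(HandlerPid),
            loop(ReqId, HandlerPid, 0);
        %% Errors and non-streamed responses arrive as a single
        %% {http, {ReqId, Result}} message.
        {http, {ReqId, {error, Reason}}} ->
            {error, Reason};
        {http, {ReqId, Other}} ->
            {error, {not_streamed, Other}}
    after 30000 ->
        {error, timeout}
    end.

loop(ReqId, HandlerPid, Bytes) ->
    receive
        {http, {ReqId, stream, Chunk}} ->
            %% A real consumer would write Chunk out here, and only
            %% then ask for the next one. That request is the flow
            %% control: no new chunk is sent until we ask.
            ok = httpc:stream_next(HandlerPid),
            loop(ReqId, HandlerPid, Bytes + byte_size(Chunk));
        {http, {ReqId, stream_end, _Headers}} ->
            {ok, Bytes};
        {http, {ReqId, {error, Reason}}} ->
            {error, Reason}
    after 30000 ->
        {error, timeout}
    end.
```

Because at most one chunk is outstanding at a time, the receiving mailbox stays small regardless of the attachment size.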
>
> However, inets also had some problems. I saw that the VM memory
> usage still climbed pretty quickly when replicating a big
> attachment, and etop told me that it was all in binaries. I tried
> process_info(Pid, binary) and found that the httpc_handler process
> spawned for that attachment request was keeping a reference to each
> binary chunk. At least, that's what it looked like to me -- I
> didn't find any documentation on the BinInfo tuples returned by
> process_info() so I took a guess that they were {UniqueID, Size,
> NRefs}.
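
For what it's worth, the current OTP documentation for process_info/2 describes the binary item as a list of tuples holding a binary id, its size, and its reference count, while warning that the format may change without notice, so the guess above appears to match. A small sketch, under that assumption, of summarising what a process is holding on to (module and function names are hypothetical):

```erlang
-module(binary_refs_sketch).
-export([summary/1]).

%% Summarise the off-heap binaries referenced by Pid.
%% Assumption: each BinInfo entry is {Id, Size, RefCount}, which the
%% documentation describes but does not guarantee.
%% Returns {Count, TotalBytes}, or undefined if the process is dead.
summary(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, BinInfo} ->
            %% The same binary can be listed more than once, so
            %% deduplicate before summing sizes.
            Unique = lists:usort(BinInfo),
            Total = lists:sum([Size || {_Id, Size, _RefCount} <- Unique]),
            {length(Unique), Total};
        undefined ->
            undefined
    end.
```

Calling summary/1 on a suspect process before and after erlang:garbage_collect(Pid) is one way to see whether it is the one pinning the chunks.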
>
> I was able to replicate GB-sized attachments with the inets async
> code. Unfortunately, the Erlang VM took all my free memory and had
> a VSIZE of ~500 MB when it finished. I tried tossing
> garbage_collect() in the couch_db, couch_stream, and couch_file
> processes, but it seems the problem is really in the inets
> httpc_handler. Nothing else was keeping a reference to the old
> binaries. Anybody know of additional tricks for debugging Erlang
> memory utilization in general and binary reference counting in
> particular?
>
> Sorry for the long post. Best, Adam
Woot! This is awesome Adam. Sorry, I don't have any answers on the
http client stuff. Maybe we should check on the Erlang list for
available options.
-Damien