couchdb-dev mailing list archives

From Damien Katz <>
Subject Re: replication using _changes API
Date Fri, 12 Jun 2009 14:47:30 GMT

On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:

> Hi Damien, I'm not sure I follow.  My worry was that, if I built a  
> replicator which only queried _changes to get the list of updates,  
> I'd have to be prepared to process a very large response.  I thought  
> one smart way to process this response was to throttle the download  
> at the TCP level by putting the socket into passive mode.

You will have a very large response, but you can stream it, processing  
one line at a time and discarding each line before moving on to the  
next. As long as the writer is using a blocking socket and the reader  
only reads as much data as necessary to process a line, neither side  
ever needs to hold much of the data in memory. But it seems the HTTP  
client is buffering the data as it comes in, perhaps unintentionally.
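To make that concrete, here is a minimal sketch of the one-line-at-a-  
time pattern in Python (the socket pair stands in for the HTTP  
connection, and the 4096-byte chunk size is an arbitrary choice):

```python
import socket

def stream_lines(sock):
    """Yield one line at a time, reading only small chunks from the
    socket. At most one partial line is ever held in memory."""
    buf = b""
    while True:
        chunk = sock.recv(4096)  # read a small chunk, not the whole response
        if not chunk:            # peer closed the connection
            if buf:
                yield buf
            return
        buf += chunk
        while b"\n" in buf:      # carve off every complete line
            line, buf = buf.split(b"\n", 1)
            yield line

# demo: the writer side sends three newline-delimited records
a, b = socket.socketpair()
a.sendall(b'{"seq":1}\n{"seq":2}\n{"seq":3}\n')
a.close()
lines = [line for line in stream_lines(b)]
b.close()
print(lines)
```

The consumer here processes each line and drops it before the next  
read, so memory use stays bounded no matter how long the response is.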

With TCP, the sending side will only send so much data before getting  
an ACK, an acknowledgment that the packets it sent were actually  
received. When an ACK isn't received, the sender stops sending, and  
the TCP calls block at the sender (or return an error if the socket is  
in non-blocking mode) until an ACK arrives or the socket times out.
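You can watch that flow control happen with a plain socket pair: if  
the receiver never reads, the sender's buffer fills and further sends  
block (or, in non-blocking mode, fail). A rough Python sketch, with  
arbitrary buffer sizes:

```python
import socket

a, b = socket.socketpair()
# shrink the kernel buffers so backpressure kicks in quickly
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
b.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
a.setblocking(False)  # a full buffer now raises instead of blocking forever

sent = 0
blocked = False
try:
    while True:                 # b never reads, so this cannot run forever
        sent += a.send(b"x" * 4096)
except BlockingIOError:
    blocked = True              # the kernel refused more data: flow control

print("sender stopped after", sent, "bytes")
a.close()
b.close()
```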

So if you have a non-buffering reader and a blocking sender, then you  
can stream the data and only relatively small amounts of data are  
buffered at any time. The problem is that the reader in the HTTP  
client isn't waiting for the data to be demanded at all; as soon as  
data comes in, it sends it to a receiving Erlang process. Sending a  
message to an Erlang process never blocks, so there is no limit to the  
amount of data buffered. If the Erlang process can't process the data  
fast enough, it starts piling up in its mailbox, consuming unbounded  
memory.

Assuming I understand the problem correctly, the way to fix it is to  
have the HTTP client not read the data until it's demanded by the  
consuming process. Then we are only using the default TCP buffers, not  
the Erlang message queues, as a buffer, and the total amount of memory  
used at any time is small.
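Assuming that diagnosis is right, the fix amounts to a pull model:  
only issue a read when the consumer asks for the next chunk, so the  
kernel's TCP buffers are the only buffering left. A hypothetical  
sketch of that shape in Python:

```python
import socket

def on_demand(sock, bufsize=4096):
    """Pull model: a chunk is read from the kernel only when the
    consumer asks for it, so nothing piles up in user space."""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:           # peer closed the connection
            return
        yield chunk

a, b = socket.socketpair()
a.sendall(b"update " * 1000)    # 7000 bytes of pretend change records
a.close()

reader = on_demand(b)
first = next(reader)            # exactly one recv() has happened so far
rest = b"".join(reader)         # the remainder is pulled as we consume it
b.close()
print(len(first) + len(rest))
```

The generator is the opposite of the mailbox push: between calls to  
next(), unread data simply sits in the kernel buffer, where TCP flow  
control keeps it bounded.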


> I agree that the HTTP client seems to be at fault, because the  
> option that it exposes to switch to passive mode seems to be a  
> no-op.  What exactly did you mean by "streams the data while not  
> buffering the data"?  Best,
> Adam
> On Jun 12, 2009, at 8:03 AM, Damien Katz wrote:
>> I don't think this is TCP's fault, it's the HTTP client. We need an  
>> HTTP client that streams the data while not buffering it (low-level  
>> TCP already buffers some), instead of sending all the data that  
>> comes in to the waiting process, essentially buffering everything.
>> -Damien
>> On Jun 11, 2009, at 4:14 PM, Adam Kocoloski wrote:
>>> I had some time to work on a replicator that queries _changes  
>>> instead of _all_docs_by_seq today.  The first question that came  
>>> to my mind was how to put a spigot on the firehose.  If I call  
>>> _changes without a "since" query-string parameter on a 10M  
>>> document DB I'm  
>>> going to get 10M chunks of output back.
>>> I thought I might be able to control the flow at the TCP socket  
>>> level using the inets HTTP client's {stream,{self,once}} option.   
>>> I still think this would be an elegant option if I can get it to  
>>> work, but my early tests show that all the chunks still show up  
>>> immediately in the calling process regardless of whether I stream  
>>> to self or {self,once}.
>>> All for now, Adam
