couchdb-dev mailing list archives

From: Adam Kocoloski <kocol...@apache.org>
Subject: Re: replication using _changes API
Date: Fri, 12 Jun 2009 15:06:21 GMT
On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:

> On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <damien@apache.org> wrote:
>>
>>
>> On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
>>
>>> Hi Damien, I'm not sure I follow.  My worry was that, if I built a
>>> replicator which only queried _changes to get the list of updates, I'd
>>> have to be prepared to process a very large response.  I thought one
>>> smart way to process this response was to throttle the download at the
>>> TCP level by putting the socket into passive mode.
>>
>> You will have a very large response, but you can stream it, processing
>> one line at a time: you discard each line and process the next. As long
>> as the writer is using a blocking socket and the reader is only reading
>> as much data as necessary to process a line, you never need to store
>> much of the data in memory on either side. But it seems the HTTP client
>> is buffering the data as it comes in, perhaps unintentionally.
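
A minimal sketch of such a non-buffering reader, assuming a raw TCP
connection with no TLS or chunked transfer coding in the way (Host, Port,
and handle_line/1 are placeholders):

    read_lines(Host, Port) ->
        %% Passive mode plus {packet, line}: each recv pulls exactly one
        %% line off the socket, and the kernel's TCP window throttles the
        %% sender whenever this process falls behind.
        {ok, Sock} = gen_tcp:connect(Host, Port,
                                     [binary, {packet, line}, {active, false}]),
        read_loop(Sock).

    read_loop(Sock) ->
        case gen_tcp:recv(Sock, 0) of
            {ok, Line} ->
                handle_line(Line),   %% process the line, then discard it
                read_loop(Sock);
            {error, closed} ->
                ok
        end.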
>>
>> With TCP, the sending side will only send so much data before getting
>> an ACK, an acknowledgment that the packets sent were actually received.
>> When an ACK isn't received, the sender stops sending, and the TCP calls
>> will block at the sender (or return an error if the socket is in
>> non-blocking mode) until it gets a response or the socket times out.
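
On the sending side that backpressure comes for free; a sketch, assuming
the writer holds an ordinary blocking gen_tcp socket:

    send_lines(Sock, Lines) ->
        %% gen_tcp:send/2 stops returning once the peer's receive window
        %% and the local buffers fill up, so a slow reader throttles this
        %% writer with no extra logic.
        lists:foreach(fun(Line) -> ok = gen_tcp:send(Sock, Line) end, Lines).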
>>
>> So if you have a non-buffering reader and a blocking sender, then you
>> can stream the data and only relatively small amounts of data are
>> buffered at any time. The problem is the reader in the HTTP client
>> isn't waiting for the data to be demanded at all; instead, as soon as
>> data comes in, it sends it to a receiving Erlang process. Message sends
>> between Erlang processes never block, so there is no limit to the
>> amount of data buffered. If the Erlang process can't process the data
>> fast enough, it starts piling up in its mailbox, consuming unlimited
>> memory.
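
For contrast, the problematic shape is roughly the following; with an
{active, true} socket, every incoming segment becomes a mailbox message
immediately, whether or not the loop keeps up (slow_handler/1 is a
hypothetical slow consumer):

    flood_loop(Sock) ->
        receive
            {tcp, Sock, Data} ->
                slow_handler(Data),   %% while this runs, more {tcp, ...}
                flood_loop(Sock);     %% messages pile up in the mailbox
            {tcp_closed, Sock} ->
                ok
        end.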
>>
>> Assuming I understand the problem correctly, the way to fix it is to
>> have the HTTP client not read the data until it's demanded by the
>> consuming process. Then we are only using the default TCP buffers, not
>> the Erlang message queues, as a buffer, and the total amount of memory
>> used at any time is small.
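
The demand-driven version of that loop is a small change; a sketch using
{active, once}, where the socket is re-armed only after each chunk has
been consumed:

    demand_loop(Sock) ->
        ok = inet:setopts(Sock, [{active, once}]),  %% ask for one message
        receive
            {tcp, Sock, Data} ->
                handle_data(Data),    %% only after processing do we re-arm
                demand_loop(Sock);
            {tcp_closed, Sock} ->
                ok
        end.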
>>
>
> Dunno about HTTP clients, but when I was playing around with gen_tcp a
> week or two ago, I found an option when opening a socket, something like
> {active, false}, that affects this specific functionality. Active
> sockets deliver TCP data as Erlang messages; passive sockets don't, and
> you have to pull the data with gen_tcp:recv(Sock).
>
> I haven't the foggiest if the HTTP bits expose any of that though.

As far as I can tell, the {stream,{self,once}} option translates to an
inet:setopts(socket(), [{active,once}]), which accomplishes the same basic
goal as {active,false}, just with repeated calls to setopts(Sock,
[{active,once}]) instead of gen_tcp:recv(Sock).  I must be missing
something, though, because clearly I'm getting more messages than I asked
for.
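
If I'm reading the inets documentation right, the intended flow with
{stream, {self, once}} is roughly the following (sketched against the
httpc module names from the docs, and handle_part/1 is a placeholder):

    fetch(Url) ->
        %% assumes the inets application is already started
        {ok, ReqId} = httpc:request(get, {Url, []}, [],
                                    [{sync, false}, {stream, {self, once}}]),
        receive
            {http, {ReqId, stream_start, _Headers, Pid}} ->
                stream(ReqId, Pid)
        end.

    stream(ReqId, Pid) ->
        ok = httpc:stream_next(Pid),   %% explicitly demand one more chunk
        receive
            {http, {ReqId, stream, BodyPart}} ->
                handle_part(BodyPart),
                stream(ReqId, Pid);
            {http, {ReqId, stream_end, _Headers}} ->
                ok
        end.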

I'm sure I could cook up something simple using gen_tcp directly, but even
then I'll have to deal with authentication, SSL, etc., so I'd prefer to
use a full-fledged HTTP client if I can get it to work.  Best,

Adam

