couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: replication using _changes API
Date Fri, 12 Jun 2009 14:59:47 GMT
On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz<damien@apache.org> wrote:
>
>
> On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
>
>> Hi Damien, I'm not sure I follow.  My worry was that, if I built a
>> replicator which only queried _changes to get the list of updates, I'd have
>> to be prepared to process a very large response.  I thought one smart way to
>> process this response was to throttle the download at the TCP level by
>> putting the socket into passive mode.
>
> You will have a very large response, but you can stream it, processing one
> line at a time and discarding each line before moving on to the next. As
> long as the writer is using a blocking socket and the reader is only reading
> as much data as necessary to process a line, you never need to store much of
> the data in memory on either side. But it seems the HTTP client is buffering
> the data as it comes in, perhaps unintentionally.
>
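
A minimal sketch of that kind of non-buffering reader, using a raw gen_tcp
connection in passive, line-oriented mode (module and connection details here
are illustrative, not CouchDB code):

    -module(line_reader).
    -export([read_lines/2]).

    %% Connect in passive mode; {packet, line} makes each recv return
    %% exactly one line, so we never hold more than a line in memory.
    read_lines(Host, Port) ->
        {ok, Sock} = gen_tcp:connect(Host, Port,
                                     [binary, {active, false}, {packet, line}]),
        loop(Sock).

    loop(Sock) ->
        case gen_tcp:recv(Sock, 0) of
            {ok, Line} ->
                handle(Line),          %% process one line, then discard it
                loop(Sock);
            {error, closed} ->
                ok
        end.

    handle(_Line) -> ok.               %% stand-in for real processing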
> With TCP, the sending side will only send so much data before getting an
> ACK (an acknowledgment that the packets it sent were actually received).
> When an ACK isn't received, the sender stops sending, and the TCP calls
> block on the sending side (or return an error if the socket is in
> non-blocking mode) until it gets a response or the socket times out.
>
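
On the sending side that backpressure needs no special handling; a sketch,
assuming an already-connected socket:

    -module(blocking_sender).
    -export([send_all/2]).

    %% gen_tcp:send/2 does not return while the local send buffer is
    %% full, i.e. while the peer hasn't drained earlier data, so a slow
    %% reader automatically throttles this loop.
    send_all(_Sock, []) ->
        ok;
    send_all(Sock, [Chunk | Rest]) ->
        ok = gen_tcp:send(Sock, Chunk),    %% blocks if the reader is slow
        send_all(Sock, Rest).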
> So if you have a non-buffering reader and a blocking sender, then you can
> stream the data and only relatively small amounts of data are buffered at
> any time. The problem is that the reader in the HTTP client isn't waiting
> for the data to be demanded at all; as soon as data comes in, it sends it
> to a receiving Erlang process. Sends to an Erlang process never block, so
> there is no limit to the amount of data buffered. If the Erlang process
> can't process the data fast enough, it piles up in its mailbox, consuming
> unlimited memory.
>
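
That failure mode looks something like this (a sketch, with handle/1 standing
in for whatever slow processing the consumer does):

    -module(active_reader).
    -export([loop/1]).

    %% With a fully active socket, every incoming TCP segment is
    %% delivered to the mailbox immediately. If handle/1 is slower than
    %% the network, the mailbox grows without bound.
    loop(Sock) ->
        receive
            {tcp, Sock, Data} ->
                handle(Data),
                loop(Sock);
            {tcp_closed, Sock} ->
                ok
        end.

    handle(_Data) -> ok.   %% stand-in for real (slow) processing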
> Assuming I understand the problem correctly, the way to fix it is to have
> the HTTP client not read the data until it's demanded by the consuming
> process. Then we are only using the default TCP buffers, not the Erlang
> message queues, as a buffer, and the total amount of memory used at any
> time is small.
>
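
The usual Erlang idiom for that is {active, once}: the socket delivers exactly
one message and then goes quiet until the consumer re-arms it, so TCP flow
control pushes back on the sender while the consumer is busy. A sketch:

    -module(once_reader).
    -export([loop/1]).

    %% Re-arm the socket only after the previous chunk is fully
    %% processed; in between, unread data stays in the kernel's TCP
    %% buffers and the sender is throttled.
    loop(Sock) ->
        ok = inet:setopts(Sock, [{active, once}]),
        receive
            {tcp, Sock, Data} ->
                handle(Data),          %% only after this do we ask for more
                loop(Sock);
            {tcp_closed, Sock} ->
                ok
        end.

    handle(_Data) -> ok.               %% stand-in for real processing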

Dunno about HTTP clients, but when I was playing around with gen_tcp a
week or two ago I found an option when opening a socket, something like
{active, false}, that affects this specific functionality. Active sockets
send TCP data as Erlang messages; passive sockets don't, and you have to
pull the data yourself with gen_tcp:recv/2.

I haven't the foggiest if the HTTP bits expose any of that though.
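
For what it's worth, the inets client is documented to expose something like
this through its async API: request with {sync, false} and
{stream, {self, once}}, then pull each body part explicitly with
httpc:stream_next/1 (the module is named httpc in later OTP releases; the
R13-era inets "http" module's names may differ). A sketch:

    -module(stream_fetch).
    -export([fetch/1]).

    fetch(Url) ->
        ok = case inets:start() of
                 ok -> ok;
                 {error, {already_started, inets}} -> ok
             end,
        {ok, RequestId} =
            httpc:request(get, {Url, []}, [],
                          [{sync, false}, {stream, {self, once}}]),
        receive
            {http, {RequestId, stream_start, _Headers, Pid}} ->
                stream_loop(RequestId, Pid)
        end.

    stream_loop(RequestId, Pid) ->
        ok = httpc:stream_next(Pid),   %% demand exactly one more chunk
        receive
            {http, {RequestId, stream, BodyPart}} ->
                handle(BodyPart),
                stream_loop(RequestId, Pid);
            {http, {RequestId, stream_end, _Headers}} ->
                ok
        end.

    handle(_BodyPart) -> ok.           %% stand-in for real processing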

> -Damien
>
>>
>>
>> I agree that the HTTP client seems to be at fault, because the option that
>> it exposes to switch to passive mode seems to be a no-op.  What exactly did
>> you mean by "streams the data while not buffering the data"?  Best,
>>
>> Adam
>>
>> On Jun 12, 2009, at 8:03 AM, Damien Katz wrote:
>>
>>> I don't think this is TCP's fault, it's the HTTP client. We need an HTTP
>>> client that streams data while not buffering the data (low-level TCP
>>> already buffers some), instead of sending all the data that comes in to
>>> the waiting process, essentially buffering everything.
>>>
>>> -Damien
>>>
>>>
>>> On Jun 11, 2009, at 4:14 PM, Adam Kocoloski wrote:
>>>
>>>> I had some time to work on a replicator that queries _changes instead of
>>>> _all_docs_by_seq today.  The first question that came to my mind was how
>>>> to put a spigot on the firehose.  If I call _changes without a "since" qs
>>>> parameter on a 10M document DB I'm going to get 10M chunks of output back.
>>>>
>>>> I thought I might be able to control the flow at the TCP socket level
>>>> using the inets HTTP client's {stream,{self,once}} option.  I still think
>>>> this would be an elegant option if I can get it to work, but my early tests
>>>> show that all the chunks still show up immediately in the calling process
>>>> regardless of whether I stream to self or {self,once}.
>>>>
>>>> All for now, Adam
>>>
>>
>
>
