couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: replication using _changes API
Date Fri, 12 Jun 2009 22:38:00 GMT
On Jun 12, 2009, at 5:56 PM, Chris Anderson wrote:

> On Fri, Jun 12, 2009 at 8:06 AM, Adam Kocoloski <kocolosk@apache.org> wrote:
>> On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
>>
>>> On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <damien@apache.org> wrote:
>>>>
>>>>
>>>> On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
>>>>
>>>>> Hi Damien, I'm not sure I follow.  My worry was that, if I built a
>>>>> replicator which only queried _changes to get the list of updates, I'd
>>>>> have to be prepared to process a very large response.  I thought one
>>>>> smart way to process this response was to throttle the download at the
>>>>> TCP level by putting the socket into passive mode.
>>>>
>>>> You will have a very large response, but you can stream it, processing
>>>> one line at a time, then you discard the line and process the next. As
>>>> long as the writer is using a blocking socket and the reader is only
>>>> reading as much data as necessary to process a line, you never need to
>>>> store much of the data in memory on either side. But it seems the HTTP
>>>> client is buffering the data as it comes in, perhaps unintentionally.
>>>>
>>>> With TCP, the sending side will only send so much data before getting
>>>> an ACK (an acknowledgment that the packets sent were actually
>>>> received). When an ACK isn't received, the sender stops sending, and
>>>> the TCP calls will block at the sender (or return an error if the
>>>> socket is in non-blocking mode) until it gets a response or the socket
>>>> times out.
>>>>
>>>> So if you have a non-buffering reader and a blocking sender, then you
>>>> can stream the data and only relatively small amounts of data are
>>>> buffered at any time. The problem is the reader in the HTTP client isn't
>>>> waiting for the data to be demanded at all; instead, as soon as data
>>>> comes in, it sends it to a receiving Erlang process. Sending a message
>>>> to an Erlang process never blocks, so there is no limit to the amount of
>>>> data buffered. So if the Erlang process can't process the data fast
>>>> enough, it starts getting buffered in its mailbox, consuming unlimited
>>>> memory.
>>>>
>>>> Assuming I understand the problem correctly, the way to fix it is to
>>>> have the HTTP client not read the data until it's demanded by the
>>>> consuming process. Then we are only using the default TCP buffers, not
>>>> the Erlang message queues as a buffer, and the total amount of memory
>>>> used at any time is small.
>>>>
>>>
>>> Dunno about HTTP clients, but when I was playing around with gen_tcp a
>>> week or two ago I found a parameter to opening a socket that is
>>> something like {active, false} that affects this specific
>>> functionality. Active sockets send TCP data as Erlang messages,
>>> inactive sockets don't and you have to get the data with
>>> gen_tcp:recv(Sock, Length).
>>>
>>> I haven't the foggiest if the HTTP bits expose any of that though.
>>
>> As far as I can tell, the {stream,{self,once}} translates to an
>> inet:setopts(socket(), [{active,once}]), which accomplishes the same
>> basic goal as {active,false}, just with repeated calls to
>> setopts(Sock,[{active,once}]) instead of gen_tcp:recv(Sock, Length).  I
>> must be missing something, though, because clearly I'm getting more
>> messages than I asked for.
>>
>> I'm sure I could cook up something simple using gen_tcp directly, but
>> even then I'll have to deal with authentication, SSL, etc. so I'd prefer
>> to use a full-fledged HTTP client if I can get it to work.  Best,
>>
>
> Oscar from Erlang Training and Consulting has just open-sourced one of
> their HTTP clients, which may be a better fit than ibrowse as it seems
> to be a much thinner layer. There is some discussion of it on the
> Erlang Questions list, but the most useful link is probably to the
> source code:
>
> http://bitbucket.org/etc/lhttpc
>
> This does not support streaming the response body yet, but Oscar's
> told me that it shouldn't be hard to add. So this may be just the
> thing for getting a raw connection to the socket, without having to
> worry about auth, ssl, etc.

Hi Chris, I saw Oscar's announcement on erlang-questions and checked  
out the code.  It definitely won't work for us without the ability to  
read a chunked response, but if he adds that and the streaming option  
I'm sure it'll be worth a closer look.
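
For reference, the demand-driven reading discussed in the quoted text can be
sketched with plain gen_tcp. This is only a minimal illustration, not what
any HTTP client does internally: the module name line_stream, its function
names, and the throwaway local server in demo/0 are all hypothetical, and
there is no HTTP, chunked decoding, auth or SSL here. It assumes
newline-terminated lines that fit in the receive buffer. Both loops only
pull the next line after the previous one has been handled, so unread data
stays in the TCP buffers rather than in the process mailbox.

```erlang
-module(line_stream).
-export([count_passive/2, count_active_once/2, demo/0]).

%% Pull model: with {active, false} nothing is read from the socket until
%% we call gen_tcp:recv/2. {packet, line} makes each recv return one line.
count_passive(Host, Port) ->
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {packet, line}, {active, false}]),
    passive_loop(Sock, 0).

passive_loop(Sock, N) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, _Line}     -> passive_loop(Sock, N + 1); % handle, then discard
        {error, closed} -> N;
        {error, Reason} -> {error, Reason}
    end.

%% Same goal with {active, once}: each setopts call asks for exactly one
%% message, after which the socket falls back to passive mode.
count_active_once(Host, Port) ->
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {packet, line}, {active, false}]),
    active_once_loop(Sock, 0).

active_once_loop(Sock, N) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, _Line}        -> active_once_loop(Sock, N + 1);
        {tcp_closed, Sock}        -> N;
        {tcp_error, Sock, Reason} -> {error, Reason}
    end.

%% Hypothetical local test server: accepts two connections on an ephemeral
%% loopback port and sends three lines to each before closing.
demo() ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false},
                                 {ip, {127,0,0,1}}]),
    {ok, Port} = inet:port(L),
    Lines = [<<"{\"seq\":1}\n">>, <<"{\"seq\":2}\n">>, <<"{\"seq\":3}\n">>],
    spawn_link(fun() -> serve(L, Lines, 2) end),
    Passive = count_passive({127,0,0,1}, Port),
    ActiveOnce = count_active_once({127,0,0,1}, Port),
    ok = gen_tcp:close(L),
    {Passive, ActiveOnce}.

serve(_L, _Lines, 0) -> ok;
serve(L, Lines, K) ->
    {ok, S} = gen_tcp:accept(L),
    ok = gen_tcp:send(S, Lines),
    ok = gen_tcp:close(S),
    serve(L, Lines, K - 1).
```

After compiling with erlc, calling line_stream:demo() from the shell runs
both readers against the local server and returns the two line counts as a
tuple.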

In my opinion it's unfortunate that we have a proliferation of HTTP  
clients instead of one really solid implementation.  I'm sure ETC had  
its reasons for starting from scratch instead of contributing to  
ibrowse or inets.  Oscar certainly did a great job of describing the  
limitations of both in these two messages:

http://groups.google.com/group/erlang-programming/browse_thread/thread/bc5db72fbe2ac9c7
http://groups.google.com/group/erlang-programming/browse_thread/thread/a896b641348a50ca

Cheers, Adam
