couchdb-replication mailing list archives

From Yaron Goland <yar...@microsoft.com>
Subject RE: _bulk_get protocol extension
Date Tue, 28 Jan 2014 16:52:56 GMT
I did read it and I didn't agree with it.

> 	* A single slow response blocks all requests behind it.

The same is true of bulk get. Remember, the only requests that can be pipelined are idempotent
methods, which generally means GET. So if a single GET can slow down the whole pipeline, then
a single 'virtual' GET inside a bulk GET request can slow down the response just as well.
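To make that concrete, here's a rough Python sketch of what pipelining looks like on the wire
(the host, port and document paths are invented for illustration): every GET is written before
any response is read, and responses come back strictly in request order, so one slow response
holds up everything queued behind it, exactly like one slow entry in a bulk GET body.

    import socket

    HOST, PORT = "db.example.com", 5984            # hypothetical server
    paths = ["/db/doc1", "/db/doc2", "/db/doc3"]   # hypothetical documents

    sock = socket.create_connection((HOST, PORT))

    # Write all requests back to back -- this is the pipeline.
    for i, p in enumerate(paths):
        last = (i == len(paths) - 1)
        request = (
            f"GET {p} HTTP/1.1\r\n"
            f"Host: {HOST}\r\n"
            + ("Connection: close\r\n" if last else "")
            + "\r\n"
        )
        sock.sendall(request.encode("ascii"))

    # Responses arrive strictly in request order; read until the server
    # closes the connection after answering the last (Connection: close) request.
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()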

> 	* When processing in parallel, servers must buffer pipelined
> responses, which may exhaust server resources; e.g., what if one of the
> responses is very large? This exposes an attack vector against the server!

This is the whole point of flow control in TCP. The server only pulls off what it can handle.
If a client sends more requests than the server can handle, then the server stops servicing
the buffer and TCP flow control automatically pushes back on the client.
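As a sketch of that mechanism (nothing CouchDB-specific here; the port and the artificial delay
are made up), a server that only reads from the socket as fast as it can process lets the kernel
receive buffer fill, and once it does, TCP's window closes and the client's send() simply blocks:

    import socket
    import time

    srv = socket.socket()
    srv.bind(("127.0.0.1", 8080))     # hypothetical local port
    srv.listen(1)
    conn, _ = srv.accept()

    while True:
        data = conn.recv(1024)        # pull off only what we can handle
        if not data:
            break
        time.sleep(0.5)               # simulate slow request processing;
                                      # unread data piles up in the kernel
                                      # buffer until TCP pushes back on the
                                      # sender -- no unbounded buffering here

    conn.close()
    srv.close()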

Put another way, if this attack works then a client can replicate it without pipelining just
by making multiple independent requests. So either a server protects itself from DoS by clients
or it doesn't; pipelining doesn't change anything.

> 	* A failed response may terminate the TCP connection, forcing the
> client to re-request all the subsequent resources, which may cause duplicate
> processing.

Certainly nothing in HTTP requires such termination, so what this point really says is 'bad
clients will throw exceptions on non-200 responses'. Well, bad clients are going to do a lot
of silly things. If they use decent libraries (e.g. Apache, .NET, etc.) this isn't
a problem, because the exception won't terminate the connection. The connection is actually
part of a pool and is managed separately.
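For example, even Python's bare-bones http.client behaves this way (host and paths invented for
illustration): a 404 is just a status on the response, and the same connection is immediately
reusable:

    import http.client

    conn = http.client.HTTPConnection("db.example.com")   # hypothetical host

    conn.request("GET", "/db/missing-doc")
    resp = conn.getresponse()
    print(resp.status)        # e.g. 404 -- an error status, not an exception
    resp.read()               # drain the body so the connection can be reused

    conn.request("GET", "/db/next-doc")    # same TCP connection, still usable
    print(conn.getresponse().status)

    conn.close()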

So yes, bad clients will do bad things, but that applies no matter what, so I don't see it as
worth worrying about.

> 	* Detecting pipelining compatibility reliably, where intermediaries
> may be present, is a nontrivial problem.

Pipelining is point to point, not end to end. In other words, if the intermediary is returning
1.1 responses then it is a 1.1 intermediary; otherwise its job is to return 1.0 even if the
upstream system it's talking to is 1.1. So pipelining happens hop by hop, and each hop only
needs to probe its next hop.
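In practice the probe is a single request: look at the HTTP version the next hop answers with
and only pipeline against it if it's 1.1. A minimal sketch in Python (the host argument is
whatever hop you're about to talk to):

    import http.client

    def next_hop_supports_pipelining(host):
        # Issue one ordinary request and check the HTTP version the hop
        # answers with; http.client reports 10 (HTTP/1.0) or 11 (HTTP/1.1).
        conn = http.client.HTTPConnection(host)
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()
        conn.close()
        return resp.version >= 11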

> 	* Some intermediaries do not support pipelining and may abort the
> connection, while others may serialize all requests.

Intermediaries that don't support pipelining publish 1.0 for just that reason. And serialization
is always a possibility but the server can do the same serialization. So yes, bad infrastructure
is bad infrastructure. But that isn't a reason to abandon the protocol and invent a new protocol
to crawl through the old one.

So personally I'm having trouble buying the protocol argument. But you make two arguments
in your email that seem well positioned for a really productive conversation.

Your first argument is that the overhead of GET is so bad that, even with pipelining, the
performance will still be significantly worse than a bulk request. Well, you said you already
implemented bulk requests. So um... why not publish some numbers and the code you used to
generate them?

The same argument applies to compression and the benefit of compressing similar data together.
You said you already have this up and running. So why not just publish some numbers comparing a
non-pipelined connection, a pipelined connection, and your bulk GET? You can show latency,
bandwidth and CPU load.
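Even a crude harness along the following lines would put numbers on the table for the
non-pipelined-versus-bulk end of the comparison (the _bulk_get URL, request body and document
IDs are placeholders for whatever your implementation actually accepts):

    import json
    import time
    import urllib.request

    BASE = "http://127.0.0.1:5984/db"              # hypothetical database
    doc_ids = [f"doc-{i}" for i in range(1000)]    # hypothetical document IDs

    def timed(label, fn):
        start = time.perf_counter()
        fn()
        print(f"{label}: {time.perf_counter() - start:.3f}s")

    def individual_gets():
        # One plain GET per document, each on its own connection.
        for doc_id in doc_ids:
            with urllib.request.urlopen(f"{BASE}/{doc_id}") as resp:
                resp.read()

    def bulk_get():
        # One POST carrying all the document IDs at once.
        body = json.dumps({"docs": [{"id": d} for d in doc_ids]}).encode()
        req = urllib.request.Request(
            f"{BASE}/_bulk_get", data=body,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            resp.read()

    timed("individual GETs", individual_gets)
    timed("single bulk GET", bulk_get)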

I suspect those numbers would make for a more productive conversation.

	Thanks,

			Yaron

> -----Original Message-----
> From: Jens Alfke [mailto:jens@couchbase.com]
> Sent: Monday, January 27, 2014 9:13 PM
> To: replication@couchdb.apache.org
> Subject: Re: _bulk_get protocol extension
> 
> 
> On Jan 27, 2014, at 7:26 PM, Yaron Goland <yarong@microsoft.com> wrote:
> 
> > Nevertheless he did say that so long as one probes the connection then
> pipelining is known to work. Probing just means that you can't assume that
> the server you are talking to is a 1.1 server and therefore supports pipelining.
> 
> Well, yes, that's pretty clear - I mean, I know pipelining's been
> implemented. (And on iOS and Mac the frameworks already know how to
> support pipelining, so one doesn't have to do the probing oneself.)
> 
> The problems with pipelining are higher level than that. Did you read the text
> by Ilya Grigorik that I linked to? Here's another excerpt:
> 
> 	* A single slow response blocks all requests behind it.
> 	* When processing in parallel, servers must buffer pipelined
> responses, which may exhaust server resources; e.g., what if one of the
> responses is very large? This exposes an attack vector against the server!
> 	* A failed response may terminate the TCP connection, forcing the
> client to re-request all the subsequent resources, which may cause duplicate
> processing.
> 	* Detecting pipelining compatibility reliably, where intermediaries
> may be present, is a nontrivial problem.
> 	* Some intermediaries do not support pipelining and may abort the
> connection, while others may serialize all requests.
> - http://chimera.labs.oreilly.com/books/1230000000545/ch11.html#HTTP_PIPELINING
> 
> (Now, HTTP 2.0 is adding multiplexing, which alleviates most of those
> problems. I'll be happy when we get to use it, but that probably won't be for
> a year or two at least.)
> 
> I also mentioned the overhead of issuing a bunch of HTTP requests versus
> just one. As a thought experiment, consider fetching a one-megabyte HTTP
> resource by using a thousand byte-range GET requests each requesting 1K of
> the file. Would this take longer than issuing a single GET request for the
> entire resource? Yeah, and probably a lot longer, even with pipelining. The
> client and the server both introduce overhead in handling requests.
> 
> Finally, consider that putting a number of related resources together into a
> single body enables better compression, since general-purpose compression
> algorithms look for repeated patterns. If I have a thousand small documents
> each of which contains a property named "this_is_my_custom_property",
> then if all those documents are returned in one response each instance of
> that string will get compressed down to a very short token. If they're
> separate responses, the string won't get compressed.
> 
> -Jens
