hadoop-general mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: HTTP transport?
Date Fri, 09 Oct 2009 19:56:30 GMT
Sanjay Radia wrote:
> Will the RPC over HTTP be transparent so that we can replace it with a
> different layer if needed?
> My worry was the separation of data and checksums; someone had mentioned
> that one could do this over 2 RPCs - that is not transparent.

That was suggested as a possibility if we did not want to use RPC for 
data, but rather raw HTTP, e.g., with a separate URL per block.  The 
zerocopy support built into most HTTP servers only supports entire 
responses from a single file, so if we wanted to take advantage of these 
zerocopy implementations we'd not use RPC for block access, but could 
use HTTP and hence share security, etc.  Using raw HTTP for block access 
might also perform better, since it can use TCP flow control, rather 
than RPC call/response.  In my microbenchmarks, RPC call/response was 
fast enough to easily saturate disks and networks, so that might be 
moot, although RPC call/response for file data may use more CPU than 
we'd like.  With our own transport implementation we could get RPC 
call/response to use zerocopy for file data.
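To make the zerocopy point concrete: on the JVM, the usual route to the kernel's sendfile(2) path is FileChannel.transferTo, which is what most Java HTTP servers use to serve an entire response from a single file. The sketch below is only illustrative (the class name and the in-memory sink are my own; transferTo is genuinely zerocopy only when the target is a socket channel, and falls back to buffered copies otherwise), but the API shape is the same either way:

```java
import java.io.ByteArrayOutputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZeroCopyDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for an on-disk block file.
        Path block = Files.createTempFile("block", ".dat");
        Files.write(block, "block-payload".getBytes());

        // An in-memory sink stands in for the client socket here;
        // with a real SocketChannel target, transferTo can delegate
        // to sendfile(2) and avoid copying through user space.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (FileChannel in = new RandomAccessFile(block.toFile(), "r").getChannel();
             WritableByteChannel out = Channels.newChannel(sink)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(sink);  // prints "block-payload"
        Files.delete(block);
    }
}
```

Note that this transfers one whole file per response, which is exactly why the per-block-URL scheme fits zerocopy so naturally, and why framed RPC call/response does not without a custom transport.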

> I assume that we are going to create a branch that moves the data
> transfer protocols to RPC and test the performance, and if it is good
> then we commit and move to RPC?

Yes.  We obviously cannot change the file data transfer protocol without 
benchmarking.  Ideally file data transfer can share as much as possible 
with other protocols.  The most optimistic approach would be to use 
HTTP-based RPC call/response, so we ought to benchmark that.  This was 
the purpose of my recently-reported microbenchmarks.
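A microbenchmark of that kind can be a very small harness: stand up an HTTP endpoint, time a burst of round trips, and compare the per-call cost against disk and network throughput. The sketch below uses only the JDK's built-in com.sun.net.httpserver and HttpURLConnection; the endpoint path and call count are arbitrary choices of mine, not from the benchmarks reported on the list:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HttpRpcBench {
    public static void main(String[] args) throws Exception {
        // Tiny fixed-response server standing in for an HTTP-based RPC endpoint.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        byte[] reply = "ok".getBytes();
        server.createContext("/call", exchange -> {
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(reply);
            }
        });
        server.start();
        int port = server.getAddress().getPort();

        // Time a burst of sequential call/response round trips.
        int calls = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            HttpURLConnection c = (HttpURLConnection)
                new URL("http://127.0.0.1:" + port + "/call").openConnection();
            try (InputStream in = c.getInputStream()) {
                while (in.read() != -1) { }  // drain the response body
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(calls + " calls in " + elapsedMs + " ms");
        server.stop(0);
    }
}
```

HttpURLConnection keeps persistent connections alive by default, so this measures framed call/response over a reused TCP connection, which is the case that matters when comparing against streaming block transfer.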

We also need to determine whether both TCP flow-control and zerocopy are 
critical to data file performance.  If both are indeed critical, and 
HTTP proves sufficient for everything else, then we should consider 
using non-RPC HTTP for file data transfer, since it supports both 
zerocopy and TCP-based flow control, and the implementation of security, 
etc. could be shared.  But, on the other hand, if HTTP is deemed 
inappropriate for security and we develop our own RPC transport that 
permits zerocopy, and TCP flow-control over entire blocks is not 
required, then we might use RPC for file data.  What I'm hoping we can 
avoid is what we have today: different transports for different 
protocols, re-implementing security, connection pooling, async request 
processing, etc. for each, and requiring separate configuration and 
ports for each.  But even that might be required.  We don't know yet.

I think starting with HTTP as a hypothesis permits us to make progress 
without a lot of up-front investment.

