hadoop-general mailing list archives

From Sanjay Radia <sra...@yahoo-inc.com>
Subject Re: HTTP transport?
Date Fri, 09 Oct 2009 18:13:09 GMT

On 10/9/09 10:49 AM, "Doug Cutting" <cutting@apache.org> wrote:

> Owen O'Malley wrote:
>> SPNEGO is the standard method of using Kerberos with HTTP and we are
>> planning to use that for the web UIs.
> Java 6 also supports using SPNEGO for RPC over HTTP out of the box:
> http://java.sun.com/javase/6/docs/technotes/guides/net/http-auth.html
>> I also have serious doubts about performance, but that is hard to answer
>> until we have code to test.
> The good news is that, since the HTTP stuff is already implemented, we
> can test its performance easily.  Performance of insecure access over
> HTTP looks good so far.  It's an open question how much HTTP-based
> security will slow things versus non-HTTP-based security.
>> It is an interesting question how much we
>> depend on being able to answer queries out of order. There are some
>> parts of the code where overlapping requests from the same client
>> matter. In particular, the terasort scheduler uses threads to access the
>> namenode. That would stop providing any pipelining, which I believe
>> would be significant.
> No, we wouldn't stop any pipelining, we'd just use more connections to
> implement it.  With HttpClient one can limit the number of pooled
> connections per host:
> http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/MultiThreadedHttpConnectionManager.html#setMaxConnectionsPerHost%28int%29
> Connections are not free of course, but Jetty has been benchmarked at
> 20,000 concurrent connections:
> http://cometdaily.com/2008/01/07/20000-reasons-that-comet-scales/
>> In short, I think that an HTTP transport is great for playing with, but
>> I don't think you can assume it will work as the primary transport.
> I agree, we cannot assume it.  But it's easy to try it and see how it
> fares.  Any investment in getting it working is perhaps not wasted,
> since, besides providing a performance baseline, it also may be useful
> to provide HTTP-based access to services even if a higher-performance
> option is implemented.
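
A minimal sketch of the point quoted above, that overlapping requests from one client can be kept by spreading them over several connections while capping how many go to a single host. It uses only the JDK. `PerHostLimiter` and its methods are hypothetical names for illustration, not HttpClient's API; the `setMaxConnectionsPerHost` method linked above plays the comparable role for real pooled connections.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: caps the number of requests in flight to any one host.
// Each permit stands in for one pooled connection to that host.
class PerHostLimiter {
    private final int maxPerHost;
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();
    // Bookkeeping for the demo only: requests in flight across all hosts, and the peak seen.
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    PerHostLimiter(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    // Runs the request once a permit for the host is free; blocks the caller otherwise.
    <T> T call(String host, Supplier<T> request) {
        Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
        s.acquireUninterruptibly();
        try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            return request.get();
        } finally {
            inFlight.decrementAndGet();
            s.release();
        }
    }

    int peakInFlight() {
        return peak.get();
    }
}
```

With a limit of 2, eight threads calling `call("namenode", ...)` still overlap two at a time; the remaining callers wait for a permit instead of being serialized behind a single connection.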

Will the RPC over HTTP be transparent so that we can replace it with a
different layer if needed?
My worry was the separation of data and checksums; someone had mentioned
that one could do this over 2 RPCs - that is not transparent.
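
To make the transparency concern concrete, here is a rough sketch, under stated assumptions, of a call that returns data and checksum together behind a transport interface, so the caller does not know or care whether HTTP, sockets, or something else carries it. All names (`BlockReply`, `BlockTransport`, `InMemoryTransport`, `BlockClient`, `Checksums`) are hypothetical and are not Hadoop classes; the in-memory transport merely stands in for a real one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical reply type: data and its checksum travel together in one response.
final class BlockReply {
    final byte[] data;
    final long crc32;

    BlockReply(byte[] data, long crc32) {
        this.data = data;
        this.crc32 = crc32;
    }
}

// Hypothetical transport boundary: callers depend only on this interface.
interface BlockTransport {
    BlockReply readBlock(long blockId);
}

final class Checksums {
    static long crc32(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }
}

// Stand-in implementation backed by a map; an HTTP or socket version would
// implement the same interface without any change to BlockClient.
final class InMemoryTransport implements BlockTransport {
    private final Map<Long, byte[]> blocks = new HashMap<>();

    void put(long blockId, byte[] data) {
        blocks.put(blockId, data.clone());
    }

    @Override
    public BlockReply readBlock(long blockId) {
        byte[] data = blocks.get(blockId);
        if (data == null) {
            throw new IllegalArgumentException("unknown block " + blockId);
        }
        return new BlockReply(data.clone(), Checksums.crc32(data));
    }
}

// The client verifies the checksum but never sees how the reply was carried.
final class BlockClient {
    private final BlockTransport transport;

    BlockClient(BlockTransport transport) {
        this.transport = transport;
    }

    byte[] read(long blockId) {
        BlockReply reply = transport.readBlock(blockId);
        if (Checksums.crc32(reply.data) != reply.crc32) {
            throw new IllegalStateException("checksum mismatch for block " + blockId);
        }
        return reply.data;
    }
}
```

If data and checksums instead had to be fetched with two separate calls, `BlockClient` would need to know about both, which is the loss of transparency described above.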

Also the other issue is porting from data transfer socket streams to RPC -
that port will not be transparent. We cannot afford to lose performance
over that change. Further, moving from streaming sockets to RPC is a very
significant code change to the dfs-client and data nodes. I assume that we
are going to create a branch that moves the data transfer protocols to RPC,
test the performance, and, if it is good, commit and move to RPC?
I am worried about this part - I am surprised that you two are not. Am I
missing something here?


> Doug
