hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: hadoop on EC2
Date Wed, 04 Jun 2008 00:04:13 GMT
Well, the basic "trouble" with EC2 is that clusters usually are not networks 
in the TCP/IP sense.

This makes it painful to decide which URLs should be resolved where.

Plus, to make it even more painful, you cannot easily run it with one simple 
SOCKS server, because you need to defer DNS resolution to the inside of the 
cluster: resolved from outside, VM names map to external IPs, while the 
webservers we'd all be interested in reside on the internal 10/8 IPs.

Another fun item is that in many situations you will have multiple islands 
inside EC2 (a contractor working for multiple customers that have EC2 
deployments comes to mind), so you cannot just route everything over one pipe 
into EC2.

My current setup relies on a very long list of ssh -L tunnel forwards plus 
iptables rules in the OUTPUT chain of the nat table that make 
external-ip-of-vm1:50030 get redirected to localhost:SOMEPORT, which in turn 
is forwarded to name-of-vm1:50030 via ssh. (Implementation left as an 
exercise for the reader, or my ugly non-error-checking script is available on 
request :-P)
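
A minimal sketch of how such a setup could be generated (this is not the 
script mentioned above). The gateway login, the external IPs, and the local 
port range are made-up placeholders, and the code only prints the commands 
instead of running them:

```python
import shlex

# Hypothetical inventory: (internal name, external IP) per VM. Placeholders only.
VMS = [
    ("name-of-vm1", "203.0.113.11"),
    ("name-of-vm2", "203.0.113.12"),
]
GATEWAY = "root@name-of-vm1"   # assumed ssh login on one cluster node
REMOTE_PORT = 50030            # the web UI port used in the text
FIRST_LOCAL_PORT = 15000       # arbitrary start of the local port range


def build_commands(vms, gateway, remote_port, first_local_port):
    """Return (ssh_argv, list_of_iptables_argv) for the given VMs."""
    ssh = ["ssh", "-N"]
    rules = []
    for offset, (name, external_ip) in enumerate(vms):
        local_port = first_local_port + offset
        # localhost:local_port -> name:remote_port, name resolved on the gateway
        ssh += ["-L", f"{local_port}:{name}:{remote_port}"]
        # locally generated traffic to external_ip:remote_port -> local_port
        rules.append([
            "iptables", "-t", "nat", "-A", "OUTPUT", "-p", "tcp",
            "-d", external_ip, "--dport", str(remote_port),
            "-j", "REDIRECT", "--to-ports", str(local_port),
        ])
    ssh.append(gateway)
    return ssh, rules


if __name__ == "__main__":
    ssh_cmd, iptables_cmds = build_commands(
        VMS, GATEWAY, REMOTE_PORT, FIRST_LOCAL_PORT)
    print(shlex.join(ssh_cmd))
    for cmd in iptables_cmds:
        print(shlex.join(cmd))
```

One ssh process carries all forwards here; the iptables rules need root and 
only affect connections opened from the local machine.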

If one wanted a more generic solution to redirect TCP ports via an ssh SOCKS 
tunnel (aka "dynamic port forwarding"), the following components would be 
needed:

-) a list of rules for what gets forwarded where and how.
-) a DNS resolver that issues fake IP addresses to capture the "name" of the 
connected host.
-) a small forwarding script that checks the "real destination IP" to decide 
which IP address/port is being requested. (Hint: current Linux kernels don't 
use getsockname anymore, the real destination is carried nowadays as a socket 
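
Assuming the hint refers to netfilter's SO_ORIGINAL_DST socket option (the 
sentence is cut off here, so this reading is a guess), a forwarding script 
could look up the real destination of a redirected connection roughly like 
this. Linux and IPv4 only; the function names are made up for illustration:

```python
import socket
import struct

# Value from <linux/netfilter_ipv4.h>; the socket module does not export it.
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw):
    """Decode the start of a struct sockaddr_in into (ip, port)."""
    # Skip the 2-byte family field; port and address are in network byte order.
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(conn):
    """Return (ip, port) a redirected TCP connection was originally sent to."""
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

original_destination() is meant to be called on a socket returned by accept() 
for a connection that an iptables REDIRECT or DNAT rule sent to the listener.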

One of the uglier parts, for which I have found no "real" solution, is the 
fact that one cannot be sure that ssh will be able to listen on a given port.

Solutions I've found include:
-) check the port before issuing ssh (Race condition warning: going through 
this hole, the whole Federation star fleet could get lost.)
-) use some kind of expect to drive ssh through a pty.
-) roll your own ssh tunnel solution. The only lib that comes to my mind is 
Twisted, in which case one could ignore the need for the SOCKS protocol.
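
The first option could be sketched as below. The helper name is made up, and 
the check is exactly as racy as warned: another process can grab the port 
between this test and the moment ssh tries to bind it.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if host:port could be bound at the moment of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

A caller would test each candidate local port with port_is_free() and only 
then pass it to ssh -L.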

But luckily for us, the solution is easier, because we only need to tunnel 
HTTP in the Hadoop case. That has the big benefit that we do not need to 
capture the hostname, because HTTP carries the hostname inside the request 
itself (the Host header).
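
To illustrate the point: the requested host name can be read straight out of 
the request bytes, so a forwarder does not need a fake-DNS trick to learn it. 
A small sketch with a made-up helper name:

```python
def host_from_request(raw_request):
    """Return the Host header value from raw HTTP request bytes, or None."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip().decode("ascii")
    return None
```

HTTP/1.1 clients always send this header; an old HTTP/1.0 client might not, 
in which case the function returns None.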

Not tested, but the following should work:
1.) Set up a proxy somewhere on the cluster. Make it do auth (proxy auth might 
work too, but depending upon how one makes the browser access the proxy this 
might be a bad idea).
2.) Make the client access the proxy for the needed host/port combinations. 
FoxyProxy or similar extensions for Firefox come to mind, or some 
destination NAT rules on your packet firewall should do the trick.
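
For a scripted client, step 2 might look like the sketch below, equally 
untested against a real cluster. The proxy address, the ".internal" name 
suffix, and the port set are assumptions made for illustration, and proxy 
authentication is left out:

```python
import urllib.request
from urllib.parse import urlsplit

PROXY_URL = "http://localhost:3128"  # assumed address where the proxy is reachable
PROXIED_PORTS = {50030}              # web UI port from the text; extend as needed
PROXIED_SUFFIXES = (".internal",)    # assumed suffix of cluster-internal names


def needs_proxy(url):
    """Decide whether a URL matches one of the host/port combinations."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return parts.port in PROXIED_PORTS and host.endswith(PROXIED_SUFFIXES)


def open_url(url, timeout=10):
    """Fetch url, going through the proxy only for cluster URLs."""
    proxies = {"http": PROXY_URL} if needs_proxy(url) else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    return opener.open(url, timeout=timeout)
```

Because the proxy resolves the host name inside the cluster, the client never 
has to know the internal 10/8 addresses.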


On Monday 02 June 2008 20:27:53 Chris K Wensel wrote:
> > obviously this isn't the best solution if you need to let many semi
> > trusted users browse your cluster.
> Actually, it would be much more secure if the tunnel service ran on a
> trusted server letting your users connect remotely via SOCKS and then
> browse the cluster. These users wouldn't need any AWS keys etc.
> Chris K Wensel
> chris@wensel.net
> http://chris.wensel.net/
> http://www.cascading.org/
