Steve Loughran
Re: hadoop on EC2
Wed, 04 Jun 2008 11:10:46 GMT
Andreas Kostyrka wrote:
> Well, the basic "trouble" with EC2 is that clusters usually are not networks 
> in the TCP/IP sense.
> This makes it painful to decide which URLs should be resolved where.
> Plus, to make it even more painful, you cannot easily run it with one simple 
> SOCKS server, because you need to defer DNS resolution to inside the 
> cluster: VM names resolve to external IPs, while the webservers 
> we are all interested in reside on the internal 10/8 IPs.
> Another fun item is that in many situations you will have multiple islands 
> inside EC2 (a contractor working for multiple customers that have EC2 
> deployments comes to mind), so you cannot just route everything over one pipe 
> into EC2.
> My current setup relies on a very long list of -L ssh tunnel forwards plus 
> iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get 
> redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via 
> ssh. (Implementation left as an exercise for the reader, or my ugly script 
> without error checking is available on request :-P)
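
For anyone following along, here is a minimal Python sketch of how such a setup could be generated. This is not Andreas's script. The addresses, host names, gateway login and starting local port are hypothetical, and the code only prints the commands instead of running them.

```python
import shlex

# Sketch only: builds one ssh command with a -L forward per VM, plus the
# matching iptables REDIRECT rules. All hosts and ports are hypothetical.

def build_commands(vms, gateway, remote_port=50030, first_local_port=15000):
    """vms is a list of (external_ip, internal_name) pairs."""
    forwards = []
    rules = []
    for offset, (external_ip, internal_name) in enumerate(vms):
        local_port = first_local_port + offset
        # localhost:local_port -> internal_name:remote_port, resolved on the gateway
        forwards += ["-L", f"{local_port}:{internal_name}:{remote_port}"]
        # Locally generated traffic to the external IP is redirected to local_port
        rules.append([
            "iptables", "-t", "nat", "-A", "OUTPUT", "-p", "tcp",
            "-d", external_ip, "--dport", str(remote_port),
            "-j", "REDIRECT", "--to-ports", str(local_port),
        ])
    ssh_cmd = ["ssh", "-N"] + forwards + [gateway]
    return ssh_cmd, rules


if __name__ == "__main__":
    ssh_cmd, rules = build_commands(
        [("203.0.113.10", "vm1.internal"), ("203.0.113.11", "vm2.internal")],
        "user@gateway.example.com",
    )
    print(shlex.join(ssh_cmd))
    for rule in rules:
        print(shlex.join(rule))
```

The iptables rules need root, and the internal names are resolved on the far side of the tunnel, which is the point of the exercise.
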
> If one wanted a more generic solution to redirect TCP ports via an 
> ssh SOCKS tunnel (aka "dynamic port forwarding"), the following components 
> would be needed:
> -) a list of rules for what gets forwarded where and how.
> -) a DNS resolver that issues fake IP addresses to capture the "name" of the 
> connected host.
> -) a small forwarding script that checks the "real destination IP" to decide 
> which IP address/port is being requested. (Hint: current Linux kernels don't 
> use getsockname anymore, the real destination is carried nowadays as a socket 
> option)
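
The socket option hinted at here is presumably SO_ORIGINAL_DST. A minimal sketch of reading it from Python, assuming Linux, IPv4 and a connection that arrived via an iptables REDIRECT rule; the constant values are copied from the kernel headers because Python's socket module does not export SO_ORIGINAL_DST, and the function names are my own:

```python
import socket
import struct

SOL_IP = 0            # IP-level socket options
SO_ORIGINAL_DST = 80  # from <linux/netfilter_ipv4.h>


def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into (ip, port)."""
    # Layout: 2 bytes family, 2 bytes port (network order), 4 bytes address,
    # 8 bytes of padding.
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(conn):
    """Return the pre-redirect (ip, port) of an accepted TCP connection."""
    raw = conn.getsockopt(SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

A forwarder would call original_destination() on each socket returned by accept() and look the result up in its rule list.
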
> One of the uglier parts, for which I have found no "real" solution, is the 
> fact that one cannot be sure that ssh will be able to listen on a given port. 
> Solutions I've found include:
> -) check the port before issuing ssh (Race condition warning: Going through 
> this hole the whole federation star fleet could get lost.)
> -) using some kind of expect to drive ssh through a pty.
> -) roll your own ssh tunnel solution. The only lib that comes to my mind is 
> Twisted, in which case one could ignore the need for the SOCKS protocol.
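
The first option can be sketched in a few lines of Python. The race is still there: another process can take the port between the check and ssh starting. The sketch therefore also passes OpenSSH's ExitOnForwardFailure option, so ssh exits instead of carrying on without the forward. The target host, port and gateway login are hypothetical.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the kernel for a currently unused TCP port on host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))  # port 0 lets the kernel pick a free port
        return probe.getsockname()[1]


def build_ssh_command(local_port, target_host, target_port, gateway):
    """Build an ssh -L command that fails loudly if the listen port is taken."""
    return [
        "ssh", "-N",
        "-o", "ExitOnForwardFailure=yes",
        "-L", f"{local_port}:{target_host}:{target_port}",
        gateway,
    ]


if __name__ == "__main__":
    port = find_free_port()
    print(" ".join(build_ssh_command(port, "vm1.internal", 50030,
                                     "user@gateway.example.com")))
```

A caller could retry with a fresh port whenever ssh exits immediately with a non-zero status.
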
> But luckily for us, the solution is easier, because we only need to tunnel 
> http in the hadoop case, which has the great benefit that we do not need to 
> capture the hostname, because http carries the hostname inside the payload.

Do you worry about, or address, the risk of someone like me bringing up a 
machine in the EC2 farm that then portscans all the near neighbours in the 
address space for open hdfs data node/name node ports, and strikes up a 
conversation with your filesystem?

Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

