hadoop-common-dev mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Design for security in Hadoop
Date Thu, 19 Mar 2009 11:31:15 GMT
Amandeep Khurana wrote:
> Apparently, the file attached was stripped off. Here's the link for where you
> can get it:
> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
> Amandeep

This is a good paper, with test data to go alongside the theory.
-I'd cite NFS as a good equivalent design: the same "we trust you to be
who you say you are" protocol, and similar assumptions about the network
("only trusted machines get on it").
-If EC2 does not meet these requirements, you could argue it's a fault
of EC2; there's no fundamental reason why it can't offer private VPNs
for clusters the way other infrastructure (VMware) can.
-the whoami call is done by the command-line client; different clients
don't even have to do that. Mine doesn't.
-it is not the "superuser" in the unix sense, "root", that runs jobs; it
is whichever user started hadoop on that node. That can still be a
locked-down user with limited machine rights.
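To make the whoami point concrete, here is a minimal sketch of the trust
model as the paper describes it: the client, not the server, decides which
identity it presents. The class and method names are illustrative only,
not Hadoop's actual RPC classes.

```java
// Sketch of the client-decides-identity model described above.
// Nothing here is Hadoop code; names are hypothetical.
public class ClientIdentity {
    // A stock command-line client asks the OS who is running it...
    static String defaultUser() {
        return System.getProperty("user.name"); // effectively `whoami`
    }

    // ...but nothing forces a client to do so: any string it chooses
    // to send is accepted as the caller's identity.
    static String spoofedUser() {
        return "someone-else";
    }
}
```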

-unauthorised nodes spoofing other IP addresses (via ARP attacks) and
becoming nodes in the cluster. You could acquire and then keep or
destroy data, or pretend to do work and return false values. Or come up
as a spoof namenode or datanode and disrupt all work.
-denial of service attacks: too many heartbeats, etc
-spoof clients running malicious code on the tasktrackers.
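For the heartbeat-flood case, one mitigation would be a simple per-node
rate limit on the server side. This is not anything Hadoop does; it is a
hypothetical sketch of the idea, with made-up names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-node throttle: reject heartbeats arriving faster
// than a fixed minimum interval. A sketch only, not Hadoop code.
public class HeartbeatThrottle {
    private final long minIntervalMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true if this heartbeat is within the allowed rate. */
    public synchronized boolean accept(String nodeId, long nowMillis) {
        Long last = lastSeen.get(nodeId);
        if (last != null && nowMillis - last < minIntervalMillis) {
            return false; // too fast: drop it (and log it for auditing)
        }
        lastSeen.put(nodeId, nowMillis);
        return true;
    }
}
```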

-SSL does need to deal with trust; unless you want to pay for every
server certificate (you may be able to share them), you'll need to set
up your own CA and issue private certs -leaving you with the problem of
securely distributing the CA public keys and getting SSL private keys
out to the nodes securely (and stopping anything on the net from using
your kickstart server to boot a VM with the same MAC address as a
trusted server just to get at those keys).
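The private-CA option above amounts to shipping each node the CA's public
certificate and trusting anything it signed. A sketch of that in plain
JSSE, assuming the CA certificate arrives as a PEM stream (the file and
alias names are assumptions):

```java
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Build an SSLContext that trusts only a private cluster CA.
// A sketch of the technique, not any Hadoop code.
public class PrivateCaTrust {
    static SSLContext contextTrusting(InputStream caCertPem) throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate ca = cf.generateCertificate(caCertPem);

        // In-memory truststore holding only our private CA.
        KeyStore ts = KeyStore.getInstance(KeyStore.getDefaultType());
        ts.load(null, null);
        ts.setCertificateEntry("cluster-ca", ca);

        TrustManagerFactory tmf = TrustManagerFactory
            .getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ts);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}
```

Distributing the CA cert itself, and each node's private key, is exactly
the bootstrapping problem flagged above; this only covers what a node
does once it has them.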

-I'll have to get somebody who understands security protocols to review 
the paper. One area I'd flag as trouble is that on virtual machines, 
clock drift can be choppy and non-linear. You also have to worry about 
clients not being in the right time zone. It is good for everything to 
work off one clock (say the namenode) rather than their own. Amazon's S3 
authentication protocol has this bug, as do the bits of WS-DM which take 
absolute times rather than relative ones (presumably to make operations 
idempotent). At the very least, the namenode needs an operation to
return its current time, which callers can then work off.
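The work-off-the-namenode's-clock idea could look something like the
following: the client asks the server for its time once, estimates the
offset (splitting the round trip evenly), and uses that instead of its
own clock. The interface and names are hypothetical, not Hadoop's API.

```java
// Sketch: clients track an offset from the namenode's clock rather
// than trusting their own possibly-drifting, wrong-timezone clocks.
public class ClusterClock {
    interface RemoteClock {
        long currentTimeMillis(); // imagine this as an RPC to the namenode
    }

    private final long offsetMillis;

    /** Estimate the server offset, splitting the round trip evenly. */
    public ClusterClock(RemoteClock namenode) {
        long before = System.currentTimeMillis();
        long server = namenode.currentTimeMillis();
        long after = System.currentTimeMillis();
        long midpoint = before + (after - before) / 2;
        this.offsetMillis = server - midpoint;
    }

    /** The namenode's clock, as this client best estimates it. */
    public long now() {
        return System.currentTimeMillis() + offsetMillis;
    }

    public long offset() {
        return offsetMillis;
    }
}
```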

-any implementation should be allowed to use different (userid,
credentials) than (whoami, ~/.hadoop). This is to allow workflow
servers and the like to schedule work as different users.
-the server side should log successes/failures to different log
categories; with that and JMX instrumentation you can track security
attacks.
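Both points above can be sketched together: credentials passed as an
explicit parameter rather than derived from (whoami, ~/.hadoop), and
authentication outcomes sent to distinct log categories that monitoring
can count. All names here are illustrative, not an actual Hadoop API.

```java
import java.util.logging.Logger;

// Sketch: explicit credentials plus separate success/failure
// log categories for audit and monitoring. Hypothetical names.
public class AuthGateway {
    private static final Logger SUCCESS = Logger.getLogger("security.auth.success");
    private static final Logger FAILURE = Logger.getLogger("security.auth.failure");

    public static final class Credentials {
        final String userId;
        final String secret;
        public Credentials(String userId, String secret) {
            this.userId = userId;
            this.secret = secret;
        }
    }

    // A workflow server can submit work as any user whose
    // credentials it holds; the caller's OS identity is irrelevant.
    public boolean authenticate(Credentials c, String expectedSecret) {
        boolean ok = c.secret.equals(expectedSecret);
        if (ok) {
            SUCCESS.info("user " + c.userId + " authenticated");
        } else {
            FAILURE.warning("failed authentication attempt for " + c.userId);
        }
        return ok;
    }
}
```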

Overall, a nice paper. Do you have the patches to try it out on a bigger 
