hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Hadoop and security.
Date Mon, 06 Oct 2008 09:48:14 GMT
Dmitry Pushkarev wrote:
> Dear hadoop users, 
> 
>  
> 
> I'm lucky to work in an academic environment where information security is
> not an issue. However, I'm sure that most hadoop users aren't so lucky.
> 
>  
> 
> Here is the question: how secure is hadoop? (or, let's say, how foolproof?)

Right now hadoop is about as secure as NFS. When deployed onto private 
datacentres with good physical security and well-set-up networks, you 
can control who gets at the data. Without that, you are sharing your 
state with anyone who can issue HTTP and hadoop IPC requests.
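To make that exposure concrete, here is a minimal Python sketch. The host name is hypothetical; the ports are the stock web-UI ports of Hadoop daemons of that era (NameNode 50070, JobTracker 50030), which answer plain, unauthenticated HTTP to anyone who can route to them:

```python
# Sketch of the exposure: Hadoop's daemon status pages serve
# unauthenticated HTTP on well-known default ports, so anyone who can
# reach those ports can browse cluster state. Host name is hypothetical.
import urllib.request

DEFAULT_UI_PORTS = {"namenode": 50070, "jobtracker": 50030}

def status_url(host, role="namenode"):
    """Build the URL of a daemon's unauthenticated web UI."""
    return "http://%s:%d/" % (host, DEFAULT_UI_PORTS[role])

if __name__ == "__main__":
    url = status_url("namenode.example.com")  # hypothetical host
    print(url)
    # Fetching it requires no credentials at all:
    # print(urllib.request.urlopen(url).read()[:200])
```

A firewall in front of these ports is the only thing standing between the cluster state and the open internet, which is exactly the point being made above.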

> 
>  
> 
> Here is the answer:
> http://www.google.com/search?q=Hadoop+Map/Reduce+Administration
> not quite.
> 

see also http://www.google.com/search?q=axis+happiness+page ; pages 
that we add for the benefit of the ops team end up sneaking out into 
the big net.

>  
> 
> What we're seeing here is an open hadoop cluster, where anyone who is
> capable of installing hadoop and changing his username to webcrawl can use
> their cluster and read their data, even though the firewall is properly
> configured and ports like ssh are filtered to outsiders. Once you've played
> enough with the data, you discover that you can submit jobs as well, and
> these jobs can execute shell commands. Which is very, very sad.
> 
>  
> 
> In my view, this significantly limits distributed hadoop applications, where
> part of your cluster may reside on EC2 or in another distant datacenter,
> since you always need to keep certain ports open to an array of IP addresses
> (if your instances are dynamic), which isn't acceptable if anyone in that IP
> range can connect to your cluster.

well, maybe that's a fault of EC2's architecture, in which a deployment 
request doesn't include a declaration of the network configuration?

> 
>  
> 
> Can we propose that the developers introduce some basic user management and
> access controls, to help hadoop take one step further towards being a
> production-quality system?
> 


Being an open source project, you can do more than propose: you can help 
build some basic user-management and access controls. As to "production 
quality", it is ready for production, albeit in locked-down datacentres, 
which is the primary deployment infrastructure of many of the active 
developers. As in most community-contributed open source projects, if 
you have specific needs beyond what the active developers need, you end 
up implementing them yourself.
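For what it's worth, one basic control does already exist: HDFS permission checking, switched on via `dfs.permissions`. A sketch of the relevant hadoop-site.xml fragment follows; note that because user identity is simply asserted by the client, this is advisory only and does not defeat the username-spoofing described earlier in this thread:

```xml
<!-- hadoop-site.xml (sketch): enable HDFS permission checks.
     Advisory only: identity is asserted by the client, so a user who
     can change their local username can still bypass it. -->
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
```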

The big issue with security is that it is all or nothing. Right now it 
is blatantly insecure, so you should not be surprised that anyone has 
access to your files. To actually lock it down, you would need to 
authenticate and possibly encrypt all communications; this adds a lot of 
overhead, which is why it will be avoided in the big datacentres. You 
also need to go to a lot of effort to make sure it is secure across the 
board, with no JSP pages providing accidental privilege escalation and 
no API calls letting you see stuff you shouldn't. It's not like a normal 
feature defect, where you can say "don't do that"; it's not so easy to 
validate using functional tests that exercise the expected uses of the 
code. This is why securing an application is such a hard thing to do.



-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
