hadoop-common-dev mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Design for security in Hadoop
Date Fri, 20 Mar 2009 14:15:43 GMT
Amandeep Khurana wrote:
> Thanks for the feedback Steve.
> 
> My response on the points that you have mentioned are written inline below.
> 
> Amandeep
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> On Thu, Mar 19, 2009 at 4:31 AM, Steve Loughran <stevel@apache.org> wrote:
> 
>> Amandeep Khurana wrote:
>>
>>> Apparently, the file attached was striped off. Here's the link for where
>>> you
>>> can get it:
>>> http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
>>>
>>> Amandeep
>>>
>>>
>>>
>> This is a good paper with test data to go alongside the theory
>> Introduction
>> ========
>> -I'd cite NFS as a good equivalent design, the same "we trust you to be who
>> you say you are" protocol, similar assumptions about the network ("only
>> trusted machines get on it")
>> -If EC2 does not meet these requirements, you could argue it's the fault of
>> EC2; there's no fundamental reason why it can't offer private VPNs for
>> clusters the way other infrastructure (VMWare) can
>> -the whoami call is done by the command line client; different clients
>> don't even have to do that. Mine doesn't.
>> -it is not the "superuser" in the unix sense, "root", that runs jobs, it is
>> whichever user started hadoop on that node. It can still be a locked down
>> user with limited machine rights.
> 
> 
> I'll look into the NFS security stuff in detail and then add it later.


The key point about NFS security is that there was none: in the early 
eighties, the idea of a Linux laptop getting onto your wifi network was 
not conceivable, so you really could trust workstations. It was only 
with PC-NFS that the assumptions started to fail.

> 
> Where did EC2 come into the picture?

It's an example of a place where Hadoop is deployed and where the 
assumptions that only trusted users have network access (and/or that 
only fixed IP addresses can join the cluster) don't hold.

> 
> Yes, the whoami call can be bypassed; that's why the whole scheme is built
> around authentication.
> 
> By superuser, I meant the user who starts the hadoop instance... Will make
> it clearer in the writing.

OK
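
To make that concrete, here is a minimal sketch of why client-asserted 
identity is no defence. It assumes a pre-security 0.x-era client where 
the identity shipped to the namenode is simply whatever the process 
claims; the hadoop.job.ugi key and its "user,group" value format are 
illustrative and may differ between versions:

// Sketch only: client-asserted identity is not authentication.
// The hadoop.job.ugi key and "user,group" format are assumptions about
// the pre-security client; check your version before relying on them.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SpoofedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");
    // Claim to be another user; nothing on the wire verifies this.
    conf.set("hadoop.job.ugi", "someoneelse,supergroup");
    FileSystem fs = FileSystem.get(conf);
    // Operations now run with whatever rights 'someoneelse' has.
    fs.listStatus(new Path("/user/someoneelse"));
  }
}

Nothing at the far end verifies the claim, which is exactly the gap an 
authentication scheme has to close.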

> 
> 
>>
>> Attacks
>> ====
>> Add
>>  -unauthorised nodes spoofing other IP addresses (via ARP attacks) and
>> becoming nodes in the cluster. You could acquire and then keep or destroy
>> data, or pretend to do work and return false values. Or come up as a spoof
>> namenode or datanode and disrupt all work.
>> -denial of service attacks: too many heartbeats, etc
>> -spoof clients running malicious code on the tasktrackers.
> 
> 
> I haven't looked at these attacks. This paper is not focussing on them. They
> can definitely be looked at and incorporated at a later stage. Let's go step
> by step. (Debatable)

I was just broadening the list of attacks. A spoofed node joining the 
cluster is something to fear.
> 
>>
>> Protocol
>> ======
>> -SSL does need to deal with trust; unless you want to pay for every server
>> certificate (you may be able to share them), you'll need to set up your own
>> CA and issue private certs -leaving you with the problem of securely
>> distributing the CA public keys and getting SSL private keys out to nodes
>> securely (and making sure nothing on the net can use your kickstart server
>> to boot a VM with the same MAC address as a trusted server just to get at
>> those keys)
> 
> 
> SSL is a possible solution but the details aren't the focus of this design.
> Regarding the other keys, there is a format around which they are created
> and you don't need a CA for that.
> 
> 
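
For what the in-house CA route means on each node, here is a rough 
sketch; the file names and passwords are made up, and the point is that 
both the node's private keystore and the truststore holding the cluster 
CA certificate have to reach the box securely before any of this works:

// Sketch only: loading cluster-private SSL material on a node, assuming
// an in-house CA issued the certs and the files were pushed out of band.
// Paths and passwords below are illustrative.
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class NodeSsl {
  public static SSLContext build() throws Exception {
    char[] pw = "changeit".toCharArray();

    // The node's own private key and certificate, signed by the cluster CA.
    KeyStore keys = KeyStore.getInstance("JKS");
    keys.load(new FileInputStream("/etc/hadoop/ssl/node.jks"), pw);
    KeyManagerFactory kmf =
        KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
    kmf.init(keys, pw);

    // Truststore containing only the cluster CA's public certificate;
    // distributing this file securely is the hard part mentioned above.
    KeyStore trust = KeyStore.getInstance("JKS");
    trust.load(new FileInputStream("/etc/hadoop/ssl/cluster-ca.jks"), pw);
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init(trust);

    SSLContext ctx = SSLContext.getInstance("TLS");
    ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
    return ctx;
  }
}
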
>>
>> -I'll have to get somebody who understands security protocols to review the
>> paper. One area I'd flag as trouble is that on virtual machines, clock drift
>> can be choppy and non-linear. You also have to worry about clients not being
>> in the right time zone. It is good for everything to work off one clock (say
>> the namenode) rather than their own. Amazon's S3 authentication protocol has
>> this bug, as do the bits of WS-DM which take absolute times rather than
>> relative ones (presumably to make operations idempotent). At the very least,
>> the namenode needs an operation to return its current time, which callers
>> can then work off.
> 
> 
> The time issue is definitely a concern and has to be somehow cracked. The
> namenode giving its time is a good idea. But the sync would still be
> important. There is a way to sync the time across the cluster. I don't
> remember it clearly, but I have it on my "little" cluster. I'll look that
> up.
> 

NTP is the normal protocol; everyone tries to use it. But asking the NN 
for its clock would avoid having to rely on everything being in sync at 
the OS level, and would let the client detect when its clock had drifted 
too far off for a conversation. One recurrent problem of mine is 
machines that are on NTP but whose time zones are wrong; they are 
perfectly accurate to the second but 8 hours out.
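
Something like the sketch below is what I have in mind; the 
getCurrentTime() interface is hypothetical (no such namenode RPC exists 
today), it just shows a client working off the NN's clock and bailing 
out when its own clock has drifted too far:

// Sketch only: getCurrentTime() is a hypothetical namenode operation.
// Shows a client computing its offset from the NN's clock and refusing
// to continue when the skew is beyond a tolerance.
public class ClockCheck {
  /** Hypothetical remote interface exposing the namenode's clock. */
  interface NamenodeClock {
    long getCurrentTime(); // milliseconds since the epoch, per the NN
  }

  static final long MAX_SKEW_MS = 5 * 60 * 1000; // tolerate five minutes

  /** Returns the offset to add to local time, or throws if drift is too large. */
  static long offsetFrom(NamenodeClock nn) {
    long before = System.currentTimeMillis();
    long remote = nn.getCurrentTime();
    long after = System.currentTimeMillis();
    long localMid = (before + after) / 2;   // crude round-trip compensation
    long offset = remote - localMid;
    if (Math.abs(offset) > MAX_SKEW_MS) {
      throw new IllegalStateException("Clock skew too large: " + offset + " ms");
    }
    return offset;
  }
}

Any timestamp the client then puts on the wire would be its local clock 
plus that offset, so a box whose time zone is set eight hours wrong 
still talks sensibly to the namenode.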

-steve
