hadoop-common-issues mailing list archives

From "Xianqing Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8803) Make Hadoop running more secure public cloud envrionment
Date Fri, 14 Sep 2012 01:12:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455509#comment-13455509 ]

Xianqing Yu commented on HADOOP-8803:

Hi Luke,

1. No. The more restrictive HDFS Delegation Token and Block Token are used for byte-range access
control, and the new Block Token reduces the damage when a Block Token key is compromised. As
Owen said, I am thinking of putting both the file-level check and the byte-level check in
the configuration file as options, so users can decide which level of security they want and which
type of check is compatible with their code. I would like to test those kinds of jobs. Do you guys
have any examples of this kind of code that I can try to run?
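
To make the two check levels concrete, here is a minimal sketch in Java. It is not code from my prototype or from Hadoop: the class `ByteRangeCheckSketch`, the enum `CheckLevel`, and the `RangeGrant` type are hypothetical names used only for illustration, and the configuration option is modeled as a plain enum value.

```java
/** Hypothetical sketch of the two check levels; these are not Hadoop classes. */
final class ByteRangeCheckSketch {
    /** Assumed configuration values: which check the cluster enforces. */
    enum CheckLevel { FILE, BYTE_RANGE }

    /** A grant covering bytes [start, start + length) of one file. */
    static final class RangeGrant {
        final String path;
        final long start;
        final long length;

        RangeGrant(String path, long start, long length) {
            if (start < 0 || length < 0) {
                throw new IllegalArgumentException("start and length must be non-negative");
            }
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Returns true if reading len bytes at offset is allowed under the given level. */
    static boolean isAllowed(CheckLevel level, RangeGrant grant,
                             String path, long offset, long len) {
        if (!grant.path.equals(path)) {
            return false; // both levels require the right file
        }
        if (level == CheckLevel.FILE) {
            return true; // file-level check: any byte of the file is fine
        }
        if (offset < grant.start || len < 0) {
            return false;
        }
        // Written as a subtraction so that offset + len cannot overflow.
        return len <= grant.length - (offset - grant.start);
    }
}
```

Under these assumptions, a read of 40 bytes at offset 120 against a grant for bytes 100 to 149 passes the file-level check but fails the byte-level one. That is the kind of difference a job would need to be tested against.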

2. Yes, I use a unique key. Right, extra Block Tokens are needed: each Block Token can only
be used for one datanode. For example, if I want to access data stored on datanodes A
and B, then the Namenode needs to generate two Block Tokens and send them to me. This is the largest
extra overhead in my design. But I think (please correct me if I am wrong) that in original Hadoop,
when a job is running, the Namenode needs to generate a Block Token whenever
a task process needs access. That means that for one job, the Namenode performs as many
Block Token generation operations as there are mappers. So in my work, the extra workload
only occurs when one mapper needs to access data that is on more than one datanode,
and I don't think that always happens.
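
The accounting above can be sketched as a small Java calculation. This only models my own estimate (one token generation per mapper in the original design, one per distinct datanode per mapper in mine). It is not a measurement, and `TokenOverheadSketch` and its methods are hypothetical names. It also assumes every mapper reads from at least one datanode.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical model of the token-generation count described above. */
final class TokenOverheadSketch {
    /** Assumed baseline: one Block Token generation per mapper. */
    static int baselineTokenOps(List<List<String>> datanodesPerMapper) {
        return datanodesPerMapper.size();
    }

    /** Per-datanode tokens: one generation per distinct datanode each mapper reads from. */
    static int perDatanodeTokenOps(List<List<String>> datanodesPerMapper) {
        int total = 0;
        for (List<String> datanodes : datanodesPerMapper) {
            total += new HashSet<>(datanodes).size();
        }
        return total;
    }

    /** Extra generations caused by mappers that touch more than one datanode. */
    static int extraTokenOps(List<List<String>> datanodesPerMapper) {
        return perDatanodeTokenOps(datanodesPerMapper) - baselineTokenOps(datanodesPerMapper);
    }
}
```

For three mappers reading from datanodes (A), (A, B), and (B), this model gives 3 generations in the baseline and 4 with per-datanode tokens, so the extra cost comes only from the second mapper.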

Another argument is that sharing the same key across the whole HDFS cluster is too risky. This
overhead is a price Hadoop has to pay.
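
To show why a per-datanode key limits the damage, here is a minimal Java sketch. It is an illustration under assumptions, not my prototype and not Hadoop's Block Token code: `PerDatanodeTokenSketch`, the token payload (user plus block ID), and the choice of HmacSHA256 are all hypothetical. It only authenticates the token and does not encrypt the block ID.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical Namenode-side issuer that keeps one secret key per datanode. */
final class PerDatanodeTokenSketch {
    private static final String ALGORITHM = "HmacSHA256"; // assumed algorithm

    private final Map<String, byte[]> keyByDatanode = new HashMap<>();

    void registerDatanode(String datanodeId, byte[] key) {
        keyByDatanode.put(datanodeId, key.clone());
    }

    /** Issues a token that only the named datanode's key can verify. */
    byte[] issue(String datanodeId, String user, long blockId) {
        byte[] key = keyByDatanode.get(datanodeId);
        if (key == null) {
            throw new IllegalArgumentException("unknown datanode: " + datanodeId);
        }
        return sign(key, user, blockId);
    }

    /** Datanode-side check: recompute the MAC with the datanode's own key. */
    static boolean verify(byte[] datanodeKey, String user, long blockId, byte[] token) {
        return MessageDigest.isEqual(sign(datanodeKey, user, blockId), token);
    }

    private static byte[] sign(byte[] key, String user, long blockId) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal((user + ":" + blockId).getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }
}
```

In this sketch a token issued for datanode A does not verify under datanode B's key, so an attacker who steals B's key cannot forge tokens that A would accept. With one shared master key, the same theft would allow forging tokens for every datanode.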

3. It is a very interesting question. In the security area, I think it is really hard to find
a perfect security solution, but we can always find a better way. I do love the way we can discuss
various possibilities here. Back to your question: zero-day breaches are a really big threat, and
the risk depends on a lot of things, most of which, as you said, are beyond Hadoop itself.
TT/DN may have the same OS/software version. However, if Hadoop is running in a public cloud,
they may be running under different cloud providers, the OS may differ, and the people who
maintain those machines may be different.

> Make Hadoop running more secure public cloud envrionment
> --------------------------------------------------------
>                 Key: HADOOP-8803
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8803
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, ipc, security
>    Affects Versions:
>            Reporter: Xianqing Yu
>              Labels: hadoop
>   Original Estimate: 2m
>  Remaining Estimate: 2m
> I am a Ph.D. student at North Carolina State University. I am modifying Hadoop's code
(including most parts of Hadoop, e.g. JobTracker, TaskTracker, NameNode, DataNode) to
achieve better security.
> My major goal is to make Hadoop run more securely in the Cloud environment, especially
in the public Cloud environment. To achieve that, I redesigned the current security
mechanism to achieve the following properties:
> 1. Bring byte-level access control to Hadoop HDFS. In 0.20.204, HDFS access control
works at user or block granularity, e.g. the HDFS Delegation Token only checks whether the file can
be accessed by a certain user, and the Block Token only proves which block or blocks can be accessed.
I make Hadoop do byte-granularity access control, so that each accessing party, user or task process,
can only access the minimum bytes it needs.
> 2. I assume that in the public Cloud environment, only the Namenode, secondary Namenode, and
JobTracker can be trusted. A large number of Datanodes and TaskTrackers may be compromised because
some of them may be running in less secure environments. So I redesigned the security mechanism
to minimize the damage a hacker can do.
> a. Redesign the Block Access Token to solve the widely shared key problem of HDFS. In the original
Block Access Token design, all of HDFS (Namenode and Datanodes) shares one master key to generate
Block Access Tokens. If one DataNode is compromised by a hacker, the hacker can get the key and
generate any Block Access Token he or she wants.
> b. Redesign the HDFS Delegation Token to do fine-grained access control for TaskTracker
and Map-Reduce Task processes on HDFS.
> In Hadoop 0.20.204, all TaskTrackers can use their Kerberos credentials to access
any MapReduce files on HDFS. So they have the same privilege as the JobTracker to read
or write tokens, copy job files, etc. However, if one of them is compromised, every critical
thing in the MapReduce directory (job file, Delegation Token) is exposed to the attacker. I solve
the problem by making the JobTracker decide which TaskTracker can access which file in the MapReduce
directory on HDFS.
> For a Task process, once it gets an HDFS Delegation Token, it can access everything belonging
to this job or user on HDFS. With my design, it can only access the bytes it needs from HDFS.
> There are some other security improvements, such as the TaskTracker not being able to learn some
information like the blockID from the Block Token (because it is encrypted in my design), and HDFS
being able to set up a secure channel to send data as an option.
> With those features, Hadoop can run much more securely in uncertain environments such as the public
Cloud. I have already started to test my prototype. I want to know whether the community is interested
in my work. Is it valuable work to contribute to production Hadoop?

