hadoop-hdfs-issues mailing list archives

From "Adam Faris (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5569) WebHDFS should support a deny/allow list for data access
Date Wed, 04 Dec 2013 06:51:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838667#comment-13838667 ]

Adam Faris commented on HDFS-5569:

Hi Colin, just so you know, I'm not trying to push any agenda, and I know you guys do great
work for the Hadoop ecosystem at Cloudera.  But there are flaws in your statements that need
to be corrected, as others reading this at a later date will be confused.

I don't understand why you jump immediately to the assumption that IP spoofing is necessary
to break IP-based authentication. There are plenty of networks in the world that you can
join without any trouble. One example is a class B network such as 172.16.X.Y. If the administrator
tries to filter addresses 172.16.1.X but allow 172.16.2.X, it would be easy for an attacker
to reconfigure his IP from 172.16.1.X to 172.16.2.X using just ifconfig.

You didn't mention the netmask in your example, and the only netmask that makes sense
for your example is a /16.  If so, then it's fine to assign yourself an address from
172.16.1.x or 172.16.2.x because it's the same network space.  This is not IP spoofing but
merely assigning yourself an approved IP from the same network; therefore you are not
bypassing any security.
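
To make the /16 point concrete, here's a minimal sketch using the addresses from your
example, showing that both hosts reduce to the same network once the netmask is applied:

{code:java}
import java.net.InetAddress;
import java.nio.ByteBuffer;

public class SameSubnet {
    // Convert a dotted-quad IPv4 address to a 32-bit integer.
    static int toInt(String ip) throws Exception {
        return ByteBuffer.wrap(InetAddress.getByName(ip).getAddress()).getInt();
    }

    public static void main(String[] args) throws Exception {
        int mask = 0xFFFF0000;             // a /16 netmask
        int a = toInt("172.16.1.5");
        int b = toInt("172.16.2.5");
        // Masking keeps only the network bits: both reduce to 172.16.0.0,
        // so "moving" from one address to the other stays inside the network.
        System.out.println((a & mask) == (b & mask));  // prints: true
    }
}
{code}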

In many cases, we can also use "source routing" to get around IP-based restrictions. Keep
in mind, this does not require spoofing! Source routing allows the packet to specify its own
route through the network. This potentially allows you to reach destinations that you would
not otherwise be able to get to. Many routers now disable source-routed packets, but why open
a hole for those that do not?

This is handled by router configs, and "loose source record route" is blocked on modern networks.
This comment is not relevant to the changes I'm requesting for WebHDFS, since source routing
is controlled by network gear, as you acknowledged in your statement.

Now, let's consider spoofing itself. Successful IP spoofing often does not allow
the attacker to get back a response to his packets. However, that isn't necessarily needed
in this case, because there are many WebHDFS operations that delete files, damage data, etc.


There's more information on how to defeat IP-based filtering here: http://technet.microsoft.com/library/cc723706.aspx
It calls SNMP "a security disaster" partly because it often relies on IP-based filtering for
security. I don't think we should be trying to reproduce a security scheme that everyone agrees
is a disaster.

The technet document you referenced calls SNMP a disaster for a reason: it's UDP-based.  As
UDP doesn't have a connection handshake, the receiving system is of course going to trust
the source IP.  With TCP, which is what WebHDFS uses, host A can send a SYN packet with a spoofed
IP, but the SYN-ACK reply is going to go to the real IP of host G, which holds the
actual address that was spoofed by host A.  The three-way handshake will never complete, and
because the TCP connection fails, it's not possible to send a 'delete' or other simple
request to WebHDFS.  The referenced technet document is not relevant to this JIRA as it's
poking holes in UDP, not TCP.
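
As a hedged illustration of why the completed handshake matters: a WebHDFS delete is an
ordinary HTTP request (op=DELETE in the REST API), and it can only be sent over an
already-established TCP connection.  The hostname and path below are made up for the example:

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsDelete {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode host; 50070 was the default NN HTTP port of this era.
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/tmp/somefile?op=DELETE");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("DELETE");
        // getResponseCode() forces the request out, which in turn requires the
        // TCP three-way handshake to have succeeded first.
        System.out.println(conn.getResponseCode());  // 200 on success
    }
}
{code}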

Physical security of networks is often an issue. Many times open ethernet jacks are available
in an office or data center and you can get an IP address. Maybe even one that is inside the
various firewalls. This is why people use real security systems like Kerberos, Active Directory,

We are in agreement that physical network jacks should be secured, but they are not always
guaranteed to be secured.  But please understand: Kerberos/Active Directory is not authorization,
as Kerberos only gives us authentication. (See my bank teller example above.) Due to the common
practice of cross-realm trusts, we cannot just rely on limiting which networks can talk to
the KDC, as my corp TGT is going to be trusted by the "hadoop" realm and work just fine.  It's
this limitation in Kerberos that is why I'm requesting an 'allow/deny' list based on IP.

Regarding DNS: I've dealt with many clusters for which DNS lookup was a bottleneck. You may
argue that they should have configured DNS better. But regardless, a security scheme that
requires contacting DNS all the time would still cause significant regressions for those users.
See Daryn Sharp's patch for https://issues.apache.org/jira/browse/HDFS-3990, which was designed
partly to avoid unnecessary DNS lookups. There have been many other such patches, from people
at Yahoo and other companies.

True, and I recognize that as Cloudera is a consulting company, you guys see a lot of weird
stuff.  I'm not saying DNS lookups are never a bottleneck, just that there are already ways
of preventing the bottleneck with caching.  I did find HDFS-3990 an interesting read, and if
I'm reading the patch correctly, Daryn is building a 'cache' of hostnames and IPs in RAM
for the namenode.  Additionally, the first comment in HDFS-3990 states that the NN's
page load time was reduced to seconds with the name server caching daemon.  I've already mentioned
that the JVM itself can be configured to never release a hostname/IP mapping.  My point is
that it doesn't matter where the actual cache lives, just that having one helps, and now we
have one more cache available to use.  I think we are in agreement that while supporting
hostname lookups isn't cost-free, it isn't the end of the world.
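
For reference, here's a minimal sketch of the JVM knob I mentioned: the
networkaddress.cache.ttl security property, where -1 means cache successful lookups forever.
It can also be set in $JAVA_HOME/lib/security/java.security, and it must be set before the
JVM performs its first lookup:

{code:java}
import java.security.Security;

public class DnsCacheForever {
    public static void main(String[] args) {
        // -1 == never expire successful hostname -> IP cache entries.
        Security.setProperty("networkaddress.cache.ttl", "-1");
    }
}
{code}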

I think you should explain why the various alternatives people have offered here don't solve
your problem. It seems really easy to use httpfs (plus perhaps a proxy) to get filtering as
fine-grained as you want. The whole point of implementing httpfs was that the HTTP protocol
could easily be filtered, proxied, etc. by third-party tools. If there are use cases that
httpfs does not address, let's fix them rather than creating another parallel security system
that does not follow best practices.

Look above and you'll see I already explained why bolting nginx on top of the datanode jetty
port or using httpfs doesn't solve this problem.  To summarize:
# Proxies add complexity to troubleshooting client/server issues.
# Tomcat/nginx is not part of the normal Hadoop ecosystem; it's one more software service to securely configure, deploy, monitor, support, and integrate with WebHDFS.
# Why replace jetty with something else when jetty already offers the features I'm requesting? One just needs to add the jetty hooks into WebHDFS (see the sketch after this list).
# Using a proxy for access control when no other proxy features are needed is like using Hadoop to process a 100kB text file: it's overkill.
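
As a sketch of what I mean by 'jetty hooks', assuming a Jetty version that ships
org.eclipse.jetty.server.handler.IPAccessHandler (the address ranges and the wrapped
handler below are hypothetical):

{code:java}
import org.eclipse.jetty.server.Handler;
import org.eclipse.jetty.server.handler.IPAccessHandler;

public class WebHdfsAccessControl {
    // Wrap an existing handler with allow/deny rules, in the spirit of
    // mod_authz_host.  The patterns and wrapped handler are illustrative only.
    static Handler withIpFiltering(Handler webHdfsHandler) {
        IPAccessHandler access = new IPAccessHandler();
        access.addWhite("172.16.2.0-255");   // allow list entry
        access.addBlack("172.16.1.0-255");   // deny list entry
        access.setHandler(webHdfsHandler);   // wrap whatever handler serves /webhdfs
        return access;
    }
}
{code}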

FYI: For those not familiar with what I'm requesting, it's adding an access control feature
like mod_authz_host in httpd. http://httpd.apache.org/docs/2.2/mod/mod_authz_host.html

> WebHDFS should support a deny/allow list for data access
> --------------------------------------------------------
>                 Key: HDFS-5569
>                 URL: https://issues.apache.org/jira/browse/HDFS-5569
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: webhdfs
>            Reporter: Adam Faris
>              Labels: features
> Currently we can't restrict what networks are allowed to transfer data using WebHDFS.
> Obviously we can use firewalls to block ports, but this can be complicated and problematic
> to maintain.  Additionally, because all the jetty servlets run inside the same container,
> blocking access to jetty to prevent WebHDFS transfers also blocks the other servlets running
> inside that same jetty container.
> I am requesting a deny/allow feature be added to WebHDFS.  This is already done with
> the Apache HTTPD server, and is what I'd like to see the deny/allow list modeled after.
