hadoop-common-issues mailing list archives

From "Yi Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10150) Hadoop cryptographic file system
Date Tue, 25 Mar 2014 17:38:30 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946828#comment-13946828
] 

Yi Liu commented on HADOOP-10150:
---------------------------------

Thanks [~tucu00] for your comment.
We are less concerned with internal use by the HDFS client; on the contrary, we care more about
making encrypted data easy for clients to work with. That said, we found that webhdfs should also
use DistributedFileSystem to remove the symlink issue described in HDFS-4933 (the issue we found is
“Throwing UnresolvedPathException when getting HDFS symlink file through HDFS REST API”, and there
are no “statistics” for the HDFS REST API, which is inconsistent with the behavior of
DistributedFileSystem; we assume that JIRA will resolve it).

“Transparent” or “at rest” encryption usually means that the server handles encrypting
data for persistence, but does not manage keys for particular clients or applications, nor
require applications to even be aware that encryption is in use; hence it can be described
as transparent. This type of solution distributes secret keys within the secure enclave (not
to clients), or might employ a two-tier key architecture (data keys wrapped by the cluster
secret key), typically with keys managed per application, e.g. per table in a database system.
The goal here is to avoid data leakage from the server by universally encrypting data
“at rest”.
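
To make the two-tier idea concrete, here is a minimal JCE sketch (the key names and sizes are illustrative assumptions, not part of either proposal): the cluster secret key wraps per-application data keys, so only wrapped keys are ever persisted alongside the data.
{code:java}
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Illustrative two-tier key architecture: a per-application (e.g. per-table)
// data key is wrapped by the cluster secret key before being persisted.
public class KeyWrapSketch {
  public static void main(String[] args) throws Exception {
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(128);
    SecretKey clusterKey = kg.generateKey(); // stays inside the secure enclave
    SecretKey dataKey = kg.generateKey();    // e.g. one per table

    Cipher wrap = Cipher.getInstance("AESWrap");
    wrap.init(Cipher.WRAP_MODE, clusterKey);
    byte[] wrappedDataKey = wrap.wrap(dataKey); // safe to store next to the data

    Cipher unwrap = Cipher.getInstance("AESWrap");
    unwrap.init(Cipher.UNWRAP_MODE, clusterKey);
    SecretKey recovered =
        (SecretKey) unwrap.unwrap(wrappedDataKey, "AES", Cipher.SECRET_KEY);
  }
}
{code}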

Other cryptographic application architectures handle use cases where clients or applications
want to protect data with encryption from other clients or applications. For those use cases
encryption and decryption are done on the client, and the scope of key sharing should be minimized
to where the cryptographic operations take place. In this type of solution the server becomes
an unnecessary central point of compromise for user or application keys, so sharing keys there
should be avoided. This isn’t really an “at rest” solution: the client may or may not choose
to encrypt, and because key sharing is minimized, the server cannot and should not be able to
distinguish encrypted data from random bytes, so it cannot guarantee that all persisted data
is encrypted.

Therefore we have two different types of solutions useful for different reasons, with different
threat models. Combinations of the two must be carefully done (or avoided) so as not to end
up with something combining the worst of both threat models.

HDFS-6134 and HADOOP-10150 are orthogonal and complementary solutions when viewed in this
light. HDFS-6134, as described at least by the JIRA title, wants to introduce transparent
encryption within HDFS. In my opinion, it shouldn’t attempt “client side encryption on
the server” for reasons mentioned above. HADOOP-10150 wants to make management of partially
encrypted data easy for clients, for the client side encryption use cases, by presenting a
filtered view over base Hadoop filesystems like HDFS.

{quote} in the "Storage of IV and data key" is stated "So we implement extended information
based on INode feature, and use it to store data key and IV. "{quote}
We assume HDFS-2006 (xattrs) could help here; that’s why we posted separate patches. In the CFS
patch this is decoupled from the underlying filesystem if xattrs are present, and it can be the
end user’s choice whether to store a key alias or the data encryption key itself.
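
As a rough sketch of that decoupling, assuming the HDFS-2006 FileSystem xattr API and made-up attribute names, the crypto metadata could be attached per file like this:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: keep crypto metadata in extended attributes (HDFS-2006)
// so CFS stays decoupled from the underlying filesystem's INode format.
public class CfsXAttrSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/enc-zone/data.txt");

    // End user's choice: record only a key alias, or the (wrapped) data key itself.
    fs.setXAttr(file, "user.cfs.key.alias", "table1-key".getBytes("UTF-8"));
    fs.setXAttr(file, "user.cfs.iv", new byte[16]);   // per-file IV

    byte[] alias = fs.getXAttr(file, "user.cfs.key.alias");
    byte[] iv = fs.getXAttr(file, "user.cfs.iv");
  }
}
{code}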

{quote}(Mentioned before), how thing flush() operations will be handled as the encryption
block will be cut short? How this is handled on writes? How this is handled on reads?{quote}
For hflush and hsync it is actually very simple. In the cryptographic output stream of CFS we
buffer the plain text and only encrypt once the data size reaches the buffer length, to improve
performance. So for hflush/hsync we just need to flush the buffer and do the encryption immediately,
and then call FSDataOutputStream.hflush/hsync, which handles the rest.
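
A minimal sketch of that flow (the class and buffer size are illustrative, not the actual CFS code), assuming an AES-CTR javax.crypto.Cipher wrapping an FSDataOutputStream:
{code:java}
import java.io.IOException;
import javax.crypto.Cipher;
import org.apache.hadoop.fs.FSDataOutputStream;

// Hypothetical sketch of the hflush handling described above: plaintext is
// buffered for performance, and a flush forces encryption of the partial
// buffer before delegating to the underlying stream.
class CryptoOutputStreamSketch {
  private final FSDataOutputStream out;   // wrapped DFS output stream
  private final Cipher cipher;            // AES/CTR/NoPadding, already initialized
  private final byte[] buffer = new byte[8192];
  private int buffered = 0;

  CryptoOutputStreamSketch(FSDataOutputStream out, Cipher cipher) {
    this.out = out;
    this.cipher = cipher;
  }

  void write(byte[] b, int off, int len) throws IOException {
    // Buffer plaintext; encrypt whenever the buffer fills up.
    while (len > 0) {
      int n = Math.min(len, buffer.length - buffered);
      System.arraycopy(b, off, buffer, buffered, n);
      buffered += n; off += n; len -= n;
      if (buffered == buffer.length) {
        encryptBuffer();
      }
    }
  }

  void hflush() throws IOException {
    encryptBuffer();   // encrypt whatever is buffered right now
    out.hflush();      // then let the wrapped stream handle durability
  }

  private void encryptBuffer() throws IOException {
    if (buffered > 0) {
      // CTR is a stream cipher, so update() emits exactly 'buffered' bytes.
      out.write(cipher.update(buffer, 0, buffered));
      buffered = 0;
    }
  }
}
{code}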

{quote}Still, it is not clear how transparency will be achieved for existing applications:
HDFS URI changes, clients must connect to the Key store to retrieve the encryption key (clients
will need key store principals). The encryption key must be propagated to jobs tasks (i.e.
Mapper/Reducer processes){quote}
There is no URI change; please see the latest design doc and test cases.
We have considered HADOOP-9534 and HADOOP-10141; encryption of key material can be handled
by the key provider implementation according to the customer's environment.
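
For illustration only, a client-side lookup through the HADOOP-10141 KeyProvider API might look roughly like this (the key name is made up, it assumes a provider is configured, and the exact API was still evolving at the time):
{code:java}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.key.KeyProvider;
import org.apache.hadoop.crypto.key.KeyProviderFactory;

// Hypothetical sketch: resolve a data encryption key through the pluggable
// KeyProvider API (HADOOP-10141); the provider decides how key material is
// protected (JKS file, KMS, HSM, ...) to suit the customer's environment.
public class KeyLookupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    List<KeyProvider> providers = KeyProviderFactory.getProviders(conf);
    KeyProvider provider = providers.get(0);   // assumes one is configured

    KeyProvider.KeyVersion kv = provider.getCurrentKey("cfs-table1-key");
    byte[] keyMaterial = kv.getMaterial();     // fed to the AES-CTR cipher
  }
}
{code}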

{quote}Use of AES-CTR (instead of an authenticated encryption mode such as AES-GCM){quote}
AES-GCM introduces additional CPU cycles for GHASH: roughly 2.5x additional cycles on Sandy Bridge
and Ivy Bridge, and 0.6x additional cycles on Haswell. Data integrity is already ensured by the
underlying filesystem (such as HDFS) in this scenario, so we decided to use AES-CTR for best performance.
Furthermore, AES-GCM mode is not available as a JCE cipher in Java 6. Java 6 may be EOL, but plenty
of Hadoopers are still running it. It's not even listed in the Java 7 Sun provider documentation
(http://docs.oracle.com/javase/7/docs/technotes/guides/security/SunProviders.html), though that
may be an omission.
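
As a rough sketch of why CTR fits here, AES-CTR is available as a plain JCE cipher and supports random access by advancing the counter, which is what a seekable/pread-capable input stream needs (the IV arithmetic below is a generic illustration, not the exact CFS layout):
{code:java}
import java.math.BigInteger;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Generic illustration: AES/CTR decryption that starts at an arbitrary byte
// offset by advancing the 128-bit counter, so seek()/pread never require
// decrypting from the beginning of the file.
public class CtrSeekSketch {
  static Cipher cipherAt(byte[] key, byte[] baseIv, long byteOffset) throws Exception {
    long blockIndex = byteOffset / 16;           // AES block size
    byte[] sum = new BigInteger(1, baseIv)
        .add(BigInteger.valueOf(blockIndex)).toByteArray();
    byte[] counter = new byte[16];               // left-pad/truncate to 16 bytes
    int copy = Math.min(sum.length, 16);
    System.arraycopy(sum, sum.length - copy, counter, 16 - copy, copy);

    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new IvParameterSpec(counter));
    // Discard the partial-block prefix so the keystream lines up with the offset.
    int partial = (int) (byteOffset % 16);
    if (partial > 0) {
      c.update(new byte[partial]);
    }
    return c;
  }
}
{code}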

{quote}By looking at the latest design doc of HADOOP-10150 I can see that things have been
modified a bit (from the original design doc) bringing it a bit closer to some of the HDFS-6134
requirements.{quote}
Actually we had designed it this way well before the doc was updated; just look at the patch.

{quote}Definitely, I want to work together with you guys to leverage as much as posible. Either
by unifying the 2 proposal or by sharing common code if we think both approaches have merits
and we decide to move forward with both.{quote}
I agree.

{quote}Restrictions of move operations for files within an encrypted directory. The original
design had something about it (not entirely correct), now is gone{quote}
Rename is an atomic operation in Hadoop, so we only allow a move between one directory/file and
another if they share the same data key; then no decryption is required. Please see my Mar 21 patch.
Actually we had not mentioned rename in the earlier doc; we only discussed it in the review comments,
since @Steve had the same questions, and we covered it in the discussion with him.
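
A hedged sketch of that rename check (the attribute name and helper are made up): the filtered filesystem only delegates the atomic rename when source and destination resolve to the same data key, so the ciphertext stays valid without re-encryption.
{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: allow rename only within the same encryption context,
// i.e. when source and destination directories share the same data key.
class CfsRenameCheckSketch {
  static boolean rename(FileSystem fs, Path src, Path dst) throws IOException {
    byte[] srcKey = fs.getXAttr(src.getParent(), "user.cfs.key.alias");
    byte[] dstKey = fs.getXAttr(dst.getParent(), "user.cfs.key.alias");
    if (!Arrays.equals(srcKey, dstKey)) {
      throw new IOException("Rename across different data keys is not allowed");
    }
    return fs.rename(src, dst);  // safe: ciphertext stays valid under the same key
  }
}
{code}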

{quote}Explicit auditing on encrypted files access does not seem handled{quote}
The auditing could be another topic we need to address especially when discussing the client
side encryption. One possible way is to add a pluggable point that customer can route audit
event to their existing auditing system. On that above points discussion conclusion we think
on this point later.
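
One possible shape for such a pluggable point, purely as a sketch (the interface is invented): CFS would emit an event per encrypted-file access and let the deployment forward it to its existing auditing system.
{code:java}
// Hypothetical sketch of a pluggable audit hook: CFS would call this on each
// access to an encrypted file, and deployments supply an implementation that
// routes events into their existing auditing system.
public interface CfsAuditHook {
  /**
   * @param user   the caller's short user name
   * @param path   the encrypted file that was accessed
   * @param action e.g. "open", "create", "rename"
   */
  void logAccess(String user, String path, String action);
}
{code}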


> Hadoop cryptographic file system
> --------------------------------
>
>                 Key: HADOOP-10150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10150
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 3.0.0
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>              Labels: rhino
>             Fix For: 3.0.0
>
>         Attachments: CryptographicFileSystem.patch, HADOOP cryptographic file system-V2.docx,
> HADOOP cryptographic file system.pdf, cfs.patch, extended information based on INode feature.patch
>
>
> There is an increasing need for securing data when Hadoop customers use various upper
> layer applications, such as Map-Reduce, Hive, Pig, HBase and so on.
> HADOOP CFS (HADOOP Cryptographic File System) is used to secure data, based on HADOOP
> “FilterFileSystem” decorating DFS or other file systems, and transparent to upper layer
> applications. It’s configurable, scalable and fast.
> High level requirements:
> 1.	Transparent to and no modification required for upper layer applications.
> 2.	“Seek”, “PositionedReadable” are supported for input stream of CFS if the
> wrapped file system supports them.
> 3.	Very high performance for encryption and decryption, they will not become bottleneck.
> 4.	Can decorate HDFS and all other file systems in Hadoop, and will not modify existing
> structure of file system, such as namenode and datanode structure if the wrapped file system
> is HDFS.
> 5.	Admin can configure encryption policies, such as which directory will be encrypted.
> 6.	A robust key management framework.
> 7.	Support Pread and append operations if the wrapped file system supports them.



