hadoop-mapreduce-issues mailing list archives

From "Jerry Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection
Date Wed, 17 Oct 2012 05:46:05 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477634#comment-13477634 ]

Jerry Chen commented on MAPREDUCE-4491:

Hi Benoy,
I am Haifeng from Intel, and we were discussing this feature offline. I really appreciate
your initiating this work. We also see the importance of encryption and decryption
in Hadoop when we are dealing with sensitive data.

Just as you pointed out, the functional requirements are more or less the same. For the Hadoop
community, we wish to get a high-level abstraction that provides a foundation for
these requirements in different Hadoop components (such as HDFS, MapReduce, HBase) while enabling
different implementations, such as different encryption algorithms or different ways of key
management by different parties / companies, so that the concept is not bound to a specific implementation.
Just as we discussed offline, the driving forces for such an abstraction are summarized as follows:

1. Encryption and decryption need to be supported in different components and usage models.
For example, we may use the HDFS client API and a codec directly to encrypt and decrypt an HDFS file;
we may use MapReduce to process an encrypted file and output an encrypted file; and
HBase may need to store its files (such as HFiles) in an encrypted way.

2. The community may have different implementations of encryption codecs and different ways
of providing keys. CompressionCodec provides us a foundation for related work. But CompressionCodec
is not enough for encryption and decryption, because CompressionCodec assumes it is initialized
from the Hadoop Configuration, while encryption/decryption may need a per-file crypto context
such as the key. With a crypto abstraction layer, we can share common features such
as "provide different keys for different input files of a MapReduce job" rather than have each
implementation go its own way in the MapReduce core and finally turn into a mess.

Based on these driving forces, the work you have done, and our offline discussions, we refined our
work and would like to propose the following:

1. For Hadoop common, a new CryptoCodec interface that extends CompressionCodec, adding
the methods getCryptoContext/setCryptoContext. Just like CompressionCodec, it will initialize
its global settings from Configuration. But CryptoCodec will receive its crypto context (the
key, for example) through a CryptoContext object set by setCryptoContext, allowing different
use cases such as "use CryptoCodec directly to encrypt/decrypt an HDFS file by directly providing
the CryptoContext (key)" or "the MapReduce way of using CryptoCodec, where a CryptoContext (key)
is chosen per file based on some policy".

Any specific crypto implementation falls under this umbrella and will implement CryptoCodec.
The PGPCodec fits well as an implementation of CryptoCodec. And we are also able
to implement our splittable CryptoCodec.
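
To make the shape of proposal 1 concrete, here is a minimal sketch, not the attached code. It assumes a stand-in StreamCodec interface in place of Hadoop's CompressionCodec so that it compiles against the JDK alone. The fields of CryptoContext, the AesCtrCodec class, and the encrypt/decrypt helpers are hypothetical names chosen for illustration. Only the getCryptoContext/setCryptoContext pair comes from the proposal above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CryptoCodecSketch {

    // Hypothetical per-file crypto context: here just an AES key and an IV.
    static final class CryptoContext {
        final SecretKey key;
        final byte[] iv;
        CryptoContext(SecretKey key, byte[] iv) {
            this.key = key;
            this.iv = iv.clone();
        }
    }

    // Assumption: stand-in for Hadoop's CompressionCodec, reduced to two stream factories.
    interface StreamCodec {
        OutputStream createOutputStream(OutputStream out) throws IOException;
        InputStream createInputStream(InputStream in) throws IOException;
    }

    // The proposed extension: the same stream factories plus a settable context.
    interface CryptoCodec extends StreamCodec {
        void setCryptoContext(CryptoContext context);
        CryptoContext getCryptoContext();
    }

    // Illustrative implementation using the JDK's AES in CTR mode.
    static final class AesCtrCodec implements CryptoCodec {
        private CryptoContext context;

        @Override public void setCryptoContext(CryptoContext context) { this.context = context; }
        @Override public CryptoContext getCryptoContext() { return context; }

        private Cipher newCipher(int mode) throws IOException {
            if (context == null) {
                throw new IllegalStateException("setCryptoContext must be called first");
            }
            try {
                Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
                cipher.init(mode, context.key, new IvParameterSpec(context.iv));
                return cipher;
            } catch (GeneralSecurityException e) {
                throw new IOException("Cannot initialize cipher", e);
            }
        }

        @Override public OutputStream createOutputStream(OutputStream out) throws IOException {
            return new CipherOutputStream(out, newCipher(Cipher.ENCRYPT_MODE));
        }

        @Override public InputStream createInputStream(InputStream in) throws IOException {
            return new CipherInputStream(in, newCipher(Cipher.DECRYPT_MODE));
        }
    }

    // Encrypts a byte array through the codec's output stream.
    static byte[] encrypt(CryptoCodec codec, byte[] plain) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(sink)) {
            out.write(plain);
        }
        return sink.toByteArray();
    }

    // Decrypts a byte array through the codec's input stream.
    static byte[] decrypt(CryptoCodec codec, byte[] encrypted) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(encrypted))) {
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                sink.write(buf, 0, n);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        CryptoCodec codec = new AesCtrCodec();
        // Demo-only fixed key and all-zero IV; real keys would come from a key store.
        byte[] keyBytes = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        codec.setCryptoContext(new CryptoContext(new SecretKeySpec(keyBytes, "AES"), new byte[16]));
        byte[] plain = "sensitive record".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = decrypt(codec, encrypt(codec, plain));
        System.out.println("round trip ok: " + Arrays.equals(plain, roundTrip));
    }
}
```

The point of the sketch is the separation: the codec class can still be created and configured globally, while the key arrives later through setCryptoContext, so the same codec instance can serve the "direct" and the "per file policy" use cases.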

2. For MapReduce, use a CryptoContextProvider interface to abstract implementation-specific
services, so that the MapReduce core can have shared code that retrieves the
CryptoContext of a specific file from a CryptoContextProvider and passes it to the CryptoCodec
in use. Different CryptoContextProvider implementations can implement different ways of
deciding the CryptoContext and different ways of retrieving keys from different key stores.
We can provide basic and common implementations of CryptoContextProvider, such as "a CryptoContextProvider
that provides the CryptoContext for a file by matching the file path against a regular expression and gets the
key from a Java KeyStore", while not preventing users from implementing or extending their own if
the existing implementations don't satisfy their requirements.

CryptoContextProvider configurations are passed through the Hadoop job configuration and credentials (credential
secret keys), and the implementation of CryptoContextProvider can choose whether or not to
encrypt the secret keys stored in the job Credentials.
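
The "regular expression plus Java KeyStore" provider described in proposal 2 could look roughly like the sketch below. This is an illustration under assumptions, not the attached code: the method name getCryptoContext(String), the RegexKeyStoreProvider class, the addRule helper, the key aliases, and the fields of CryptoContext are all hypothetical. It uses an in-memory JCEKS KeyStore from the JDK so that it runs without Hadoop; how the rules and the store password would travel in the job configuration and Credentials is left out.

```java
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class CryptoContextProviderSketch {

    // Hypothetical context: the key chosen for one file, plus the alias it came from.
    static final class CryptoContext {
        final String keyAlias;
        final Key key;
        CryptoContext(String keyAlias, Key key) {
            this.keyAlias = keyAlias;
            this.key = key;
        }
    }

    // Assumed shape of the provider: map a file path to its crypto context.
    interface CryptoContextProvider {
        CryptoContext getCryptoContext(String filePath) throws IOException;
    }

    // Picks a key alias by matching the path against regex rules, then reads the key.
    static final class RegexKeyStoreProvider implements CryptoContextProvider {
        // LinkedHashMap keeps insertion order, so the first matching rule wins.
        private final Map<Pattern, String> aliasByPattern = new LinkedHashMap<>();
        private final KeyStore keyStore;
        private final char[] password;

        RegexKeyStoreProvider(KeyStore keyStore, char[] password) {
            this.keyStore = keyStore;
            this.password = password.clone();
        }

        RegexKeyStoreProvider addRule(String pathRegex, String keyAlias) {
            aliasByPattern.put(Pattern.compile(pathRegex), keyAlias);
            return this;
        }

        @Override
        public CryptoContext getCryptoContext(String filePath) throws IOException {
            for (Map.Entry<Pattern, String> rule : aliasByPattern.entrySet()) {
                if (rule.getKey().matcher(filePath).matches()) {
                    String alias = rule.getValue();
                    try {
                        Key key = keyStore.getKey(alias, password);
                        if (key == null) {
                            throw new IOException("No key stored under alias " + alias);
                        }
                        return new CryptoContext(alias, key);
                    } catch (GeneralSecurityException e) {
                        throw new IOException("Cannot read key " + alias, e);
                    }
                }
            }
            throw new IOException("No crypto rule matches " + filePath);
        }
    }

    // Builds an in-memory JCEKS store holding two demo AES keys.
    static KeyStore demoKeyStore(char[] password) throws GeneralSecurityException, IOException {
        KeyStore ks = KeyStore.getInstance("JCEKS");
        ks.load(null, null);
        KeyStore.ProtectionParameter protection = new KeyStore.PasswordProtection(password);
        ks.setEntry("sales-key", new KeyStore.SecretKeyEntry(aesKey((byte) 1)), protection);
        ks.setEntry("hr-key", new KeyStore.SecretKeyEntry(aesKey((byte) 2)), protection);
        return ks;
    }

    // Demo-only key material: 16 identical bytes.
    private static SecretKey aesKey(byte fill) {
        byte[] raw = new byte[16];
        Arrays.fill(raw, fill);
        return new SecretKeySpec(raw, "AES");
    }

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray();
        CryptoContextProvider provider = new RegexKeyStoreProvider(demoKeyStore(password), password)
                .addRule("/data/sales/.*", "sales-key")
                .addRule("/data/hr/.*", "hr-key");
        System.out.println(provider.getCryptoContext("/data/sales/2012/part-00000").keyAlias);
    }
}
```

With this split, shared MapReduce code would only need to call the provider for each input or output file and hand the result to the codec; the policy and the key store stay behind the interface.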

I attached the Java files of these interfaces and basic structures in the Attachments section to
demonstrate the concepts, and I hope to have a design document for these high-level things
once we have had enough discussion and come to an agreement.

Again, thanks for your patience and time.

> Encryption and Key Protection
> -----------------------------
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
> When dealing with sensitive data, it is required to keep the data encrypted wherever
> it is stored. Common use case is to pull encrypted data out of a datasource and store in HDFS
> for analysis. The keys are stored in an external keystore.
> The feature adds a customizable framework to integrate different types of keystores,
> support for Java KeyStore, read keys from keystores, and transport keys from JobClient to
> The feature adds PGP encryption as a codec and additional utilities to perform encryption
> related steps.
> The design document is attached. It explains the requirement, design and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial work for
> further refinement.
> Update: The patches are uploaded to subtasks. 

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
