kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Data encryption in Kudu
Date Mon, 08 May 2017 19:51:58 GMT
On Fri, May 5, 2017 at 4:54 PM, Dan Burkert <dan@cloudera.com> wrote:

> On Tue, May 2, 2017 at 8:38 PM, Franco Venturi <fventuri@comcast.net>
> wrote:
>> Dan,
>> first of all thanks for reading through my long post and providing your
>> comments and advice.
>> You are 100% correct on the TDE column encryption in Oracle; I looked it
>> up again in the 'Introduction to Transparent Data Encryption' in the
>> 'Database Advanced Security Guide'
>> (https://docs.oracle.com/database/121/ASOAG/asotrans.htm#ASOAG10117)
>> and Figure 2-1 clearly shows the
>> keys being stored in the database.
>> With this piece of information, it doesn't seem to me that Oracle column
>> TDE offers much protection in case of an active attacker who has full
>> access to the DB server, since there must be a process somewhere where
>> the database engine is able to retrieve the decryption key for a given
>> column.
> Yes, but this could be in a hardware HSM.
>>  Another interesting piece of information in that chapter is this
>> sentence:
>>                 TDE tablespace encryption also allows index range scans
>> on data in encrypted tablespaces. This is not possible with TDE column
>> encryption.
>> which makes me think that TDE column encryption must encrypt the data
>> before placing it into the Btree, and therefore is not able to use the
>> Btree for range searches.
> That's my interpretation as well.
>> I think the main reason why an organization would want one or the other
>> type of encryption (client-side vs server-side) is what kind of possible
>> attack they are trying to prevent (and the criteria are often dictated by
>> internal security policies):
>>         - with server-side encryption, the encrypted data is protected
>> against a disk being lost (the so called 'encryption at rest'), but it is
>> not protected against an active attacker on the server with full access
>> (they could retrieve the key and then decrypt the data).
>>         - with client-side encryption, the server has no way to decrypt
>> the data and therefore even the active attacker above wouldn't be able to
>> do much with the encrypted data. As I mentioned in my previous post, this
>> is similar to what HDFS does for transparent data encryption and I think
>> it's one of their selling points ('not even root can decrypt the data on
>> HDFS'), and for some IT security groups this may sound attractive.
> Root privileges on a machine don't necessarily guarantee access to the
> key; the key could be stored remotely, or even on an HSM.
>> 100% agree with your performance concerns that client-side encryption
>> raises (no range scans on the encrypted columns, no compression, RLE, etc),
>> to the point that last night I wondered if other people have asked
>> themselves similar questions, and I did find a couple of interesting
>> approaches:
>>         - CryptDB (http://css.csail.mit.edu/cryptdb/ - the main paper is
>> here: http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf)
>>         - ZeroDB (https://opensource.zerodb.com/)
>> In order to support range scans, CryptDB for instance uses
>> 'Order Preserving Encryption', which in theory encrypts data in a
>> way that preserves ordering, i.e. Enc(x) < Enc(y) iff x < y; however,
>> several later research papers show that this Order Preserving
>> Encryption leaks a significant amount of information about the encrypted
>> data and is susceptible to frequency and other kinds of attacks. As you can
>> imagine there's a lot of academic research actively being done in this
>> field and, even if not ready for prime time, I thought I would share these
>> findings.
> That's really interesting.  Pretty different threat model being assumed by
> ZeroDB :).
>> After this long digression (hopefully not too boring), I agree that the
>> way forward would be to start with looking into the encryption of the file
>> store (I think they are called 'cfiles'; I also saw mentions of some
>> 'delta' files, and I am not sure if they are written the same way and
>> should be encrypted too), and after that the WALs.
> Yah, I think cfiles are a good place to start.  AFAIK delta files reuse
> the cfile machinery when writing to disk. I originally considered
> recommending looking at the filesystem block manager, but we often do
> offset lookups into the FS blocks, which I don't think could be supported
> with encryption.

I think it could be -- if you use CTR mode for encryption, you can support
random access, right?
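
A minimal sketch of why CTR mode allows offset lookups, using only the Python standard library. As an assumption for illustration, HMAC-SHA256 stands in for the AES block function, and the names `keystream_block` and `ctr_xor` are hypothetical; none of this is Kudu code. The point is that the keystream for byte N depends only on the key, the nonce, and N, so a reader can decrypt a slice without touching the bytes before it.

```python
import hashlib
import hmac

BLOCK = 32  # keystream bytes produced per counter value (SHA-256 digest size)


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for AES_k(nonce || counter); a real implementation would use AES.
    return hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes, offset: int = 0) -> bytes:
    """XOR `data` with the keystream starting at byte `offset` of the stream.

    Encryption and decryption are the same operation in CTR mode.
    """
    out = bytearray()
    pos = offset
    end = offset + len(data)
    while pos < end:
        # Locate the counter block covering `pos` and how far into it we are.
        counter, skip = divmod(pos, BLOCK)
        ks = keystream_block(key, nonce, counter)[skip:]
        chunk = data[pos - offset : pos - offset + len(ks)]
        out.extend(b ^ k for b, k in zip(chunk, ks))
        pos += len(chunk)
    return bytes(out)


key = b"k" * 32
nonce = b"n" * 8
plaintext = bytes(range(256)) * 4

# Encrypt the whole "file" once.
ciphertext = ctr_xor(key, nonce, plaintext)

# Later, decrypt only 10 bytes at offset 100, without reading anything else.
piece = ctr_xor(key, nonce, ciphertext[100:110], offset=100)
```

Note that plain CTR gives no integrity protection, so a per-block MAC would still have to be checked separately, and a nonce must never be reused with the same key.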

However, I do think it makes sense to consider column-level encryption
keys/policies in which case it may be easier to do at a higher level.
Though, it may be possible for the higher level to just pass down a key ID
into the FS layer when writing a file, so that the policy can be set at a
high level while the implementation is done at a lower one.
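
A rough sketch of that layering, under stated assumptions: `FsLayer`, `ColumnWriter`, the policy dictionary, and the key IDs are all hypothetical names invented for illustration, and `toy_cipher` is an insecure placeholder standing in for real encryption. The higher layer decides which key ID applies to a column; the lower layer only sees a key ID and resolves it against a keystore.

```python
import hashlib
from typing import Dict, Optional, Tuple


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder XOR stream cipher (NOT secure); stands in for real encryption."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


class FsLayer:
    """Hypothetical low-level block store: knows key IDs, not columns or policies."""

    def __init__(self, keystore: Dict[str, bytes]) -> None:
        self._keystore = keystore
        self._blocks: Dict[str, Tuple[Optional[str], bytes]] = {}

    def write_block(self, block_id: str, data: bytes,
                    key_id: Optional[str] = None) -> None:
        if key_id is not None:
            data = toy_cipher(self._keystore[key_id], data)
        # Record the key ID next to the block so reads can find the right key.
        self._blocks[block_id] = (key_id, data)

    def read_block(self, block_id: str) -> bytes:
        key_id, data = self._blocks[block_id]
        if key_id is not None:
            data = toy_cipher(self._keystore[key_id], data)
        return data

    def raw_block(self, block_id: str) -> bytes:
        """Bytes as stored, i.e. what someone reading the disk would see."""
        return self._blocks[block_id][1]


class ColumnWriter:
    """Hypothetical higher layer: owns the per-column encryption policy."""

    def __init__(self, fs: FsLayer, policy: Dict[str, Optional[str]]) -> None:
        self._fs = fs
        self._policy = policy

    def write_column(self, column: str, data: bytes) -> str:
        block_id = f"{column}-block"
        # The policy decision is made here; only the key ID is passed down.
        self._fs.write_block(block_id, data, key_id=self._policy.get(column))
        return block_id


fs = FsLayer(keystore={"key-ssn": b"s" * 32})
writer = ColumnWriter(fs, policy={"ssn": "key-ssn", "city": None})
ssn_block = writer.write_column("ssn", b"123-45-6789")
city_block = writer.write_column("city", b"Springfield")
```

In this arrangement the FS layer never needs to know what a column is, which is the separation of policy and mechanism described above.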

> - Dan
> ------------------------------
>> *From: *"Dan Burkert" <danburkert@apache.org>
>> *To: *user@kudu.apache.org
>> *Sent: *Tuesday, May 2, 2017 2:54:26 PM
>> *Subject: *Re: Data encryption in Kudu
>> Hi Franco,
>> Thanks for the writeup!  I'm not an Oracle expert, but my interpretation
>> of the TDE column level encryption documentation/implementation is very
>> different than yours.  As far as I can tell, in both the per-column and
>> table-space encryption modes, encryption/decryption is handled entirely on
>> the Oracle server.  The difference is that column-level encryption will
>> encrypt individual cells on disk (leaving the overall tree/index structure
>> unencrypted), while table-space level encryption will encrypt at the block
>> or file level.
>> I agree with everything you wrote about the tradeoffs involved with client
>> vs server encryption, but I think you are underestimating both the
>> complexity involved with client-side encryption, as well as the performance
>> hit that it would impose.  The loss of encoding, compression, and range
>> predicate pushdown would absolutely kill performance for many important
>> usecases.  The implementation would also be significantly _more_ difficult
>> than server side encryption, because the client would need to manage the
>> encryption keys, encrypt/decrypt data, and the solution would need to be
>> implemented for every client library (of which there are currently two).
>> For those reasons, I think server side encryption is the way to go with
>> Kudu.  I think you're right that it would slot in as an additional step in
>> the encode -> compress -> encrypt pipeline for blocks.  Because blocks are
>> relatively large (typically > 1 MiB), the overhead of a 16 byte salt and
>> additional MAC are negligible, so we wouldn't need to force the user to
>> make that tradeoff.  Basically, we could get all of the advantages that
>> Oracle's tablespace level encryption provides, but on a per-column basis.
>> There are a couple of additional complications - we also have a WAL that
>> lives outside of our file block abstraction, and we would almost certainly
>> need to provide encryption for that as well (but perhaps it could be a
>> second step in the process).
>> In-line responses to some other comments below.
>> On Sat, Apr 29, 2017 at 8:35 PM, Franco Venturi <fventuri@comcast.net>
>> wrote:
>>> - also from the security point of view, since the encryption happens at
>>> the client side, the data that is transferred on the network between the
>>> client and the server is already encrypted and there's no need (at least
>>> from this point of view) to add a layer of encryption between client and
>>> server
>> I'm skeptical of this.  For instance, every scan request includes the
>> names and types of the columns that the client wishes to scan, and that
>> would be in plaintext without wire encryption.  That would be an issue for
>> some usecases.
>>> - from the security point of view, an attacker with full access to the
>>> server would probably be able to decrypt the encrypted data
>> Could you elaborate on this?  As long as we use an external keystore and
>> intermediate keys, I don't know how an attacker with access to the on-disk
>> files could decrypt them.
>>> - also from a security point of view the server returns the data back in
>>> plaintext format; if the data transferred over the network contains
>>> sensitive information, it would need an extra encryption layer like TLS or
>>> something like that
>> Correct, and Kudu 1.3 includes TLS wire encryption for exactly this
>> reason.
>>> - as per performance implications, if the encryption on the server side
>>> uses something like AES192 or AES256, there are libraries like libcrypto
>>> that take advantage of the hardware acceleration for AES encryption on many
>>> modern CPUs and therefore I suspect the performance overhead would be
>>> limited; this is also indicated by what the Oracle documentation says
>>> regarding processing overhead in the case of tablespace encryption in TDE
>> I agree, I think the overhead of per-block encryption would be pretty
>> minimal.
>>> - it would also require a way to have the server manage these column
>>> encryption keys (possibly through additional client APIs); I haven't looked
>>> yet at the way Oracle handles encryption/decryption keys for the tablespace
>>> encryption TDE, but it's on my 'to-do' list
>> Yah, the normal thing to do here is call out to an external keystore that
>> holds a master encryption key.
>> - Dan
>> ------------------------------
>>> *From: *fventuri@comcast.net
>>> *To: *user@kudu.apache.org
>>> *Sent: *Wednesday, April 26, 2017 9:48:07 PM
>>> *Subject: *Re: Data encryption in Kudu
>>> David, Dan, Todd,
>>> thanks for your prompt replies.
>>> At this stage I am just exploring what it would take to implement some
>>> sort of data encryption in Kudu.
>>> After reading your comments here are some further thoughts:
>>> - according to the first sentence in this paragraph in the Kudu docs (
>>> https://kudu.apache.org/docs/schema_design.html#compression):
>>>          Kudu allows per-column compression using the LZ4, Snappy, or
>>> zlib compression codecs.
>>> it should be possible to perform per-column encryption by adding
>>> 'encryption codecs' right after the compression codecs. I browsed through
>>> the code quickly and I think this is done when reading/writing a 'cfile'
>>> (please correct me if I am wrong). If this is correct, this change could be
>>> 'minimally invasive' (at least for the 'cfile' part) and would not require
>>> a major overhaul of the Kudu architecture.
>>> - as per the key management aspect, I am not a security expert at all,
>>> so I am not sure what would be the best approach here - my thought here is
>>> that in most places Kudu is deployed together with HDFS, so it would be
>>> 'desirable' if the key management were consistent between the two services;
>>> on the other hand, I also realize that the basic premises are fundamentally
>>> different: HDFS encrypts everything at the client level and therefore the
>>> HDFS engine itself is almost completely unaware that the data it stores is
>>> actually encrypted (except for a special file hidden attribute, if I
>>> understand correctly), while in Kudu the storage engine must have both the
>>> 'public' key (when encrypting) and the 'private' key (when decrypting)
>>> otherwise it can't take advantage of knowing the 'structure' of the data
>>> (for instance the Bloom filters probably wouldn't work with the key being
>>> encrypted). This means for instance that an attacker who is able to gain
>>> access to the Kudu tablet servers would probably be able to decrypt the
>>> data. Also one way to achieve something similar to what HDFS does (i.e.
>>> client-based encryption and data encrypted in-flight) could be perhaps
>>> using a one-time client certificate generated by the KMS server, but this
>>> would also require changes to the client code.
>>> Franco
>>> ------------------------------
>>> *From: *"Todd Lipcon" <todd@cloudera.com>
>>> *To: *user@kudu.apache.org
>>> *Sent: *Tuesday, April 25, 2017 3:49:50 PM
>>> *Subject: *Re: Data encryption in Kudu
>>> Agreed with what Dan said.
>>> I think there are a number of interesting design alternatives to be
>>> considered, so before coding it would be great to work through a design
>>> document to explore the alternatives. For example, we could try to apply
>>> encryption at the 'fs/' layer, which would cover all non-WAL data, but then
>>> we would lose the ability to specify encryption on a per-column basis.
>>> There are other requirements that need to be ironed out about whether we'd
>>> need to support separate encryption keys per column/table/server/etc,
>>> whether metadata also needs to be encrypted, etc.
>>> -Todd
>>> On Tue, Apr 25, 2017 at 10:38 AM, Dan Burkert <danburkert@apache.org>
>>> wrote:
>>>> Hi Franco,
>>>> I think you are right that a client-based approach wouldn't work,
>>>> because we wouldn't want to encrypt at the level of individual cell
>>>> values.  That would get in the way of encoding, compression, predicate
>>>> evaluation, etc.  As you note, adding encryption at the block layer is
>>>> probably the way to go.  Key management is definitely the tricky issue.  We
>>>> do have one advantage over HDFS - because Kudu does logical replication,
>>>> the encryption key can be scoped to a particular tablet server or tablet
>>>> replica, it wouldn't need to be shared among all replicas.  I haven't done
>>>> enough research to know if this makes it fundamentally easier to do key
>>>> management.  I would assume at a minimum we would want to integrate with
>>>> key providers such as an HSM.  It would be good to have a thorough review of
>>>> existing solutions in the space, such as TDE
>>>> <https://en.wikipedia.org/wiki/Transparent_Data_Encryption> and the
>>>> Hadoop KMS.  Is this something you are interested in working on?
>>>> - Dan
>>>> On Tue, Apr 25, 2017 at 8:30 AM, David Alves <davidralves@gmail.com>
>>>> wrote:
>>>>> Hi Franco
>>>>>   Dan, Alexey, Todd are our security experts.
>>>>>   Folks, thoughts on this?
>>>>> Best
>>>>> David
>>>>> On Mon, Apr 24, 2017 at 7:08 PM, <fventuri@comcast.net> wrote:
>>>>>> Over the weekend I started looking at what it would take to add data
>>>>>> encryption to Kudu (besides using filesystem encryption via dm-crypt or
>>>>>> something like that).
>>>>>> Here are a few notes - please feel free to comment on them and add
>>>>>> suggestions:
>>>>>> - reading through this mailing list, it looks like this feature has
>>>>>> been asked about a couple of times last year, but from what I can tell, no one
>>>>>> is currently working on it.
>>>>>> - a client-based approach to encryption like the one used by HDFS
>>>>>> wouldn't work (at least out of the box) because for instance encrypting
>>>>>> primary key at the client would prevent being able to have range
>>>>>> for scans; it might work for the columns that are not part of the
>>>>>> key
>>>>>> - there's already code in Kudu for several compression codecs (LZ4,
>>>>>> gzip, etc); I thought it would be possible to add similar code for
>>>>>> encryption codecs (to be applied after the compression, of course)
>>>>>> - the WAL log files and delta files should be similarly encrypted
>>>>>> - not sure what would be the best way to manage the key - I see that
>>>>>> in HDFS they use a double key mechanism, where the encryption key for the
>>>>>> data file is itself encrypted with the allowed user key and this
>>>>>> process is managed by an external Key Management Service
>>>>>> Thanks in advance for your ideas and suggestions,
>>>>>> Franco
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera

Todd Lipcon
Software Engineer, Cloudera
