kudu-user mailing list archives

From Franco Venturi <fvent...@comcast.net>
Subject Re: Data encryption in Kudu
Date Wed, 03 May 2017 03:38:22 GMT

First of all, thanks for reading through my long post and providing your comments and advice.

You are 100% correct on the TDE column encryption in Oracle; I looked it up again in the 'Introduction
to Transparent Data Encryption' in the 'Database Advanced Security Guide' (https://docs.oracle.com/database/121/ASOAG/asotrans.htm#ASOAG10117)
and Figure 2-1 clearly shows the keys being stored in the database.
With this piece of information, it doesn't seem to me that Oracle column TDE offers much protection
against an active attacker who has full access to the DB server, since there must be
a process somewhere through which the database engine is able to retrieve the decryption key for a
given column.

Another interesting piece of information in that chapter is this sentence: 

TDE tablespace encryption also allows index range scans on data in encrypted tablespaces.
This is not possible with TDE column encryption. 

which makes me think that TDE column encryption must encrypt the data before placing it into
the Btree, and therefore is not able to use the Btree for range searches. 

I think the main reason why an organization would want one or the other type of encryption
(client-side vs server-side) is what kind of possible attack they are trying to prevent (and
the criteria are often dictated by internal security policies): 
- with server-side encryption, the encrypted data is protected against a disk being lost (the
so called 'encryption at rest'), but it is not protected against an active attacker on the
server with full access (they could retrieve the key and then decrypt the data). 
- with client-side encryption, the server has no way to decrypt the data and therefore even
the active attacker above wouldn't be able to do much with the encrypted data. As I mentioned
in my previous post, this is similar to what HDFS does for transparent data encryption and
I think it's one of their selling points ('not even root can decrypt the data on HDFS'), and
for some IT security groups this may sound attractive. 

I also spent some time in the last two days looking at what MySQL/MariaDB and PostgreSQL do
and this is what I found: 
- MySQL/MariaDB seems to only have table-level encryption (https://mariadb.com/kb/en/mariadb/data-at-rest-encryption/),
and therefore it is of the server-side type 
- PostgreSQL's encryption options (https://www.postgresql.org/docs/9.6/static/encryption-options.html)
list this module 'pgcrypto', that does column-level encryption, but the decryption happens
on the server with the key being provided by the client, hence it looks like a hybrid between
server-side and client-side. 

I 100% agree with the performance concerns you raise about client-side encryption (no range scans
on the encrypted columns, no compression, RLE, etc.), to the point that last night I wondered
whether other people had asked themselves similar questions, and I did find a couple of interesting projects:
- CryptDB (http://css.csail.mit.edu/cryptdb/ - the main paper is here: http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf)

- ZeroDB (https://opensource.zerodb.com/) 

In order to be able to do range scans, CryptDB for instance uses 'Order Preserving Encryption',
which in theory encrypts data in a way that preserves ordering, i.e. Enc(x) <
Enc(y) iff x < y; however, several later research papers show that Order Preserving
Encryption leaks a significant amount of information about the encrypted data and is susceptible
to frequency and other kinds of attacks. As you can imagine, there's a lot of academic research
actively being done in this field and, even if it is not ready for prime time, I thought I would
share these findings.
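
To make the leakage concrete, here is a minimal Python sketch of a toy order-preserving map. It is an illustration only: it is not CryptDB's actual scheme, it is not secure, and the names (build_toy_ope, salaries) are made up for this example. The seed stands in for the secret key.

```python
import random
from collections import Counter


def build_toy_ope(domain_size, seed):
    """Toy order-preserving map over 0..domain_size-1 (hypothetical, NOT secure).

    Builds a strictly increasing random table; the seed plays the role of the key.
    """
    rng = random.Random(seed)
    table = []
    current = 0
    for _ in range(domain_size):
        current += rng.randint(1, 1000)  # strictly positive gaps keep the map increasing
        table.append(current)
    return table


ope = build_toy_ope(domain_size=100, seed=42)

salaries = [30, 30, 30, 55, 55, 90]
ciphertexts = [ope[s] for s in salaries]

# Enc(x) < Enc(y) iff x < y, so a server could run range scans on ciphertexts.
order_preserved = all(
    (ope[x] < ope[y]) == (x < y) for x in range(100) for y in range(100)
)

# The same property leaks information: equal plaintexts give equal ciphertexts,
# so the frequency histogram and the rank of every value are visible to the server.
plain_freq = sorted(Counter(salaries).values())
cipher_freq = sorted(Counter(ciphertexts).values())
```

The ciphertext list is sorted in the same order as the plaintexts and has the same frequency histogram, which is the kind of information a frequency attack starts from.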

After this long digression (hopefully not too boring), I agree that the way forward would
be to start by looking into the encryption of the file store (I think they are called 'cfiles';
I also saw mentions of some 'delta' files, and I am not sure if they are written the same
way and should be encrypted too), and after that the WALs.

Oh, one last thing; you asked me: 

Could you elaborate on this? As long as we use an external keystore and intermediate keys,
I don't know how an attacker with access to the on-disk files could decrypt them. 

The scenario I was thinking of is an attacker who has full access to the tablet server; he
can not only read the on-disk files, but he also knows how the tablet server retrieves the
intermediate keys from the external keystore, i.e. he is able to 'impersonate' the tablet
server engine and request the decryption key from wherever it is stored.
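
The difference between the two attackers can be sketched in a few lines of Python. Everything here is hypothetical: ToyKeystore, the identity string, and the toy XOR cipher are stand-ins invented for the example (the Python standard library has no AES), and none of it reflects how Kudu or any real KMS works.

```python
import hashlib
import os


def toy_xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Stand-in for a real cipher, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKeystore:
    """Hypothetical external keystore: keeps the master key, wraps/unwraps data keys."""

    def __init__(self):
        self._master_key = os.urandom(32)
        self._authorized = {"tablet-server-1"}

    def _check(self, identity: str) -> None:
        if identity not in self._authorized:
            raise PermissionError(identity)

    def wrap(self, identity: str, data_key: bytes) -> bytes:
        self._check(identity)
        return toy_xor_cipher(self._master_key, data_key)

    def unwrap(self, identity: str, wrapped_key: bytes) -> bytes:
        self._check(identity)
        return toy_xor_cipher(self._master_key, wrapped_key)


keystore = ToyKeystore()
plaintext = b"sensitive cell values"
data_key = os.urandom(32)  # the intermediate key

# What ends up on the tablet server's disk: the wrapped key and the encrypted block.
on_disk = {
    "wrapped_key": keystore.wrap("tablet-server-1", data_key),
    "block": toy_xor_cipher(data_key, plaintext),
}

# Attacker with only the disk: the wrapped key is useless without the master key.
disk_only_guess = toy_xor_cipher(on_disk["wrapped_key"], on_disk["block"])

# Attacker who can impersonate the tablet server to the keystore.
stolen_key = keystore.unwrap("tablet-server-1", on_disk["wrapped_key"])
recovered = toy_xor_cipher(stolen_key, on_disk["block"])
```

In this sketch the disk-only attacker gets garbage, while the attacker who can present the tablet server's identity to the keystore recovers the plaintext.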


----- Original Message -----

From: "Dan Burkert" <danburkert@apache.org> 
To: user@kudu.apache.org 
Sent: Tuesday, May 2, 2017 2:54:26 PM 
Subject: Re: Data encryption in Kudu 

Hi Franco, 

Thanks for the writeup! I'm not an Oracle expert, but my interpretation of the TDE column
level encryption documentation/implementation is very different than yours. As far as I can
tell, in both the per-column and table-space encryption modes, encryption/decryption is handled
entirely on the Oracle server. The difference is that column-level encryption will encrypt
individual cells on disk (leaving the overall tree/index structure unencrypted), while table-space
level encryption will encrypt at the block or file level. 

I agree with everything you wrote about the tradeoffs involved with client vs server encryption,
but I think you are underestimating both the complexity involved with client-side encryption,
as well as the performance hit that it would impose. The loss on encoding, compression, and
range predicate pushdown would absolutely kill performance for many important usecases. The
implementation would also be significantly _more_ difficult than server side encryption, because
the client would need to manage the encryption keys, encrypt/decrypt data, and the solution
would need to be implemented for every client library (of which there are currently two).

For those reasons, I think server side encryption is the way to go with Kudu. I think you're
right that it would slot in as an additional step in the encode -> compress -> encrypt
pipeline for blocks. Because blocks are relatively large (typically > 1 MiB), the overhead
of a 16 byte salt and additional MAC are negligible, so we wouldn't need to force the user
to make that tradeoff. Basically, we could get all of the advantages that Oracle's tablespace
level encryption provides, but on a per-column basis. There are a couple of additional complications
- we also have a WAL that lives outside of our file block abstraction, and we would almost
certainly need to provide encryption for that as well (but perhaps it could be a second step
in the process). 

In-line responses to some other comments below. 

On Sat, Apr 29, 2017 at 8:35 PM, Franco Venturi < fventuri@comcast.net > wrote: 

- also from the security point of view, since the encryption happens at the client side, the
data that is transferred on the network between the client and the server is already encrypted
and there's no need (at least from this point of view) to add a layer of encryption between
client and server 

I'm skeptical of this. For instance, every scan request includes the names and types of the
columns that the client wishes to scan, and that would be in plaintext without wire encryption.
That would be an issue for some usecases. 


- from the security point of view, an attacker with full access to the server would probably
be able to decrypt the encrypted data 


Could you elaborate on this? As long as we use an external keystore and intermediate keys,
I don't know how an attacker with access to the on-disk files could decrypt them. 


- also from a security point of view the server returns the data back in plaintext format;
if the data transferred over the network contains sensitive information, it would need an
extra encryption layer like TLS or something like that 


Correct, and Kudu 1.3 includes TLS wire encryption for exactly this reason. 


- as per performance implications, if the encryption on the server side uses something like
AES192 or AES256, there are libraries like libcrypto that take advantage of the hardware acceleration
for AES encryption on many modern CPUs and therefore I suspect the performance overhead would
be limited; this is also indicated by what the Oracle documentation says regarding processing
overhead in the case of tablespace encryption in TDE 


I agree, I think the overhead of per-block encryption would be pretty minimal. 


- it would also require a way to have the server manage these column encryption keys (possibly
through additional client APIs); I haven't looked yet at the way Oracle handles encryption/decryption
keys for the tablespace encryption TDE, but it's on my 'to-do' list 


Yah, the normal thing to do here is call out to an external keystore that holds a master encryption
- Dan 


From: fventuri@comcast.net 
To: user@kudu.apache.org 
Sent: Wednesday, April 26, 2017 9:48:07 PM 

Subject: Re: Data encryption in Kudu 

David, Dan, Todd, 
thanks for your prompt replies. 

At this stage I am just exploring what it would take to implement some sort of data encryption
in Kudu. 

After reading your comments here are some further thoughts: 

- according to the first sentence in this paragraph in the Kudu docs ( https://kudu.apache.org/docs/schema_design.html#compression ):

Kudu allows per-column compression using the LZ4 , Snappy , or zlib compression codecs. 

it should be possible to perform per-column encryption by adding 'encryption codecs' right
after the compression codecs. I browsed through the code quickly and I think this is done when
reading/writing a 'cfile' (please correct me if I am wrong). If this is correct, this change
could be 'minimally invasive' (at least for the 'cfile' part) and would not require a major
overhaul of the Kudu architecture. 

- as per the key management aspect, I am not a security expert at all, so I am not sure what
would be the best approach here - my thought here is that in most places Kudu is deployed
together with HDFS, so it would be 'desirable' if the key management were consistent between
the two services; on the other hand, I also realize that the basic premises are fundamentally
different: HDFS encrypts everything at the client level and therefore the HDFS engine itself
is almost completely unaware that the data it stores is actually encrypted (except for a special
file hidden attribute, if I understand correctly), while in Kudu the storage engine must have
both the 'public' key (when encrypting) and the 'private' key (when decrypting) otherwise
it can't take advantage of knowing the 'structure' of the data (for instance the Bloom filters
wouldn't probably work with the key being encrypted). This means for instance that an attacker
who is able to gain access to the Kudu tablet servers would probably be able to decrypt the
data. Also one way to achieve something similar to what HDFS does (i.e. client-based encryption
and data encrypted in-flight) could be perhaps using a one-time client certificate generated
by the KMS server, but this would also require changes to the client code. 


From: "Todd Lipcon" < todd@cloudera.com > 
To: user@kudu.apache.org 
Sent: Tuesday, April 25, 2017 3:49:50 PM 
Subject: Re: Data encryption in Kudu 

Agreed with what Dan said. 

I think there are a number of interesting design alternatives to be considered, so before
coding it would be great to work through a design document to explore the alternatives. For
example, we could try to apply encryption at the 'fs/' layer, which would cover all non-WAL
data, but then we would lose the ability to specify encryption on a per-column basis. There
are other requirements that need to be ironed out about whether we'd need to support separate
encryption keys per column/table/server/etc, whether metadata also needs to be encrypted,


On Tue, Apr 25, 2017 at 10:38 AM, Dan Burkert < danburkert@apache.org > wrote: 


Hi Franco, 

I think you are right that a client-based approach wouldn't work, because we wouldn't want
to encrypt at the level of individual cell values. That would get in the way of encoding,
compression, predicate evaluation, etc. As you note, adding encryption at the block layer
is probably the way to go. Key management is definitely the tricky issue. We do have one advantage
over HDFS - because Kudu does logical replication, the encryption key can be scoped to a particular
tablet server or tablet replica, it wouldn't need to be shared among all replicas. I haven't
done enough research to know if this makes it fundamentally easier to do key management. I
would assume at a minimum we would want to integrate with key providers such as an HSM. It would
be good to have a thorough review of existing solutions in the space, such as TDE and the
Hadoop KMS. Is this something you are interested in working on? 

- Dan 

On Tue, Apr 25, 2017 at 8:30 AM, David Alves < davidralves@gmail.com > wrote: 


Hi Franco 

Dan, Alexey, Todd are our security experts. 
Folks, thoughts on this? 


On Mon, Apr 24, 2017 at 7:08 PM, < fventuri@comcast.net > wrote: 


Over the weekend I started looking at what it would take to add data encryption to Kudu (besides
using filesystem encryption via dm-crypt or something like that). 

Here are a few notes - please feel free to comment on them and add suggestions: 

- reading through this mailing list, it looks like this feature has been asked for a couple of
times in the last year, but from what I can tell, no one is currently working on it.
- a client-based approach to encryption like the one used by HDFS wouldn't work (at least
out of the box) because for instance encrypting the primary key at the client would prevent
being able to have range filters for scans; it might work for the columns that are not part
of the primary key 
- there's already code in Kudu for several compression codecs (LZ4, gzip, etc); I thought
it would be possible to add similar code for encryption codecs (to be applied after the compression,
of course) 
- the WAL log files and delta files should be similarly encrypted too 
- not sure what would be the best way to manage the key - I see that in HDFS they use a double
key mechanism, where the encryption key for the data file is itself encrypted with the allowed
user key and this whole process is managed by an external Key Management Service 

Thanks in advance for your ideas and suggestions, 




Todd Lipcon 
Software Engineer, Cloudera 

