kudu-user mailing list archives

From Franco Venturi <fvent...@comcast.net>
Subject Re: Data encryption in Kudu
Date Sun, 30 Apr 2017 03:35:53 GMT

In the last couple of days I did some reading on Oracle TDE (Transparent Data Encryption)
and had some discussions with people at work, and that helped me clarify my ideas about encryption
in Kudu. 

- The most important thing I realized is that there are basically (at least) two orthogonal
ways to achieve 'data encryption' in Kudu (and in Oracle as well): 
- client-side encryption (which Oracle calls 'TDE column encryption') 
- server-side encryption (which Oracle calls 'TDE tablespace encryption') 

- I prefer the terms 'client-side encryption' and 'server-side encryption', because in the
case of Kudu the data is written to disk by columns, and therefore a term like 'column encryption'
might be misleading; also the terms 'client-side' and 'server-side' help understand IMHO some
of the advantages and drawbacks of each approach and some high level details of how they could
be implemented. 

- During the next items I'll be referring to the Oracle TDE documents; two of them that I
found especially useful are: 
- Oracle Advanced Security Transparent Data Encryption (TDE) FAQ (http://www.oracle.com/technetwork/database/options/advanced-security/overview/advanced-security-tde-faq-2995212.pdf)

- These two chapters in the Oracle Database Advanced Security Guide (https://docs.oracle.com/database/121/ASOAG/toc.htm)

- Introduction to Transparent Data Encryption 
- General Considerations of Using Transparent Data Encryption 
- I also found useful the chapter about 'Transparent Data Encryption' in the Oracle 'Advanced
Security Guide' for Oracle DB version 10.2 (https://docs.oracle.com/cd/B19306_01/network.102/b14268/asotrans.htm#ASOAG600)
because it has a couple of pictures that are not in the newer version. 

Some subnotes about client-side encryption: 

- the idea here is that all the encryption happens on the client and the client sends encrypted
data to the server; the Kudu server engine never sees any plaintext data 
- this point has a few important consequences: 
- the encrypted column (assuming that encryption is on a column-by-column basis, which is probably
what most users would need) is always stored as a 'byte' type regardless of its original type

- for that encrypted column, the size of each entry is going to be somewhat bigger than if that
entry were stored as plaintext - this is due both to the need to store the 'padding' required
by the block cipher and possibly a MAC for integrity validation (see the Oracle documents
above for more details) 
- for that encrypted column, the RLE or dictionary or other encodings don't bring any advantage
(compression or otherwise) since the data is just random data from the point of view of the
Kudu server. For this kind of column the encoding would have to be just plain natural format
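
A rough sketch of the two storage effects described above: the per-cell growth from padding and a MAC, and the loss of compressibility. The layout (IV, then padded ciphertext, then MAC), the 16-byte IV and the 20-byte MAC are assumptions chosen for illustration, not something Kudu defines. The "ciphertext" is just pseudo-random bytes produced with SHA-256, standing in for real cipher output.

```python
import hashlib
import zlib

AES_BLOCK = 16   # AES block size in bytes
IV_LEN = 16      # assumed: one IV stored per value
MAC_LEN = 20     # assumed: a 20-byte MAC per value


def encrypted_cell_size(plain_len: int, with_mac: bool = True) -> int:
    """Stored size of one cell under the assumed layout IV || padded ciphertext || MAC."""
    # PKCS#7-style padding always adds 1..16 bytes, so a value whose length is
    # already a multiple of the block size still grows by one full block.
    padded = (plain_len // AES_BLOCK + 1) * AES_BLOCK
    return IV_LEN + padded + (MAC_LEN if with_mac else 0)


def pseudo_random_bytes(n: int) -> bytes:
    """Deterministic stand-in for ciphertext: SHA-256 in counter mode (not a cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


# An 8-byte int64 value: 16 (IV) + 16 (padded value) + 20 (MAC) bytes.
int64_cell = encrypted_cell_size(8)

# A low-cardinality column compresses well as plaintext ...
plain_column = b"US" * 5000 + b"IT" * 5000
plain_compressed = len(zlib.compress(plain_column))

# ... but bytes that look random to the server do not compress at all.
cipher_like = pseudo_random_bytes(len(plain_column))
cipher_compressed = len(zlib.compress(cipher_like))

print(int64_cell, len(plain_column), plain_compressed, cipher_compressed)
```

With different assumptions (no per-value IV, a shorter MAC) the per-cell overhead shrinks, but the compressibility result stays the same.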

- if the encrypted column is used as a key or in a 'where' clause of a 'select' statement,
several other considerations also apply: 
- range selects on that column are not possible and they would become full scans over that
column (same thing with Oracle) 
- 'range' comparison operators like 'greater than' or 'less than' on that column are not possible
and they would become full scans over that column 
- 'exact' comparison operators like 'equal to' or 'not equal to' on that column are only possible
if the encryption scheme is one-to-one (i.e. if for a given plain text there's only one way
it can be encrypted, which typically means the encryption algorithm cannot make use of a 'salt');
otherwise we go back to full scans of that column 
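
To make the one-to-one versus salted distinction concrete, here is a toy sketch. The cipher is a home-made XOR stream built from SHA-256, used only because it needs nothing beyond the standard library; it is NOT secure and is not what a real implementation would use. The function names and the trick of deriving the nonce from the plaintext in the unsalted mode are assumptions made for this illustration.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key || nonce || counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:n])


def encrypt(key: bytes, plaintext: bytes, salted: bool) -> bytes:
    # salted=True : fresh random nonce per call (one-to-many).
    # salted=False: nonce derived from the plaintext itself (one-to-one).
    if salted:
        nonce = os.urandom(16)
    else:
        nonce = hmac.new(key, plaintext, hashlib.sha256).digest()[:16]
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, cell: bytes) -> bytes:
    nonce, body = cell[:16], cell[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = b"demo-key-not-for-production-use!"
value = (42).to_bytes(8, "big")

# One-to-one: the client can encrypt the literal and ask the server for byte equality.
equal_deterministic = encrypt(key, value, salted=False) == encrypt(key, value, salted=False)

# One-to-many: two encryptions of the same value differ, so equality needs a full scan.
equal_salted = encrypt(key, value, salted=True) == encrypt(key, value, salted=True)

# Range predicates fail in both modes: ciphertext byte order is unrelated to value order.
values = list(range(20))
cells = [encrypt(key, v.to_bytes(8, "big"), salted=False) for v in values]
order_preserved = cells == sorted(cells)

print(equal_deterministic, equal_salted, order_preserved)
```

The trade-off is that the one-to-one mode reveals which rows hold equal values, which is exactly the information a salt is meant to hide.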

- given these points, these are the consequences from the performance point of view: 
- since the overhead of the encryption happens on the client side, the performance of the
server itself is not significantly affected by the client-side encryption except for the fact
that it loses any possible advantage that the column encodings could have given (compression, etc)
- however (and this is a big HOWEVER), any selection on an encrypted column that involves
a range (and possibly even an exact selection if the encryption scheme uses a 'salt') becomes
a full column scan (at the client side, since the server is helpless to 'understand' what
the encrypted data mean); this means that a select on a 'big data' table of millions/billions
of rows becomes extremely slow, because for that column all the rows have to be sent back
to the client and the client has to decrypt them and decide which ones satisfy the selection
criteria (and as you can imagine, there are also significant network implications here because
all the entries need to be sent back to the client). 
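
A back-of-the-envelope estimate of the network cost described above. The table size, the cell size and the selectivity are made-up numbers, and the same cell size is used in both cases only to keep the comparison simple.

```python
def bytes_to_client(total_rows: int, cell_bytes: int, selectivity: float,
                    server_can_filter: bool) -> int:
    """Bytes of one column shipped to the client for a single predicate."""
    # If the server cannot evaluate the predicate, every cell has to be shipped.
    rows_sent = round(total_rows * selectivity) if server_can_filter else total_rows
    return rows_sent * cell_bytes


ROWS = 1_000_000_000   # assumed table size
CELL = 52              # assumed size of one encrypted cell, in bytes
SELECTIVITY = 0.001    # assumed fraction of rows matching the range

full_scan = bytes_to_client(ROWS, CELL, SELECTIVITY, server_can_filter=False)
pushed_down = bytes_to_client(ROWS, CELL, SELECTIVITY, server_can_filter=True)

print(f"client-side filtering: {full_scan / 1e9:.1f} GB")
print(f"server-side filtering: {pushed_down / 1e6:.1f} MB")
```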

- also this approach is very similar to what HDFS does for their transparent encryption, and
I would imagine that in this case we could leverage some of the already existing key management
infrastructure offered by HDFS. 

- from the security point of view with client-side encryption, the server has no knowledge
of what the encrypted data actually means, i.e. an attacker on the server itself would not
be able to decrypt the data 
- also from the security point of view, since the encryption happens at the client side, the
data that is transferred on the network between the client and the server is already encrypted
and there's no need (at least from this point of view) to add a layer of encryption between
client and server 

- the practical implementation of client-side encryption would require some minor changes
on the server code; for now I can think of the following: 
- an additional field on each column that indicates if the column is encrypted (the field
could be a 2-byte cipher suite id as defined in RFC 5246 - with the value 0 meaning that the
column is not encrypted) 
- if the column encryption id is not 0, the column would be internally stored as a byte type
and the server would be expected to receive (and send back) byte type data for any entry belonging
to that column
- another boolean field on each column that indicates if the encryption scheme is one-to-one
(i.e. it doesn't use a 'salt') or one-to-many 
- to avoid any problems with 'bad' clients that don't understand the limitations above, the
server could return an 'invalid request' error if the client attempts to run a 'range' search
on an encrypted column or an 'exact' search on an encrypted column where the encryption algorithm
is not one-to-one 
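
The server-side changes listed above could look roughly like this. It is only a sketch: `ColumnEncryption`, `InvalidRequest`, `check_predicate` and the operator strings are hypothetical names, not part of the Kudu code base.

```python
from dataclasses import dataclass


class InvalidRequest(Exception):
    """Hypothetical error the server returns for unsupported predicates."""


RANGE_OPS = {"<", "<=", ">", ">="}
EXACT_OPS = {"==", "!="}


@dataclass(frozen=True)
class ColumnEncryption:
    cipher_suite_id: int = 0   # 2-byte id; 0 means the column is not encrypted
    one_to_one: bool = False   # True if the scheme uses no salt

    def __post_init__(self):
        if not 0 <= self.cipher_suite_id <= 0xFFFF:
            raise ValueError("cipher suite id must fit in 2 bytes")

    @property
    def encrypted(self) -> bool:
        return self.cipher_suite_id != 0


def check_predicate(col: ColumnEncryption, op: str) -> None:
    """Reject predicates the server cannot evaluate on ciphertext."""
    if op not in RANGE_OPS | EXACT_OPS:
        raise ValueError(f"unknown operator: {op}")
    if not col.encrypted:
        return  # plaintext column: every predicate is fine
    if op in RANGE_OPS:
        raise InvalidRequest("range predicate on an encrypted column")
    if not col.one_to_one:
        raise InvalidRequest("exact predicate on a salted (one-to-many) column")


# 0x002F is the RFC 5246 id of TLS_RSA_WITH_AES_128_CBC_SHA, used here as an example value.
deterministic = ColumnEncryption(cipher_suite_id=0x002F, one_to_one=True)
check_predicate(deterministic, "==")          # accepted
try:
    check_predicate(deterministic, ">=")      # rejected
except InvalidRequest as err:
    print("rejected:", err)
```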

- the changes to the client code would instead be substantial - I thought of some of them
but I don't want to make this post even longer than it is now 

Some subnotes about server-side encryption: 

- the idea here is that the encryption happens on the server and that's what I was initially
thinking when I started this thread 
- this could be implemented by adding 'encryption codecs' right after the compression codecs.
This would happen inside the server code when reading/writing a 'cfile' (and hence it is more
or less the equivalent of Oracle tablespace encryption) 
- the server-side encryption would still be at the column level for Kudu because of the way
Kudu writes its data to disk 
- this approach would allow for range searches using B-trees, and would not have any of the
limitations listed above 
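
A small sketch of why the encryption step would sit after the compression codec rather than before it. The cipher is a toy XOR stream built from SHA-256 (NOT secure, chosen only so the example runs with the standard library), and `write_block` / `read_block` are hypothetical names, not Kudu cfile functions.

```python
import hashlib
import zlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 in counter mode); it is its own inverse."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out += bytes(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


def write_block(key: bytes, encoded: bytes) -> bytes:
    # Server-side order on write: encode -> compress -> encrypt.
    return toy_cipher(key, zlib.compress(encoded))


def read_block(key: bytes, stored: bytes) -> bytes:
    # Reverse order on read: decrypt -> decompress.
    return zlib.decompress(toy_cipher(key, stored))


key = b"per-column key (hypothetical)"
block = b"2017-04-30" * 2000   # stands in for an encoded, repetitive column block

stored = write_block(key, block)
wrong_order = zlib.compress(toy_cipher(key, block))   # encrypt first, then compress

print(len(block), len(stored), len(wrong_order))
```

Compressing first keeps the full benefit of the codec; encrypting first leaves the codec with data it cannot shrink.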

- from the security point of view, an attacker with full access to the server would probably
be able to decrypt the encrypted data 
- also from a security point of view the server returns the data back in plaintext format;
if the data transferred over the network contains sensitive information, it would need an
extra encryption layer like TLS or something like that 

- as per performance implications, if the encryption on the server side uses something like
AES192 or AES256, there are libraries like libcrypto that take advantage of the hardware acceleration
for AES encryption on many modern CPUs and therefore I suspect the performance overhead would
be limited; this is also indicated by what the Oracle documentation says regarding processing
overhead in the case of tablespace encryption in TDE 

- of course in this case the major benefit would be that exact 'selects' and range 'selects'
would work exactly like they do now (i.e. they are able to use B-trees and don't require a
full scan of the column); another benefit is that RLE encoding, dictionary encoding, etc work
as expected and offer all their benefits (compression, etc) 

- the implementation of server-side encryption would require on the server more changes than
the client-side encryption (for instance the cfile header may require an additional field
to store the size of the block after compression and before encryption) 
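
One way to picture the extra header field mentioned above: a block cipher pads its input, so the reader needs the pre-encryption (compressed) size to drop the padding again. The 4-byte header layout, the zero padding and the function names are assumptions for illustration; the encryption call itself is left out so that only the size bookkeeping is shown.

```python
import struct
import zlib

BLOCK = 16                      # assumed cipher block size in bytes
HEADER = struct.Struct("<I")    # hypothetical header: compressed size as uint32


def pack_block(raw: bytes) -> bytes:
    compressed = zlib.compress(raw)
    # Pad up to a multiple of BLOCK, as a block cipher would require.
    padded = compressed + b"\x00" * (-len(compressed) % BLOCK)
    return HEADER.pack(len(compressed)) + padded


def unpack_block(stored: bytes) -> bytes:
    (compressed_size,) = HEADER.unpack_from(stored)
    body = stored[HEADER.size:]
    # The recorded size tells the reader where the data ends and the padding starts.
    return zlib.decompress(body[:compressed_size])


raw = b"some column block contents " * 100
stored = pack_block(raw)
print(len(raw), len(stored), unpack_block(stored) == raw)
```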

- it would also require a way to have the server manage these column encryption keys (possibly
through additional client APIs); I haven't looked yet at the way Oracle handles encryption/decryption
keys for the tablespace encryption TDE, but it's on my 'to-do' list 

- finally, since these two approaches (client-side and server-side) are orthogonal, i.e. independent
of each other, if both were implemented at some time, you could have cases where some (more
security critical) columns are encrypted on the client side, while others (perhaps columns
with less stringent security requirements, and used in 'selects') are encrypted on the server
side (and of course other columns could not be encrypted at all). 

I think this is all for now; thanks for your patience reading through this long post. 


----- Original Message -----

From: fventuri@comcast.net 
To: user@kudu.apache.org 
Sent: Wednesday, April 26, 2017 9:48:07 PM 
Subject: Re: Data encryption in Kudu 

David, Dan, Todd, 
thanks for your prompt replies. 

At this stage I am just exploring what it would take to implement some sort of data encryption
in Kudu. 

After reading your comments here are some further thoughts: 

- according to the first sentence in this paragraph in the Kudu docs (https://kudu.apache.org/docs/schema_design.html#compression):

Kudu allows per-column compression using the LZ4, Snappy, or zlib compression codecs. 

it should be possible to perform per-column encryption by adding 'encryption codecs' right
after the compression codecs. I browsed through the code quickly and I think this is done when
reading/writing a 'cfile' (please correct me if I am wrong). If this is correct, this change
could be 'minimally invasive' (at least for the 'cfile' part) and would not require a major
overhaul of the Kudu architecture. 

- as per the key management aspect, I am not a security expert at all, so I am not sure what
would be the best approach here - my thought here is that in most places Kudu is deployed
together with HDFS, so it would be 'desirable' if the key management were consistent between
the two services; on the other hand, I also realize that the basic premises are fundamentally
different: HDFS encrypts everything at the client level and therefore the HDFS engine itself
is almost completely unaware that the data it stores is actually encrypted (except for a special
file hidden attribute, if I understand correctly), while in Kudu the storage engine must have
both the 'public' key (when encrypting) and the 'private' key (when decrypting) otherwise
it can't take advantage of knowing the 'structure' of the data (for instance the Bloom filters
probably wouldn't work with the key being encrypted). This means for instance that an attacker
who is able to gain access to the Kudu tablet servers would probably be able to decrypt the
data. Also one way to achieve something similar to what HDFS does (i.e. client-based encryption
and data encrypted in-flight) could be perhaps using a one-time client certificate generated
by the KMS server, but this would also require changes to the client code. 


----- Original Message -----

From: "Todd Lipcon" <todd@cloudera.com> 
To: user@kudu.apache.org 
Sent: Tuesday, April 25, 2017 3:49:50 PM 
Subject: Re: Data encryption in Kudu 

Agreed with what Dan said. 

I think there are a number of interesting design alternatives to be considered, so before
coding it would be great to work through a design document to explore the alternatives. For
example, we could try to apply encryption at the 'fs/' layer, which would cover all non-WAL
data, but then we would lose the ability to specify encryption on a per-column basis. There
are other requirements that need to be ironed out about whether we'd need to support separate
encryption keys per column/table/server/etc, whether metadata also needs to be encrypted,


On Tue, Apr 25, 2017 at 10:38 AM, Dan Burkert < danburkert@apache.org > wrote: 

Hi Franco, 

I think you are right that a client-based approach wouldn't work, because we wouldn't want
to encrypt at the level of individual cell values. That would get in the way of encoding,
compression, predicate evaluation, etc. As you note, adding encryption at the block layer
is probably the way to go. Key management is definitely the tricky issue. We do have one advantage
over HDFS - because Kudu does logical replication, the encryption key can be scoped to a particular
tablet server or tablet replica, it wouldn't need to be shared among all replicas. I haven't
done enough research to know if this makes it fundamentally easier to do key management. I
would assume at a minimum we would want to integrate with key providers such as an HSM. It would
be good to have a thorough review of existing solutions in the space, such as TDE and the
Hadoop KMS. Is this something you are interested in working on? 

- Dan 

On Tue, Apr 25, 2017 at 8:30 AM, David Alves < davidralves@gmail.com > wrote: 


Hi Franco 

Dan, Alexey, Todd are our security experts. 
Folks, thoughts on this? 


On Mon, Apr 24, 2017 at 7:08 PM, < fventuri@comcast.net > wrote: 


Over the weekend I started looking at what it would take to add data encryption to Kudu (besides
using filesystem encryption via dm-crypt or something like that). 

Here are a few notes - please feel free to comment on them and add suggestions: 

- reading through this mailing list, it looks like this feature has been asked for a couple of
times in the last year, but from what I can tell, no one is currently working on it. 
- a client-based approach to encryption like the one used by HDFS wouldn't work (at least
out of the box) because for instance encrypting the primary key at the client would prevent
being able to have range filters for scans; it might work for the columns that are not part
of the primary key 
- there's already code in Kudu for several compression codecs (LZ4, gzip, etc); I thought
it would be possible to add similar code for encryption codecs (to be applied after the compression,
of course) 
- the WAL log files and delta files should be similarly encrypted too 
- not sure what would be the best way to manage the key - I see that in HDFS they use a double
key mechanism, where the encryption key for the data file is itself encrypted with the allowed
user key and this whole process is managed by an external Key Management Service 
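
The double key mechanism in the last point can be sketched as follows. This is a toy model only: `ToyKMS` and its methods are hypothetical stand-ins for an external key management service, and the cipher is a home-made XOR stream built from SHA-256 that is NOT secure.

```python
import hashlib
import os


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 in counter mode); it is its own inverse."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out += bytes(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


class ToyKMS:
    """Stand-in for an external key management service; it alone holds the outer key."""

    def __init__(self):
        self._outer_key = os.urandom(32)

    def generate_data_key(self):
        data_key = os.urandom(32)
        # Returns (plaintext data key, data key encrypted under the outer key).
        return data_key, toy_cipher(self._outer_key, data_key)

    def decrypt_data_key(self, encrypted_data_key: bytes) -> bytes:
        return toy_cipher(self._outer_key, encrypted_data_key)


kms = ToyKMS()
data_key, encrypted_data_key = kms.generate_data_key()

# Only the encrypted data key is stored next to the encrypted data.
stored_file = {"edek": encrypted_data_key, "data": toy_cipher(data_key, b"cell values ...")}
del data_key

# A reader asks the KMS to unwrap the data key, then decrypts locally.
unwrapped = kms.decrypt_data_key(stored_file["edek"])
recovered = toy_cipher(unwrapped, stored_file["data"])
print(recovered)
```

In a real deployment the KMS would also check whether the caller is allowed to unwrap the key, which this sketch leaves out.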

Thanks in advance for your ideas and suggestions, 



Todd Lipcon 
Software Engineer, Cloudera 
