lucene-dev mailing list archives

From "Renaud Delbru (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
Date Fri, 08 Jan 2016 10:12:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089007#comment-15089007
] 

Renaud Delbru commented on LUCENE-6966:
---------------------------------------

I agree with you that if we add encryption to Lucene, it should always be secure. That's why
I opened up the discussion with the community in order to review and agree on which approach
to adopt.
With respect to IV reuse with CBC mode, a potential leak of information occurs when two messages
share a common prefix, as it will reveal the presence and length of that prefix.
Now if we look at each format separately and at what type of message is encrypted in each
one, we can assess the risk:
- Term Dictionary Index: the entire term dictionary index in a segment will be encrypted as
one single message - risk is null
- Term Dictionary Data: each suffixes bytes blob is encrypted as one message - I would assume
that the probability of having two suffixes bytes blobs sharing the same prefix or being identical
is pretty low. But I might be wrong.
- Stored Fields Format: each compressed doc chunk is encrypted as one message - two doc chunks
can contain the exact same data (e.g., if multiple documents contain the exact same fields
and values). This is more likely to happen, but it sounds more like an edge case.
- Term Vectors: each compressed terms and payloads bytes blob of a doc chunk is encrypted as one
message - same issue as with Stored Fields Format
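
To make the common-prefix leak concrete, here is a minimal sketch using the standard JCA `Cipher` API. It is not taken from the patch: the class name, the all-zero key and the all-zero IV are assumptions chosen purely for illustration. It encrypts messages under one fixed key and IV, and counts how many leading 16-byte ciphertext blocks two messages have in common.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

// Illustration only: hypothetical class, not part of the proposed codec.
final class CbcIvReuseDemo {
    // Assumed fixed key and IV (all zeros) so that every message reuses the same IV.
    static final byte[] KEY = new byte[16];
    static final byte[] IV = new byte[16];

    // Encrypts one message with AES/CBC under the shared key and the reused IV.
    static byte[] encrypt(byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return cipher.doFinal(plaintext);
    }

    // Counts the leading 16-byte blocks that are identical in both ciphertexts.
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int blocks = Math.min(a.length, b.length) / 16;
        int n = 0;
        while (n < blocks && Arrays.equals(
                Arrays.copyOfRange(a, n * 16, (n + 1) * 16),
                Arrays.copyOfRange(b, n * 16, (n + 1) * 16))) {
            n++;
        }
        return n;
    }
}
```

For example, encrypting `0123456789abcdef0123456789abcdefAAAA` and `0123456789abcdef0123456789abcdefBBBB` (a shared 32-byte prefix, then different data) gives two ciphertexts for which `sharedLeadingBlocks` returns 2. Someone who can read the index files therefore learns that the two plaintexts start with the same two blocks, without knowing the key.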

The risk of reusing the IV seems to reside in Stored Fields / Term Vectors. If that risk is not
acceptable, one solution is to add a randomly generated header to each compressed doc chunk that
will serve as a unique IV. What do you think?
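
A rough sketch of that idea, under assumptions of mine rather than anything in the patch: the random header is a fresh 16-byte IV drawn from `SecureRandom` and stored in clear in front of the encrypted chunk. The class and method names are hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

// Illustration only: hypothetical helper, not the codec's actual chunk format.
final class RandomIvChunk {
    static final int IV_LEN = 16; // AES block size in bytes
    static final SecureRandom RANDOM = new SecureRandom();

    // Returns [random IV header | AES/CBC ciphertext of the chunk].
    static byte[] encryptChunk(byte[] key, byte[] chunk) throws Exception {
        byte[] iv = new byte[IV_LEN];
        RANDOM.nextBytes(iv); // fresh IV for every chunk
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        byte[] body = cipher.doFinal(chunk);
        byte[] out = new byte[IV_LEN + body.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(body, 0, out, IV_LEN, body.length);
        return out;
    }

    // Reads the IV back from the header and decrypts the rest of the stored bytes.
    static byte[] decryptChunk(byte[] key, byte[] stored) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(stored, 0, IV_LEN));
        return cipher.doFinal(stored, IV_LEN, stored.length - IV_LEN);
    }
}
```

With this layout, two identical doc chunks no longer produce identical ciphertexts. The cost is 16 extra bytes per chunk plus one `SecureRandom` call at write time.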

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
>                 Key: LUCENE-6966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6966
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/other
>            Reporter: Renaud Delbru
>              Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive data in
the index. It has been developed as part of an engagement with a customer, and we think that
it could be of interest to the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system encryption,
index output / directory encryption), encryption at a codec level enables more fine-grained
control over which blocks of data are encrypted. This is more efficient since less data has to
be encrypted. It also gives more flexibility, such as the ability to select which fields to
encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a new version
of their encryption key. Multiple key versions should co-exist in one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this will be submitted
as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest key version
available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have multiple
segments, each one encrypted using a different key version. The key version for a segment
is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an implementation
of the cipher factory. The cipher factory is responsible for creating a cipher instance
based on a given key version.
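
As an illustration only, the relationship between key versions and the cipher factory could be sketched as follows. The names `VersionedCipherFactory`, `addKey`, `latestVersion` and `newCipher` are hypothetical and are not the API of the patch; the sketch also assumes the caller supplies the IV.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.TreeMap;

// Illustration only: a hypothetical factory that maps key versions to AES keys.
final class VersionedCipherFactory {
    // Sorted so that the highest registered version is easy to find.
    private final TreeMap<Long, byte[]> keysByVersion = new TreeMap<>();

    // Registers the key material for one key version.
    void addKey(long version, byte[] key) {
        keysByVersion.put(version, key.clone());
    }

    // Version to use for newly written segments (throws if no key is registered).
    long latestVersion() {
        return keysByVersion.lastKey();
    }

    // Creates a cipher for the key version recorded for a segment.
    // mode is Cipher.ENCRYPT_MODE or Cipher.DECRYPT_MODE.
    Cipher newCipher(long keyVersion, int mode, byte[] iv) throws GeneralSecurityException {
        byte[] key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher;
    }
}
```

In such a design, a reader would look up the key version stored with the segment and request a cipher for exactly that version, while a writer (or an index upgrader) would use `latestVersion()`.
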
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation vector (IV) is
reused for performance reasons, but only on a per format and per segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somewhat resilient
to IV reuse. The only "leak" of information that this could lead to is being able to know
that two encrypted blocks of data start with the same prefix. However, it is unlikely that
two data blocks in an index segment will start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block (~4kb) of one
or more documents. It is unlikely that two compressed blocks start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of terms and payloads
from one or more documents. It is unlikely that two compressed blocks start with the same
data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted in one single
data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set of suffixes.
It is unlikely to have two dictionary data blocks sharing the same prefix within the same
segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data blocks. It
is unlikely to have two data blocks sharing the same prefix within the same segment (each
one encodes a list of values associated with a field).
> To the best of our knowledge, this model should be safe. However, it would be good if
someone with security expertise in the community could review and validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on Lucene 4.x.
The benchmark was performed on the Wikipedia dataset where all the fields (id, title, body,
date) were encrypted. Only the block tree terms and compressed stored fields format were tested
at that time. 
> h2. Indexing
> The indexing throughput decreased slightly and is roughly 15% lower than with the base
Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the following
queries: Term query, boolean query, phrase query, numeric range query. 
> We observed the following performance impact for queries that need to scan a larger
portion of the term dictionary:
> - prefix query: decrease of ~25%
> - wildcard query (e.g., “fu*r”): decrease of ~60%
> - fuzzy query (distance 1): decrease of ~40%
> - fuzzy query (distance 2): decrease of ~80%
> We can see that the decrease in performance is relative to the size of the dictionary
scan.
> h2. Document Retrieval
> We observed a decrease in performance that is relative to the size of the set of documents
to be retrieved:
> - ~20% when retrieving a medium set of documents (100)
> - ~30-40% when retrieving a large set of documents (1000)
> h1. Known Limitations
> - compressed stored fields do not keep the order of fields, since non-encrypted and encrypted
fields are stored in separate blocks.
> - the current implementation of the cipher factory does not enforce the use of AES/CBC.
We are planning to add this to the final version of the patch.
> - the current implementation does not change the IV per segment. We are planning to add
this to the final version of the patch.
> - the current implementation of compressed stored fields decrypts a full compressed block
even if only a small portion is decompressed (high impact when storing very small documents). We
are planning to optimise this in the final version of the patch. The overall document
retrieval performance might improve with this optimisation.
> The codec has been implemented as a contrib. Given that most of the classes were final,
we had to copy most of the original code from the extended formats. At a later stage, we could
think of opening some of these classes to extend them properly in order to reduce code duplication
and simplify code maintenance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

