cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthias Pfau <>
Subject Re: Storing pre-sorted data
Date Thu, 13 Oct 2011 16:57:14 GMT
Hi Stephen,
this sounds very reasonable. But wouldn't this enable an attacker to 
execute dictionary attacks in order to "decrypt" the first 8 bytes of 
the plain text?

Kind regards

On 10/13/2011 05:03 PM, Stephen Connolly wrote:
> It wouldn't be unencrypted... which is the point
> you use a one way linear hash function to take the first, say 8 bytes,
> of unencrypted data and turn it into 4 bytes of a sort prefix.
> You've used lost half the data in the process, so effectively each bit
> is an OR of two bits and you can only infer from 0 values... so data
> is still encrypted, but you have an approximate sorting.
> For example, if your data is US-ASCII text with no numbers, you could
> use Soundex to get the pre-key, so that worst case you have a bucket
> of values in the range.
> Using this technique, a random get will have to get the values at the
> desired prefix +/- a small amount rather than the whole row... on the
> client side you can then decrypt the data and sort that small bucket
> to get the correct index position.
> You could do a 1 byte prefix, but that only gives you at best 256
> buckets and assumes that the first 2 bytes are uniformly
> distributed... you've said your data is not uniformly distributed, so
> a linear hash function sounds like your best bet.
> your hash function should have the property that hash(A)>= hash(B) if
> and only if A>= B
> On 13 October 2011 08:47, Matthias Pfau<>  wrote:
>> Hi Stephen,
>> this is a great idea but unfortunately doesn't work for us either as we can
>> not store the data in an unencrypted form.
>> Kind regards
>> Matthias
>> On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>>> could you prefix the data with 3-4 bytes of a linear hash of the
>>> unencypted data? it wouldn't be a perfect sort, but you'd have less of a
>>> range to query to get the sorted values?
>>> - Stephen
>>> ---
>>> Sent from my Android phone, so random spelling mistakes, random nonsense
>>> words and other nonsense are a direct result of using swype to type on
>>> the screen
>>> On 12 Oct 2011 17:57, "Matthias Pfau"<<>>
>>> wrote:
>>>     Unfortunately, that is not an option as we have to store the data in
>>>     an compressed and encrypted and therefore binary and non-sortable form.
>>>     On 10/12/2011 06:39 PM, David McNelis wrote:
>>>         Is it an option to not convert the data to binary prior to
>>> inserting
>>>         into Cassandra?  Also, how large are the strings you're sorting?
>>>           If its
>>>         viable to not convert to binary before writing to Cassandra, and
>>>         you use
>>>         one of the string based column ordering techniques (utf8, ascii,
>>> for
>>>         example), then the data would be sorted without you  needing to
>>>         specifically worry about that.  Of course, if the strings are
>>>         lengthy
>>>         you could run into  additional issues.
>>>         On Wed, Oct 12, 2011 at 11:34 AM, Matthias Pfau<
>>>         <>
>>>         <<>>>  wrote:
>>>             Hi there,
>>>             we are currently building a prototype based on cassandra and
>>>         came
>>>             into problems on implementing sorted lists containing
>>>         millions of items.
>>>             The special thing about the items of our lists is, that
>>>         cassandra is
>>>             not able to sort them as the data is stored in a binary
>>>         format which
>>>             is not sortable. However, we are able to sort the data
>>>         before the
>>>             plain data gets encoded (our application is responsible for
>>>         the order).
>>>             First Approach: Storing Lists in ColumnFamilies
>>>             ***
>>>             We first tried to map the list to a single row of a
>>>         ColumnFamily in
>>>             a way that the index of the list is mapped to the column
>>>         names and
>>>             the items of the list to the column values. The column names
>>> are
>>>             increasing numbers which define the sort order.
>>>             This has the major drawback that big parts of the list have
>>>         to be
>>>             rewritten on inserts (because the column names are numbered
>>>         by their
>>>             index), which are quite common.
>>>             Second Approach: Storing the whole List as Binary Data:
>>>             ***
>>>             We tried to store the compressed list in a single column.
>>>         However,
>>>             this is only feasible for smaller lists. Our lists are far
>>>         to big
>>>             leading to multi megabyte reads and writes. As we need to
>>>         read and
>>>             update the lists quite often, this would put our Cassandra
>>>         cluster
>>>             under a lot of pressure.
>>>             Ideal Solution: Native support for storing lists
>>>             ***
>>>             We would be very happy with a way to store a list of sorted
>>>         values
>>>             without making improper use of column names for the list
>>>         index. This
>>>             implies that we would need a possibility to insert values at
>>>         defined
>>>             positions. We know that this could lead to problems with
>>>         concurrent
>>>             inserts in a distributed environment, but this is handled by
>>> our
>>>             application logic.
>>>             What are your ideas on that?
>>>             Thanks
>>>             Matthias
>>>         --
>>>         *David McNelis*
>>>         Lead Software Engineer
>>>         Agentis Energy
>>>         <>
>>>         c: 219.384.5143<tel:219.384.5143>
>>>         /A Smart Grid technology company focused on helping consumers of
>>>         energy
>>>         control an often under-managed resource./

View raw message