cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Connolly <stephen.alan.conno...@gmail.com>
Subject Re: Storing pre-sorted data
Date Thu, 13 Oct 2011 18:20:39 GMT
in theory, however they have less than 32 bits of entropy from which they
can do that, leaving them with at least 32 more bits of combinations to
try... that's 2 billion or so... must be a big dictionary

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 13 Oct 2011 17:57, "Matthias Pfau" <pfau@l3s.de> wrote:

> Hi Stephen,
> this sounds very reasonable. But wouldn't this enable an attacker to
> execute dictionary attacks in order to "decrypt" the first 8 bytes of the
> plain text?
>
> Kind regards
> Matthias
>
> On 10/13/2011 05:03 PM, Stephen Connolly wrote:
>
>> It wouldn't be unencrypted... which is the point
>>
>> you use a one way linear hash function to take the first, say 8 bytes,
>> of unencrypted data and turn it into 4 bytes of a sort prefix.
>>
>> You've used lost half the data in the process, so effectively each bit
>> is an OR of two bits and you can only infer from 0 values... so data
>> is still encrypted, but you have an approximate sorting.
>>
>> For example, if your data is US-ASCII text with no numbers, you could
>> use Soundex to get the pre-key, so that worst case you have a bucket
>> of values in the range.
>>
>> Using this technique, a random get will have to get the values at the
>> desired prefix +/- a small amount rather than the whole row... on the
>> client side you can then decrypt the data and sort that small bucket
>> to get the correct index position.
>>
>> You could do a 1 byte prefix, but that only gives you at best 256
>> buckets and assumes that the first 2 bytes are uniformly
>> distributed... you've said your data is not uniformly distributed, so
>> a linear hash function sounds like your best bet.
>>
>> your hash function should have the property that hash(A)>= hash(B) if
>> and only if A>= B
>>
>> On 13 October 2011 08:47, Matthias Pfau<pfau@l3s.de>  wrote:
>>
>>> Hi Stephen,
>>> this is a great idea but unfortunately doesn't work for us either as we
>>> can
>>> not store the data in an unencrypted form.
>>>
>>> Kind regards
>>> Matthias
>>>
>>> On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>>>
>>>>
>>>> could you prefix the data with 3-4 bytes of a linear hash of the
>>>> unencypted data? it wouldn't be a perfect sort, but you'd have less of a
>>>> range to query to get the sorted values?
>>>>
>>>> - Stephen
>>>>
>>>> ---
>>>> Sent from my Android phone, so random spelling mistakes, random nonsense
>>>> words and other nonsense are a direct result of using swype to type on
>>>> the screen
>>>>
>>>> On 12 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de<mailto:pfau@**l3s.de<pfau@l3s.de>
>>>> >>
>>>> wrote:
>>>>
>>>>    Unfortunately, that is not an option as we have to store the data in
>>>>    an compressed and encrypted and therefore binary and non-sortable
>>>> form.
>>>>
>>>>    On 10/12/2011 06:39 PM, David McNelis wrote:
>>>>
>>>>        Is it an option to not convert the data to binary prior to
>>>> inserting
>>>>        into Cassandra?  Also, how large are the strings you're sorting?
>>>>          If its
>>>>        viable to not convert to binary before writing to Cassandra, and
>>>>        you use
>>>>        one of the string based column ordering techniques (utf8, ascii,
>>>> for
>>>>        example), then the data would be sorted without you  needing to
>>>>        specifically worry about that.  Of course, if the strings are
>>>>        lengthy
>>>>        you could run into  additional issues.
>>>>
>>>>        On Wed, Oct 12, 2011 at 11:34 AM, Matthias Pfau<pfau@l3s.de
>>>>        <mailto:pfau@l3s.de>
>>>>        <mailto:pfau@l3s.de<mailto:pfa**u@l3s.de <pfau@l3s.de>>>>
>>>>  wrote:
>>>>
>>>>            Hi there,
>>>>            we are currently building a prototype based on cassandra and
>>>>        came
>>>>            into problems on implementing sorted lists containing
>>>>        millions of items.
>>>>
>>>>            The special thing about the items of our lists is, that
>>>>        cassandra is
>>>>            not able to sort them as the data is stored in a binary
>>>>        format which
>>>>            is not sortable. However, we are able to sort the data
>>>>        before the
>>>>            plain data gets encoded (our application is responsible for
>>>>        the order).
>>>>
>>>>            First Approach: Storing Lists in ColumnFamilies
>>>>            ***
>>>>            We first tried to map the list to a single row of a
>>>>        ColumnFamily in
>>>>            a way that the index of the list is mapped to the column
>>>>        names and
>>>>            the items of the list to the column values. The column names
>>>> are
>>>>            increasing numbers which define the sort order.
>>>>            This has the major drawback that big parts of the list have
>>>>        to be
>>>>            rewritten on inserts (because the column names are numbered
>>>>        by their
>>>>            index), which are quite common.
>>>>
>>>>
>>>>            Second Approach: Storing the whole List as Binary Data:
>>>>            ***
>>>>            We tried to store the compressed list in a single column.
>>>>        However,
>>>>            this is only feasible for smaller lists. Our lists are far
>>>>        to big
>>>>            leading to multi megabyte reads and writes. As we need to
>>>>        read and
>>>>            update the lists quite often, this would put our Cassandra
>>>>        cluster
>>>>            under a lot of pressure.
>>>>
>>>>            Ideal Solution: Native support for storing lists
>>>>            ***
>>>>            We would be very happy with a way to store a list of sorted
>>>>        values
>>>>            without making improper use of column names for the list
>>>>        index. This
>>>>            implies that we would need a possibility to insert values at
>>>>        defined
>>>>            positions. We know that this could lead to problems with
>>>>        concurrent
>>>>            inserts in a distributed environment, but this is handled by
>>>> our
>>>>            application logic.
>>>>
>>>>
>>>>            What are your ideas on that?
>>>>
>>>>            Thanks
>>>>            Matthias
>>>>
>>>>
>>>>
>>>>
>>>>        --
>>>>        *David McNelis*
>>>>        Lead Software Engineer
>>>>        Agentis Energy
>>>>        www.agentisenergy.com<http://**www.agentisenergy.com<http://www.agentisenergy.com>
>>>> >
>>>>        <http://www.agentisenergy.com>
>>>>        c: 219.384.5143<tel:219.384.5143>
>>>>
>>>>        /A Smart Grid technology company focused on helping consumers of
>>>>        energy
>>>>        control an often under-managed resource./
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>

Mime
View raw message