cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Connolly <stephen.alan.conno...@gmail.com>
Subject Re: Storing pre-sorted data
Date Fri, 21 Oct 2011 15:52:37 GMT
Well you could use a DOUBLE value to encode relative positions...

first item in list gets key 1.0
insert before first item -> key[first]/2.0;
append after last item -> key[last]*2.0;
insert after non-final item -> (key[n]+key[n+1])/2.0

Using double precision should give you quite a space to fit items...

you should be able to cleanly do 2^10 appends, or 2^10 insert first
before hitting the significands... and even then you have 2^51 of
those...

If you start to hit issues like heavy segments, you can re-normalize the row.

-Stephen

On 17 October 2011 10:43, Matthias Pfau <pfau@l3s.de> wrote:
> Thanks for that hint! However, it seems like soundex is a very language
> specific algorithm (US English). We have to get into this topic further...
>
> Kind regards
> Matthias
>
> On 10/13/2011 10:43 PM, Stephen Connolly wrote:
>>
>> Then just use a soundex function on the first word in the text... that
>> will shrink it sufficiently and give nice buckets in near sequential
>> order (http://en.wikipedia.org/wiki/Soundex)
>>
>> On 13 October 2011 21:21, Matthias Pfau<pfau@l3s.de>  wrote:
>>>
>>> Hi Stephen,
>>> we are hashing the first 8 byte (8 US-ASCII characters) of text that has
>>> been written by humans. Wouldn't it be easy for the attacker to do a
>>> dictionary attack on this text, especially if he knows the language of
>>> the
>>> text?
>>>
>>> Kind regards
>>> Matthias
>>>
>>> On 10/13/2011 08:20 PM, Stephen Connolly wrote:
>>>>
>>>> in theory, however they have less than 32 bits of entropy from which
>>>> they can do that, leaving them with at least 32 more bits of
>>>> combinations to try... that's 2 billion or so... must be a big
>>>> dictionary
>>>>
>>>> - Stephen
>>>>
>>>> ---
>>>> Sent from my Android phone, so random spelling mistakes, random nonsense
>>>> words and other nonsense are a direct result of using swype to type on
>>>> the screen
>>>>
>>>> On 13 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de<mailto:pfau@l3s.de>>
>>>> wrote:
>>>>
>>>>    Hi Stephen,
>>>>    this sounds very reasonable. But wouldn't this enable an attacker to
>>>>    execute dictionary attacks in order to "decrypt" the first 8 bytes
>>>>    of the plain text?
>>>>
>>>>    Kind regards
>>>>    Matthias
>>>>
>>>>    On 10/13/2011 05:03 PM, Stephen Connolly wrote:
>>>>
>>>>        It wouldn't be unencrypted... which is the point
>>>>
>>>>        you use a one way linear hash function to take the first, say
8
>>>>        bytes,
>>>>        of unencrypted data and turn it into 4 bytes of a sort prefix.
>>>>
>>>>        You've used lost half the data in the process, so effectively
>>>>        each bit
>>>>        is an OR of two bits and you can only infer from 0 values... so
>>>> data
>>>>        is still encrypted, but you have an approximate sorting.
>>>>
>>>>        For example, if your data is US-ASCII text with no numbers, you
>>>>        could
>>>>        use Soundex to get the pre-key, so that worst case you have a
>>>> bucket
>>>>        of values in the range.
>>>>
>>>>        Using this technique, a random get will have to get the values
>>>>        at the
>>>>        desired prefix +/- a small amount rather than the whole row...
>>>>        on the
>>>>        client side you can then decrypt the data and sort that small
>>>> bucket
>>>>        to get the correct index position.
>>>>
>>>>        You could do a 1 byte prefix, but that only gives you at best
256
>>>>        buckets and assumes that the first 2 bytes are uniformly
>>>>        distributed... you've said your data is not uniformly
>>>>        distributed, so
>>>>        a linear hash function sounds like your best bet.
>>>>
>>>>        your hash function should have the property that hash(A)>=
>>>>        hash(B) if
>>>>        and only if A>= B
>>>>
>>>>        On 13 October 2011 08:47, Matthias Pfau<pfau@l3s.de
>>>>        <mailto:pfau@l3s.de>>    wrote:
>>>>
>>>>            Hi Stephen,
>>>>            this is a great idea but unfortunately doesn't work for
us
>>>>            either as we can
>>>>            not store the data in an unencrypted form.
>>>>
>>>>            Kind regards
>>>>            Matthias
>>>>
>>>>            On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>>>>
>>>>
>>>>                could you prefix the data with 3-4 bytes of a linear
>>>>                hash of the
>>>>                unencypted data? it wouldn't be a perfect sort, but
>>>>                you'd have less of a
>>>>                range to query to get the sorted values?
>>>>
>>>>                - Stephen
>>>>
>>>>                ---
>>>>                Sent from my Android phone, so random spelling mistakes,
>>>>                random nonsense
>>>>                words and other nonsense are a direct result of using
>>>>                swype to type on
>>>>                the screen
>>>>
>>>>                On 12 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de
>>>>                <mailto:pfau@l3s.de><mailto:pfau@__l3s.de
>>>>                <mailto:pfau@l3s.de>>>
>>>>                wrote:
>>>>
>>>>                    Unfortunately, that is not an option as we have
to
>>>>                store the data in
>>>>                    an compressed and encrypted and therefore binary
and
>>>>                non-sortable form.
>>>>
>>>>                    On 10/12/2011 06:39 PM, David McNelis wrote:
>>>>
>>>>                        Is it an option to not convert the data
to
>>>>                binary prior to
>>>>                inserting
>>>>                        into Cassandra?  Also, how large are
the strings
>>>>                you're sorting?
>>>>                          If its
>>>>                        viable to not convert to binary before
writing
>>>>                to Cassandra, and
>>>>                        you use
>>>>                        one of the string based column ordering
>>>>                techniques (utf8, ascii,
>>>>                for
>>>>                        example), then the data would be sorted
without
>>>>                you  needing to
>>>>                        specifically worry about that.  Of course,
if
>>>>                the strings are
>>>>                        lengthy
>>>>                        you could run into  additional issues.
>>>>
>>>>                        On Wed, Oct 12, 2011 at 11:34 AM, Matthias
>>>>                Pfau<pfau@l3s.de<mailto:pfau@l3s.de>
>>>>                <mailto:pfau@l3s.de<mailto:pfau@l3s.de>>
>>>>                <mailto:pfau@l3s.de
>>>>                <mailto:pfau@l3s.de><mailto:pfa__u@l3s.de
>>>>                <mailto:pfau@l3s.de>>>>    wrote:
>>>>
>>>>                            Hi there,
>>>>                            we are currently building a prototype
based
>>>>                on cassandra and
>>>>                        came
>>>>                            into problems on implementing sorted
lists
>>>>                containing
>>>>                        millions of items.
>>>>
>>>>                            The special thing about the items
of our
>>>>                lists is, that
>>>>                        cassandra is
>>>>                            not able to sort them as the data
is stored
>>>>                in a binary
>>>>                        format which
>>>>                            is not sortable. However, we are
able to
>>>>                sort the data
>>>>                        before the
>>>>                            plain data gets encoded (our application
is
>>>>                responsible for
>>>>                        the order).
>>>>
>>>>                            First Approach: Storing Lists in
>>>> ColumnFamilies
>>>>                            ***
>>>>                            We first tried to map the list to
a single
>>>>                row of a
>>>>                        ColumnFamily in
>>>>                            a way that the index of the list
is mapped
>>>>                to the column
>>>>                        names and
>>>>                            the items of the list to the column
values.
>>>>                The column names
>>>>                are
>>>>                            increasing numbers which define
the sort
>>>> order.
>>>>                            This has the major drawback that
big parts
>>>>                of the list have
>>>>                        to be
>>>>                            rewritten on inserts (because the
column
>>>>                names are numbered
>>>>                        by their
>>>>                            index), which are quite common.
>>>>
>>>>
>>>>                            Second Approach: Storing the whole
List as
>>>>                Binary Data:
>>>>                            ***
>>>>                            We tried to store the compressed
list in a
>>>>                single column.
>>>>                        However,
>>>>                            this is only feasible for smaller
lists. Our
>>>>                lists are far
>>>>                        to big
>>>>                            leading to multi megabyte reads
and writes.
>>>>                As we need to
>>>>                        read and
>>>>                            update the lists quite often, this
would put
>>>>                our Cassandra
>>>>                        cluster
>>>>                            under a lot of pressure.
>>>>
>>>>                            Ideal Solution: Native support for
storing
>>>> lists
>>>>                            ***
>>>>                            We would be very happy with a way
to store a
>>>>                list of sorted
>>>>                        values
>>>>                            without making improper use of column
names
>>>>                for the list
>>>>                        index. This
>>>>                            implies that we would need a possibility
to
>>>>                insert values at
>>>>                        defined
>>>>                            positions. We know that this could
lead to
>>>>                problems with
>>>>                        concurrent
>>>>                            inserts in a distributed environment,
but
>>>>                this is handled by
>>>>                our
>>>>                            application logic.
>>>>
>>>>
>>>>                            What are your ideas on that?
>>>>
>>>>                            Thanks
>>>>                            Matthias
>>>>
>>>>
>>>>
>>>>
>>>>                        --
>>>>                        *David McNelis*
>>>>                        Lead Software Engineer
>>>>                        Agentis Energy
>>>>                www.agentisenergy.com
>>>>
>>>>  <http://www.agentisenergy.com><http://__www.agentisenergy.com
>>>>                <http://www.agentisenergy.com>>
>>>>                <http://www.agentisenergy.com>
>>>>                        c: 219.384.5143
>>>>                <tel:219.384.5143><tel:219.384.5143<tel:219.384.5143>>
>>>>
>>>>                        /A Smart Grid technology company focused
on
>>>>                helping consumers of
>>>>                        energy
>>>>                        control an often under-managed resource./
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>

Mime
View raw message