cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthias Pfau <p...@l3s.de>
Subject Re: Storing pre-sorted data
Date Thu, 13 Oct 2011 20:21:08 GMT
Hi Stephen,
we are hashing the first 8 byte (8 US-ASCII characters) of text that has 
been written by humans. Wouldn't it be easy for the attacker to do a 
dictionary attack on this text, especially if he knows the language of 
the text?

Kind regards
Matthias

On 10/13/2011 08:20 PM, Stephen Connolly wrote:
> in theory, however they have less than 32 bits of entropy from which
> they can do that, leaving them with at least 32 more bits of
> combinations to try... that's 2 billion or so... must be a big dictionary
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on
> the screen
>
> On 13 Oct 2011 17:57, "Matthias Pfau" <pfau@l3s.de <mailto:pfau@l3s.de>>
> wrote:
>
>     Hi Stephen,
>     this sounds very reasonable. But wouldn't this enable an attacker to
>     execute dictionary attacks in order to "decrypt" the first 8 bytes
>     of the plain text?
>
>     Kind regards
>     Matthias
>
>     On 10/13/2011 05:03 PM, Stephen Connolly wrote:
>
>         It wouldn't be unencrypted... which is the point
>
>         you use a one way linear hash function to take the first, say 8
>         bytes,
>         of unencrypted data and turn it into 4 bytes of a sort prefix.
>
>         You've used lost half the data in the process, so effectively
>         each bit
>         is an OR of two bits and you can only infer from 0 values... so data
>         is still encrypted, but you have an approximate sorting.
>
>         For example, if your data is US-ASCII text with no numbers, you
>         could
>         use Soundex to get the pre-key, so that worst case you have a bucket
>         of values in the range.
>
>         Using this technique, a random get will have to get the values
>         at the
>         desired prefix +/- a small amount rather than the whole row...
>         on the
>         client side you can then decrypt the data and sort that small bucket
>         to get the correct index position.
>
>         You could do a 1 byte prefix, but that only gives you at best 256
>         buckets and assumes that the first 2 bytes are uniformly
>         distributed... you've said your data is not uniformly
>         distributed, so
>         a linear hash function sounds like your best bet.
>
>         your hash function should have the property that hash(A)>=
>         hash(B) if
>         and only if A>= B
>
>         On 13 October 2011 08:47, Matthias Pfau<pfau@l3s.de
>         <mailto:pfau@l3s.de>>  wrote:
>
>             Hi Stephen,
>             this is a great idea but unfortunately doesn't work for us
>             either as we can
>             not store the data in an unencrypted form.
>
>             Kind regards
>             Matthias
>
>             On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>
>
>                 could you prefix the data with 3-4 bytes of a linear
>                 hash of the
>                 unencypted data? it wouldn't be a perfect sort, but
>                 you'd have less of a
>                 range to query to get the sorted values?
>
>                 - Stephen
>
>                 ---
>                 Sent from my Android phone, so random spelling mistakes,
>                 random nonsense
>                 words and other nonsense are a direct result of using
>                 swype to type on
>                 the screen
>
>                 On 12 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de
>                 <mailto:pfau@l3s.de><mailto:pfau@__l3s.de
>                 <mailto:pfau@l3s.de>>>
>                 wrote:
>
>                     Unfortunately, that is not an option as we have to
>                 store the data in
>                     an compressed and encrypted and therefore binary and
>                 non-sortable form.
>
>                     On 10/12/2011 06:39 PM, David McNelis wrote:
>
>                         Is it an option to not convert the data to
>                 binary prior to
>                 inserting
>                         into Cassandra?  Also, how large are the strings
>                 you're sorting?
>                           If its
>                         viable to not convert to binary before writing
>                 to Cassandra, and
>                         you use
>                         one of the string based column ordering
>                 techniques (utf8, ascii,
>                 for
>                         example), then the data would be sorted without
>                 you  needing to
>                         specifically worry about that.  Of course, if
>                 the strings are
>                         lengthy
>                         you could run into  additional issues.
>
>                         On Wed, Oct 12, 2011 at 11:34 AM, Matthias
>                 Pfau<pfau@l3s.de <mailto:pfau@l3s.de>
>                 <mailto:pfau@l3s.de <mailto:pfau@l3s.de>>
>                 <mailto:pfau@l3s.de
>                 <mailto:pfau@l3s.de><mailto:pfa__u@l3s.de
>                 <mailto:pfau@l3s.de>>>>  wrote:
>
>                             Hi there,
>                             we are currently building a prototype based
>                 on cassandra and
>                         came
>                             into problems on implementing sorted lists
>                 containing
>                         millions of items.
>
>                             The special thing about the items of our
>                 lists is, that
>                         cassandra is
>                             not able to sort them as the data is stored
>                 in a binary
>                         format which
>                             is not sortable. However, we are able to
>                 sort the data
>                         before the
>                             plain data gets encoded (our application is
>                 responsible for
>                         the order).
>
>                             First Approach: Storing Lists in ColumnFamilies
>                             ***
>                             We first tried to map the list to a single
>                 row of a
>                         ColumnFamily in
>                             a way that the index of the list is mapped
>                 to the column
>                         names and
>                             the items of the list to the column values.
>                 The column names
>                 are
>                             increasing numbers which define the sort order.
>                             This has the major drawback that big parts
>                 of the list have
>                         to be
>                             rewritten on inserts (because the column
>                 names are numbered
>                         by their
>                             index), which are quite common.
>
>
>                             Second Approach: Storing the whole List as
>                 Binary Data:
>                             ***
>                             We tried to store the compressed list in a
>                 single column.
>                         However,
>                             this is only feasible for smaller lists. Our
>                 lists are far
>                         to big
>                             leading to multi megabyte reads and writes.
>                 As we need to
>                         read and
>                             update the lists quite often, this would put
>                 our Cassandra
>                         cluster
>                             under a lot of pressure.
>
>                             Ideal Solution: Native support for storing lists
>                             ***
>                             We would be very happy with a way to store a
>                 list of sorted
>                         values
>                             without making improper use of column names
>                 for the list
>                         index. This
>                             implies that we would need a possibility to
>                 insert values at
>                         defined
>                             positions. We know that this could lead to
>                 problems with
>                         concurrent
>                             inserts in a distributed environment, but
>                 this is handled by
>                 our
>                             application logic.
>
>
>                             What are your ideas on that?
>
>                             Thanks
>                             Matthias
>
>
>
>
>                         --
>                         *David McNelis*
>                         Lead Software Engineer
>                         Agentis Energy
>                 www.agentisenergy.com
>                 <http://www.agentisenergy.com><http://__www.agentisenergy.com
>                 <http://www.agentisenergy.com>>
>                 <http://www.agentisenergy.com>
>                         c: 219.384.5143
>                 <tel:219.384.5143><tel:219.384.5143 <tel:219.384.5143>>
>
>                         /A Smart Grid technology company focused on
>                 helping consumers of
>                         energy
>                         control an often under-managed resource./
>
>
>
>
>
>


Mime
View raw message