incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthias Pfau <p...@l3s.de>
Subject Re: Storing pre-sorted data
Date Mon, 17 Oct 2011 09:43:28 GMT
Thanks for that hint! However, it seems like soundex is a very language 
specific algorithm (US English). We have to get into this topic further...

Kind regards
Matthias

On 10/13/2011 10:43 PM, Stephen Connolly wrote:
> Then just use a soundex function on the first word in the text... that
> will shrink it sufficiently and give nice buckets in near sequential
> order (http://en.wikipedia.org/wiki/Soundex)
>
> On 13 October 2011 21:21, Matthias Pfau<pfau@l3s.de>  wrote:
>> Hi Stephen,
>> we are hashing the first 8 byte (8 US-ASCII characters) of text that has
>> been written by humans. Wouldn't it be easy for the attacker to do a
>> dictionary attack on this text, especially if he knows the language of the
>> text?
>>
>> Kind regards
>> Matthias
>>
>> On 10/13/2011 08:20 PM, Stephen Connolly wrote:
>>>
>>> in theory, however they have less than 32 bits of entropy from which
>>> they can do that, leaving them with at least 32 more bits of
>>> combinations to try... that's 2 billion or so... must be a big dictionary
>>>
>>> - Stephen
>>>
>>> ---
>>> Sent from my Android phone, so random spelling mistakes, random nonsense
>>> words and other nonsense are a direct result of using swype to type on
>>> the screen
>>>
>>> On 13 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de<mailto:pfau@l3s.de>>
>>> wrote:
>>>
>>>     Hi Stephen,
>>>     this sounds very reasonable. But wouldn't this enable an attacker to
>>>     execute dictionary attacks in order to "decrypt" the first 8 bytes
>>>     of the plain text?
>>>
>>>     Kind regards
>>>     Matthias
>>>
>>>     On 10/13/2011 05:03 PM, Stephen Connolly wrote:
>>>
>>>         It wouldn't be unencrypted... which is the point
>>>
>>>         you use a one way linear hash function to take the first, say 8
>>>         bytes,
>>>         of unencrypted data and turn it into 4 bytes of a sort prefix.
>>>
>>>         You've used lost half the data in the process, so effectively
>>>         each bit
>>>         is an OR of two bits and you can only infer from 0 values... so
>>> data
>>>         is still encrypted, but you have an approximate sorting.
>>>
>>>         For example, if your data is US-ASCII text with no numbers, you
>>>         could
>>>         use Soundex to get the pre-key, so that worst case you have a
>>> bucket
>>>         of values in the range.
>>>
>>>         Using this technique, a random get will have to get the values
>>>         at the
>>>         desired prefix +/- a small amount rather than the whole row...
>>>         on the
>>>         client side you can then decrypt the data and sort that small
>>> bucket
>>>         to get the correct index position.
>>>
>>>         You could do a 1 byte prefix, but that only gives you at best 256
>>>         buckets and assumes that the first 2 bytes are uniformly
>>>         distributed... you've said your data is not uniformly
>>>         distributed, so
>>>         a linear hash function sounds like your best bet.
>>>
>>>         your hash function should have the property that hash(A)>=
>>>         hash(B) if
>>>         and only if A>= B
>>>
>>>         On 13 October 2011 08:47, Matthias Pfau<pfau@l3s.de
>>>         <mailto:pfau@l3s.de>>    wrote:
>>>
>>>             Hi Stephen,
>>>             this is a great idea but unfortunately doesn't work for us
>>>             either as we can
>>>             not store the data in an unencrypted form.
>>>
>>>             Kind regards
>>>             Matthias
>>>
>>>             On 10/12/2011 07:42 PM, Stephen Connolly wrote:
>>>
>>>
>>>                 could you prefix the data with 3-4 bytes of a linear
>>>                 hash of the
>>>                 unencypted data? it wouldn't be a perfect sort, but
>>>                 you'd have less of a
>>>                 range to query to get the sorted values?
>>>
>>>                 - Stephen
>>>
>>>                 ---
>>>                 Sent from my Android phone, so random spelling mistakes,
>>>                 random nonsense
>>>                 words and other nonsense are a direct result of using
>>>                 swype to type on
>>>                 the screen
>>>
>>>                 On 12 Oct 2011 17:57, "Matthias Pfau"<pfau@l3s.de
>>>                 <mailto:pfau@l3s.de><mailto:pfau@__l3s.de
>>>                 <mailto:pfau@l3s.de>>>
>>>                 wrote:
>>>
>>>                     Unfortunately, that is not an option as we have to
>>>                 store the data in
>>>                     an compressed and encrypted and therefore binary and
>>>                 non-sortable form.
>>>
>>>                     On 10/12/2011 06:39 PM, David McNelis wrote:
>>>
>>>                         Is it an option to not convert the data to
>>>                 binary prior to
>>>                 inserting
>>>                         into Cassandra?  Also, how large are the strings
>>>                 you're sorting?
>>>                           If its
>>>                         viable to not convert to binary before writing
>>>                 to Cassandra, and
>>>                         you use
>>>                         one of the string based column ordering
>>>                 techniques (utf8, ascii,
>>>                 for
>>>                         example), then the data would be sorted without
>>>                 you  needing to
>>>                         specifically worry about that.  Of course, if
>>>                 the strings are
>>>                         lengthy
>>>                         you could run into  additional issues.
>>>
>>>                         On Wed, Oct 12, 2011 at 11:34 AM, Matthias
>>>                 Pfau<pfau@l3s.de<mailto:pfau@l3s.de>
>>>                 <mailto:pfau@l3s.de<mailto:pfau@l3s.de>>
>>>                 <mailto:pfau@l3s.de
>>>                 <mailto:pfau@l3s.de><mailto:pfa__u@l3s.de
>>>                 <mailto:pfau@l3s.de>>>>    wrote:
>>>
>>>                             Hi there,
>>>                             we are currently building a prototype based
>>>                 on cassandra and
>>>                         came
>>>                             into problems on implementing sorted lists
>>>                 containing
>>>                         millions of items.
>>>
>>>                             The special thing about the items of our
>>>                 lists is, that
>>>                         cassandra is
>>>                             not able to sort them as the data is stored
>>>                 in a binary
>>>                         format which
>>>                             is not sortable. However, we are able to
>>>                 sort the data
>>>                         before the
>>>                             plain data gets encoded (our application is
>>>                 responsible for
>>>                         the order).
>>>
>>>                             First Approach: Storing Lists in ColumnFamilies
>>>                             ***
>>>                             We first tried to map the list to a single
>>>                 row of a
>>>                         ColumnFamily in
>>>                             a way that the index of the list is mapped
>>>                 to the column
>>>                         names and
>>>                             the items of the list to the column values.
>>>                 The column names
>>>                 are
>>>                             increasing numbers which define the sort order.
>>>                             This has the major drawback that big parts
>>>                 of the list have
>>>                         to be
>>>                             rewritten on inserts (because the column
>>>                 names are numbered
>>>                         by their
>>>                             index), which are quite common.
>>>
>>>
>>>                             Second Approach: Storing the whole List as
>>>                 Binary Data:
>>>                             ***
>>>                             We tried to store the compressed list in a
>>>                 single column.
>>>                         However,
>>>                             this is only feasible for smaller lists. Our
>>>                 lists are far
>>>                         to big
>>>                             leading to multi megabyte reads and writes.
>>>                 As we need to
>>>                         read and
>>>                             update the lists quite often, this would put
>>>                 our Cassandra
>>>                         cluster
>>>                             under a lot of pressure.
>>>
>>>                             Ideal Solution: Native support for storing
>>> lists
>>>                             ***
>>>                             We would be very happy with a way to store a
>>>                 list of sorted
>>>                         values
>>>                             without making improper use of column names
>>>                 for the list
>>>                         index. This
>>>                             implies that we would need a possibility to
>>>                 insert values at
>>>                         defined
>>>                             positions. We know that this could lead to
>>>                 problems with
>>>                         concurrent
>>>                             inserts in a distributed environment, but
>>>                 this is handled by
>>>                 our
>>>                             application logic.
>>>
>>>
>>>                             What are your ideas on that?
>>>
>>>                             Thanks
>>>                             Matthias
>>>
>>>
>>>
>>>
>>>                         --
>>>                         *David McNelis*
>>>                         Lead Software Engineer
>>>                         Agentis Energy
>>>                 www.agentisenergy.com
>>>
>>>   <http://www.agentisenergy.com><http://__www.agentisenergy.com
>>>                 <http://www.agentisenergy.com>>
>>>                 <http://www.agentisenergy.com>
>>>                         c: 219.384.5143
>>>                 <tel:219.384.5143><tel:219.384.5143<tel:219.384.5143>>
>>>
>>>                         /A Smart Grid technology company focused on
>>>                 helping consumers of
>>>                         energy
>>>                         control an often under-managed resource./
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>


Mime
View raw message