Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B69869F62 for ; Thu, 13 Oct 2011 20:21:57 +0000 (UTC) Received: (qmail 76746 invoked by uid 500); 13 Oct 2011 20:21:55 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 76718 invoked by uid 500); 13 Oct 2011 20:21:55 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 76710 invoked by uid 99); 13 Oct 2011 20:21:55 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Oct 2011 20:21:55 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=5.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [130.75.2.106] (HELO mrelay1.uni-hannover.de) (130.75.2.106) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Oct 2011 20:21:47 +0000 Received: from server1.l3s.uni-hannover.de (server1.l3s.uni-hannover.de [130.75.87.1]) by mrelay1.uni-hannover.de (8.14.4/8.14.4) with ESMTP id p9DKLLdw026812 for ; Thu, 13 Oct 2011 22:21:22 +0200 Received: by server1.l3s.uni-hannover.de (Postfix, from userid 21011) id 58A9F3240382; Thu, 13 Oct 2011 22:21:21 +0200 (CEST) X-Spam-Level: X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on server1.l3s.uni-hannover.de Received: from [192.168.0.147] (77-20-118-4-dynip.superkabel.de [77.20.118.4]) (Authenticated sender: pfau@server1.l3s.uni-hannover.de) by server1.l3s.uni-hannover.de (Postfix) with ESMTP id 706133240328 for ; Thu, 13 Oct 2011 22:21:09 +0200 (CEST) Message-ID: <4E974834.2030205@l3s.de> Date: Thu, 13 Oct 2011 22:21:08 +0200 From: Matthias Pfau User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110922 Thunderbird/3.1.15 MIME-Version: 1.0 To: user@cassandra.apache.org Subject: Re: Storing pre-sorted data References: <4E95C17C.5000008@l3s.de> <4E95C6C6.2060304@l3s.de> <4E969786.7040208@l3s.de> <4E97186A.5010401@l3s.de> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-PMX-Version: 5.5.9.395186, Antispam-Engine: 2.7.2.376379, Antispam-Data: 2011.10.13.201215 X-Virus-Checked: Checked by ClamAV on apache.org X-Old-Spam-Status: No, score=-51.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.3.1 Hi Stephen, we are hashing the first 8 byte (8 US-ASCII characters) of text that has been written by humans. Wouldn't it be easy for the attacker to do a dictionary attack on this text, especially if he knows the language of the text? Kind regards Matthias On 10/13/2011 08:20 PM, Stephen Connolly wrote: > in theory, however they have less than 32 bits of entropy from which > they can do that, leaving them with at least 32 more bits of > combinations to try... that's 2 billion or so... must be a big dictionary > > - Stephen > > --- > Sent from my Android phone, so random spelling mistakes, random nonsense > words and other nonsense are a direct result of using swype to type on > the screen > > On 13 Oct 2011 17:57, "Matthias Pfau" > > wrote: > > Hi Stephen, > this sounds very reasonable. But wouldn't this enable an attacker to > execute dictionary attacks in order to "decrypt" the first 8 bytes > of the plain text? > > Kind regards > Matthias > > On 10/13/2011 05:03 PM, Stephen Connolly wrote: > > It wouldn't be unencrypted... which is the point > > you use a one way linear hash function to take the first, say 8 > bytes, > of unencrypted data and turn it into 4 bytes of a sort prefix. > > You've used lost half the data in the process, so effectively > each bit > is an OR of two bits and you can only infer from 0 values... so data > is still encrypted, but you have an approximate sorting. > > For example, if your data is US-ASCII text with no numbers, you > could > use Soundex to get the pre-key, so that worst case you have a bucket > of values in the range. > > Using this technique, a random get will have to get the values > at the > desired prefix +/- a small amount rather than the whole row... > on the > client side you can then decrypt the data and sort that small bucket > to get the correct index position. > > You could do a 1 byte prefix, but that only gives you at best 256 > buckets and assumes that the first 2 bytes are uniformly > distributed... you've said your data is not uniformly > distributed, so > a linear hash function sounds like your best bet. > > your hash function should have the property that hash(A)>= > hash(B) if > and only if A>= B > > On 13 October 2011 08:47, Matthias Pfau > wrote: > > Hi Stephen, > this is a great idea but unfortunately doesn't work for us > either as we can > not store the data in an unencrypted form. > > Kind regards > Matthias > > On 10/12/2011 07:42 PM, Stephen Connolly wrote: > > > could you prefix the data with 3-4 bytes of a linear > hash of the > unencypted data? it wouldn't be a perfect sort, but > you'd have less of a > range to query to get the sorted values? > > - Stephen > > --- > Sent from my Android phone, so random spelling mistakes, > random nonsense > words and other nonsense are a direct result of using > swype to type on > the screen > > On 12 Oct 2011 17:57, "Matthias Pfau" >> > wrote: > > Unfortunately, that is not an option as we have to > store the data in > an compressed and encrypted and therefore binary and > non-sortable form. > > On 10/12/2011 06:39 PM, David McNelis wrote: > > Is it an option to not convert the data to > binary prior to > inserting > into Cassandra? Also, how large are the strings > you're sorting? > If its > viable to not convert to binary before writing > to Cassandra, and > you use > one of the string based column ordering > techniques (utf8, ascii, > for > example), then the data would be sorted without > you needing to > specifically worry about that. Of course, if > the strings are > lengthy > you could run into additional issues. > > On Wed, Oct 12, 2011 at 11:34 AM, Matthias > Pfau > > > >>> wrote: > > Hi there, > we are currently building a prototype based > on cassandra and > came > into problems on implementing sorted lists > containing > millions of items. > > The special thing about the items of our > lists is, that > cassandra is > not able to sort them as the data is stored > in a binary > format which > is not sortable. However, we are able to > sort the data > before the > plain data gets encoded (our application is > responsible for > the order). > > First Approach: Storing Lists in ColumnFamilies > *** > We first tried to map the list to a single > row of a > ColumnFamily in > a way that the index of the list is mapped > to the column > names and > the items of the list to the column values. > The column names > are > increasing numbers which define the sort order. > This has the major drawback that big parts > of the list have > to be > rewritten on inserts (because the column > names are numbered > by their > index), which are quite common. > > > Second Approach: Storing the whole List as > Binary Data: > *** > We tried to store the compressed list in a > single column. > However, > this is only feasible for smaller lists. Our > lists are far > to big > leading to multi megabyte reads and writes. > As we need to > read and > update the lists quite often, this would put > our Cassandra > cluster > under a lot of pressure. > > Ideal Solution: Native support for storing lists > *** > We would be very happy with a way to store a > list of sorted > values > without making improper use of column names > for the list > index. This > implies that we would need a possibility to > insert values at > defined > positions. We know that this could lead to > problems with > concurrent > inserts in a distributed environment, but > this is handled by > our > application logic. > > > What are your ideas on that? > > Thanks > Matthias > > > > > -- > *David McNelis* > Lead Software Engineer > Agentis Energy > www.agentisenergy.com > > > > c: 219.384.5143 > > > > /A Smart Grid technology company focused on > helping consumers of > energy > control an often under-managed resource./ > > > > > >