Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id ED8D99DBC for ; Thu, 13 Oct 2011 20:43:48 +0000 (UTC) Received: (qmail 16023 invoked by uid 500); 13 Oct 2011 20:43:46 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 15995 invoked by uid 500); 13 Oct 2011 20:43:46 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 15987 invoked by uid 99); 13 Oct 2011 20:43:46 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Oct 2011 20:43:46 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of stephen.alan.connolly@gmail.com designates 74.125.82.172 as permitted sender) Received: from [74.125.82.172] (HELO mail-wy0-f172.google.com) (74.125.82.172) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Oct 2011 20:43:40 +0000 Received: by wyg24 with SMTP id 24so2618252wyg.31 for ; Thu, 13 Oct 2011 13:43:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; bh=iq6663aaJihS8gZ3rCgR33E2O0jhHIvKXNX9gPTyY54=; b=U1qbQ1pbcSdkMlLOaiw4FuyV5Nm0tf2yuZR60nX/ZldGKTXHF4c5wyyw1mKRiaMZDy m2YZJOk3znjRCUOqmmJLN9PzFCkNfX9HNS8KteWK4m0oY/pgpRRiSfF5pEakkGUoBUh2 k/LBt9Fk5Ev0MmNedkZa5yyOV59xerE20h9Rk= MIME-Version: 1.0 Received: by 10.216.230.17 with SMTP id i17mr451114weq.88.1318538599819; Thu, 13 Oct 2011 13:43:19 -0700 (PDT) Received: by 10.216.164.194 with HTTP; Thu, 13 Oct 2011 13:43:19 -0700 (PDT) In-Reply-To: <4E974834.2030205@l3s.de> References: <4E95C17C.5000008@l3s.de> <4E95C6C6.2060304@l3s.de> <4E969786.7040208@l3s.de> <4E97186A.5010401@l3s.de> <4E974834.2030205@l3s.de> Date: Thu, 13 Oct 2011 21:43:19 +0100 Message-ID: Subject: Re: Storing pre-sorted data From: Stephen Connolly To: user@cassandra.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Then just use a soundex function on the first word in the text... that will shrink it sufficiently and give nice buckets in near sequential order (http://en.wikipedia.org/wiki/Soundex) On 13 October 2011 21:21, Matthias Pfau wrote: > Hi Stephen, > we are hashing the first 8 byte (8 US-ASCII characters) of text that has > been written by humans. Wouldn't it be easy for the attacker to do a > dictionary attack on this text, especially if he knows the language of th= e > text? > > Kind regards > Matthias > > On 10/13/2011 08:20 PM, Stephen Connolly wrote: >> >> in theory, however they have less than 32 bits of entropy from which >> they can do that, leaving them with at least 32 more bits of >> combinations to try... that's 2 billion or so... must be a big dictionar= y >> >> - Stephen >> >> --- >> Sent from my Android phone, so random spelling mistakes, random nonsense >> words and other nonsense are a direct result of using swype to type on >> the screen >> >> On 13 Oct 2011 17:57, "Matthias Pfau" > >> wrote: >> >> =A0 =A0Hi Stephen, >> =A0 =A0this sounds very reasonable. But wouldn't this enable an attacker= to >> =A0 =A0execute dictionary attacks in order to "decrypt" the first 8 byte= s >> =A0 =A0of the plain text? >> >> =A0 =A0Kind regards >> =A0 =A0Matthias >> >> =A0 =A0On 10/13/2011 05:03 PM, Stephen Connolly wrote: >> >> =A0 =A0 =A0 =A0It wouldn't be unencrypted... which is the point >> >> =A0 =A0 =A0 =A0you use a one way linear hash function to take the first,= say 8 >> =A0 =A0 =A0 =A0bytes, >> =A0 =A0 =A0 =A0of unencrypted data and turn it into 4 bytes of a sort pr= efix. >> >> =A0 =A0 =A0 =A0You've used lost half the data in the process, so effecti= vely >> =A0 =A0 =A0 =A0each bit >> =A0 =A0 =A0 =A0is an OR of two bits and you can only infer from 0 values= ... so >> data >> =A0 =A0 =A0 =A0is still encrypted, but you have an approximate sorting. >> >> =A0 =A0 =A0 =A0For example, if your data is US-ASCII text with no number= s, you >> =A0 =A0 =A0 =A0could >> =A0 =A0 =A0 =A0use Soundex to get the pre-key, so that worst case you ha= ve a >> bucket >> =A0 =A0 =A0 =A0of values in the range. >> >> =A0 =A0 =A0 =A0Using this technique, a random get will have to get the v= alues >> =A0 =A0 =A0 =A0at the >> =A0 =A0 =A0 =A0desired prefix +/- a small amount rather than the whole r= ow... >> =A0 =A0 =A0 =A0on the >> =A0 =A0 =A0 =A0client side you can then decrypt the data and sort that s= mall >> bucket >> =A0 =A0 =A0 =A0to get the correct index position. >> >> =A0 =A0 =A0 =A0You could do a 1 byte prefix, but that only gives you at = best 256 >> =A0 =A0 =A0 =A0buckets and assumes that the first 2 bytes are uniformly >> =A0 =A0 =A0 =A0distributed... you've said your data is not uniformly >> =A0 =A0 =A0 =A0distributed, so >> =A0 =A0 =A0 =A0a linear hash function sounds like your best bet. >> >> =A0 =A0 =A0 =A0your hash function should have the property that hash(A)>= =3D >> =A0 =A0 =A0 =A0hash(B) if >> =A0 =A0 =A0 =A0and only if A>=3D B >> >> =A0 =A0 =A0 =A0On 13 October 2011 08:47, Matthias Pfau> =A0 =A0 =A0 =A0> =A0wrote: >> >> =A0 =A0 =A0 =A0 =A0 =A0Hi Stephen, >> =A0 =A0 =A0 =A0 =A0 =A0this is a great idea but unfortunately doesn't wo= rk for us >> =A0 =A0 =A0 =A0 =A0 =A0either as we can >> =A0 =A0 =A0 =A0 =A0 =A0not store the data in an unencrypted form. >> >> =A0 =A0 =A0 =A0 =A0 =A0Kind regards >> =A0 =A0 =A0 =A0 =A0 =A0Matthias >> >> =A0 =A0 =A0 =A0 =A0 =A0On 10/12/2011 07:42 PM, Stephen Connolly wrote: >> >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0could you prefix the data with 3-4 bytes = of a linear >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0hash of the >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0unencypted data? it wouldn't be a perfect= sort, but >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0you'd have less of a >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0range to query to get the sorted values? >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0- Stephen >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0--- >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Sent from my Android phone, so random spe= lling mistakes, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0random nonsense >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0words and other nonsense are a direct res= ult of using >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0swype to type on >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0the screen >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0On 12 Oct 2011 17:57, "Matthias Pfau"> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0>> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0wrote: >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Unfortunately, that is not an opt= ion as we have to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0store the data in >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0an compressed and encrypted and t= herefore binary and >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0non-sortable form. >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0On 10/12/2011 06:39 PM, David McN= elis wrote: >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Is it an option to not co= nvert the data to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0binary prior to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0inserting >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0into Cassandra? =A0Also, = how large are the strings >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0you're sorting? >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0If its >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0viable to not convert to = binary before writing >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0to Cassandra, and >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0you use >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0one of the string based c= olumn ordering >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0techniques (utf8, ascii, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0for >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0example), then the data w= ould be sorted without >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0you =A0needing to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0specifically worry about = that. =A0Of course, if >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0the strings are >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0lengthy >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0you could run into =A0add= itional issues. >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0On Wed, Oct 12, 2011 at 1= 1:34 AM, Matthias >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Pfau >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0>>> =A0wrote: >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Hi there, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0we are currently = building a prototype based >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0on cassandra and >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0came >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0into problems on = implementing sorted lists >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0containing >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0millions of items. >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0The special thing= about the items of our >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0lists is, that >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cassandra is >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0not able to sort = them as the data is stored >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0in a binary >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0format which >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0is not sortable. = However, we are able to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0sort the data >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0before the >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0plain data gets e= ncoded (our application is >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0responsible for >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0the order). >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0First Approach: S= toring Lists in ColumnFamilies >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*** >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0We first tried to= map the list to a single >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0row of a >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ColumnFamily in >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0a way that the in= dex of the list is mapped >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0to the column >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0names and >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0the items of the = list to the column values. >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0The column names >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0are >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0increasing number= s which define the sort order. >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0This has the majo= r drawback that big parts >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0of the list have >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0to be >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0rewritten on inse= rts (because the column >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0names are numbered >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0by their >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0index), which are= quite common. >> >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Second Approach: = Storing the whole List as >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Binary Data: >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*** >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0We tried to store= the compressed list in a >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0single column. >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0However, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0this is only feas= ible for smaller lists. Our >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0lists are far >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0to big >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0leading to multi = megabyte reads and writes. >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0As we need to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0read and >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0update the lists = quite often, this would put >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0our Cassandra >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cluster >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0under a lot of pr= essure. >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Ideal Solution: N= ative support for storing >> lists >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*** >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0We would be very = happy with a way to store a >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0list of sorted >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0values >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0without making im= proper use of column names >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0for the list >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0index. This >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0implies that we w= ould need a possibility to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0insert values at >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0defined >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0positions. We kno= w that this could lead to >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0problems with >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0concurrent >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0inserts in a dist= ributed environment, but >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0this is handled by >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0our >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0application logic= . >> >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0What are your ide= as on that? >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Thanks >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Matthias >> >> >> >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0-- >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*David McNelis* >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Lead Software Engineer >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Agentis Energy >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0www.agentisenergy.com >> >> =A0> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0c: 219.384.5143 >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0> >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0/A Smart Grid technology = company focused on >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0helping consumers of >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0energy >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0control an often under-ma= naged resource./ >> >> >> >> >> >> > >