From: Brandon Williams
To: cassandra-user@incubator.apache.org
Date: Fri, 12 Mar 2010 15:07:44 -0600
Subject: Re: Strategies for storing lexically ordered data in supercolumns
In-Reply-To: <458b8f9e1003102254gae3eab9p4bacde04c797a5f4@mail.gmail.com>

On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang wrote:

> I'm wondering about good strategies for picking keys that I want to be
> lexically sorted in a super column family. For example, my data looks like
> this:
>
> [user1_uuid][connections][some_key_for_user2] = ""
> [user1_uuid][connections][some_key_for_user3] = ""
>
> I was thinking that I wanted some_key_for_user2 to be sorted by a user's
> name. So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list. That's fine. But I
> wonder what would happen if, say, a user changes their last name. Happens
> rarely but I imagine people getting married and modifying their name. Now
> the sort is no longer correct. There seems to be some bad consequences to
> creating keys based on data that can change.
>
> So what is the general (elegant, easy to maintain) strategy here? Always
> sort in your server-side code and don't bother trying to have the data
> sorted?
>

Having row keys based on something potentially volatile is something I
would avoid, since the row key determines which machine the row belongs to
and moving data between machines isn't a cheap operation.

What you'll probably want to do is make the key something unique (like a
uuid), store the user's name as a column on the row (thus making it easy to
update), and maintain a secondary index to get the name-based sorting you
want. If you're expecting a few million users, maintaining the index in a
special row will work fine (e.g., the row name is "NAMEINDEX" and the
column names are name+uuid, similar to what you described). If you have
billions of users, you'll need to get a bit fancier (partition based on the
first letter of the last name, for example).

-Brandon
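A minimal sketch of the layout described above, written in Python with
plain dicts and sorted lists standing in for rows and sorted columns. This
is not Cassandra client code. The UserStore class, its method names, the
tuple-shaped index columns (instead of a concatenated string), and the
lowercasing of names are all illustrative assumptions, not anything from
the thread.

```python
import bisect
import uuid
from collections import defaultdict


class UserStore:
    """Toy in-memory model of rows with sorted columns (hypothetical)."""

    def __init__(self, partitioned=False):
        # User rows: the row key is a uuid string, the name is ordinary data.
        self.users = {}
        # Index rows: index row key -> sorted list of column names.
        self.index = defaultdict(list)
        self.partitioned = partitioned

    def _index_row(self, last):
        # Either one special "NAMEINDEX" row, or one index row per first
        # letter of the last name (the "fancier" variant).
        if self.partitioned:
            return "NAMEINDEX_" + last[:1].lower()
        return "NAMEINDEX"

    @staticmethod
    def _index_column(last, first, user_id):
        # name + uuid: the uuid keeps columns unique when names collide.
        # Lowercasing for case-insensitive ordering is an assumption.
        return (last.lower(), first.lower(), user_id)

    def add_user(self, last, first):
        user_id = str(uuid.uuid4())
        self.users[user_id] = {"last": last, "first": first}
        column = self._index_column(last, first, user_id)
        bisect.insort(self.index[self._index_row(last)], column)
        return user_id

    def rename(self, user_id, last, first):
        # The row key (uuid) never changes. Only the name column on the
        # user row and the matching index column are rewritten.
        old = self.users[user_id]
        old_column = self._index_column(old["last"], old["first"], user_id)
        self.index[self._index_row(old["last"])].remove(old_column)
        self.users[user_id] = {"last": last, "first": first}
        column = self._index_column(last, first, user_id)
        bisect.insort(self.index[self._index_row(last)], column)

    def users_by_name(self):
        # Walk index rows in key order, then columns in sorted order.
        for row_key in sorted(self.index):
            for _last, _first, user_id in self.index[row_key]:
                yield user_id
```

In this sketch a name change deletes one index column and inserts another,
while the user's own row stays under the same uuid key, so nothing has to
move between machines.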