Mailing-List: contact cassandra-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cassandra-user@incubator.apache.org
MIME-Version: 1.0
In-Reply-To: <458b8f9e1003102254gae3eab9p4bacde04c797a5f4@mail.gmail.com>
References: <458b8f9e1003102254gae3eab9p4bacde04c797a5f4@mail.gmail.com>
Date: Fri, 12 Mar 2010 15:07:44 -0600
Message-ID: <cdc5ad201003121307m1c893cd4iccfdc6bd74e24d49@mail.gmail.com>
Subject: Re: Strategies for storing lexically ordered data in supercolumns
From: Brandon Williams <driftx@gmail.com>
To: cassandra-user@incubator.apache.org
Content-Type: multipart/alternative; boundary=0016369208b43882c20481a0ec7a

--0016369208b43882c20481a0ec7a
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang <peter78@gmail.com> wrote:

> I'm wondering about good strategies for picking keys that I want to be
> lexically sorted in a super column family. For example, my data looks like
> this:
>
> [user1_uuid][connections][some_key_for_user2] = ""
> [user1_uuid][connections][some_key_for_user3] = ""
>
> I was thinking that I wanted some_key_for_user2 to be sorted by a user's
> name. So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list. That's fine. But I
> wonder what would happen if, say, a user changes their last name. Happens
> rarely but I imagine people getting married and modifying their name. Now
> the sort is no longer correct. There seem to be some bad consequences to
> creating keys based on data that can change.
>
> So what is the general (elegant, easy to maintain) strategy here? Always
> sort in your server-side code and don't bother trying to have the data
> sorted?
>

I would avoid basing row keys on anything potentially volatile, since the
row key determines which machine the row belongs to, and moving data between
machines isn't a cheap operation.

What you'll probably want to do is make the key something unique (like a
uuid), store the user's name as a column on the row (thus making it easy to
update), and maintain a secondary index to get the name-based sorting you
want.  If you're expecting a few million users, maintaining the index in a
special row will work fine (e.g., the row name is "NAMEINDEX" and the columns
are the name+uuid, similar to what you described).  If you have billions of
users, you'll need to get a bit fancier (partitioning the index by the first
letter of the last name, for example).
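
To make the index-row idea concrete, here is a rough sketch in plain Python.
It is an in-memory stand-in, not Cassandra client code: the NameIndex class,
the "\x00" separator, the lowercasing, and index_row_for() are all assumptions
made for illustration. The point it shows is that a rename only touches the
name column and one index entry, while the uuid row key never changes.

```python
import bisect
import uuid

SEP = "\x00"  # assumed separator; sorts before any printable character


class NameIndex:
    """In-memory stand-in for user rows plus one sorted index row."""

    def __init__(self):
        self.users = {}    # row key (uuid string) -> {"last": ..., "first": ...}
        self.columns = []  # sorted column names of the "NAMEINDEX" row

    @staticmethod
    def column_name(last, first, user_id):
        # name + uuid, so users sharing a name still get distinct columns
        return SEP.join((last.lower(), first.lower(), user_id))

    def add_user(self, last, first):
        user_id = str(uuid.uuid4())  # stable row key, independent of the name
        self.users[user_id] = {"last": last, "first": first}
        bisect.insort(self.columns, self.column_name(last, first, user_id))
        return user_id

    def rename(self, user_id, last, first):
        # Remove the old index column, update the name, insert the new column.
        old = self.users[user_id]
        old_col = self.column_name(old["last"], old["first"], user_id)
        self.columns.pop(bisect.bisect_left(self.columns, old_col))
        self.users[user_id] = {"last": last, "first": first}
        bisect.insort(self.columns, self.column_name(last, first, user_id))

    def sorted_user_ids(self):
        # The uuid is the last component of each index column name.
        return [col.rsplit(SEP, 1)[1] for col in self.columns]


def index_row_for(last):
    # Hypothetical partitioning for very large user counts:
    # one index row per first letter of the last name.
    return "NAMEINDEX_" + last[:1].upper()
```

With a real cluster the two list operations in rename() would become a column
delete and a column insert on the index row, and the column comparator would
do the sorting that bisect does here.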

-Brandon

--0016369208b43882c20481a0ec7a--