Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
From: Christian Decker <decker.christian@gmail.com>
Date: Sat, 28 Aug 2010 17:45:09 +0200
Message-ID: <AANLkTik4JpEOHo1orMkKhVZLGxws=aCodhZzYUgn+5jA@mail.gmail.com>
Subject: Join & Range Query performance
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=00c09f83a1c7f86ca5048ee41ef7

--00c09f83a1c7f86ca5048ee41ef7
Content-Type: text/plain; charset=ISO-8859-1

I'm wondering what the performance considerations are for join-like queries.

I have a ColumnFamily that holds millions of records (not unusual as I
understand) and I want to work on them using Pig and Hadoop. Until now we
always fetched all rows in Cassandra and just filtered and worked on them.
The idea now is to introduce indices to speed up some of these analyses.
Let's assume we have page hits, each with an associated user, and many of
our queries work on the users. Creating a ColumnFamily whose key is the user
id would be the logical choice, but that would mean storing all the data
twice (once in the all-encompassing ColumnFamily and once as
SubcolumnFamilies in the index), and since we might add further indices,
each one would multiply our data size again.

Usually in a relational world we'd not save the data in the index, but a
pointer to the real entry. Would it be wise to just store the keys of the
referenced items and then fetch them iteratively from the cluster?
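
To make the pointer idea concrete, here is a minimal sketch. It uses plain
Python dicts as stand-ins for the two ColumnFamilies, so no Cassandra client
is involved, and the names hits, user_index, insert_hit and
fetch_hits_for_user are made up purely for illustration:

```python
# Stand-ins for two ColumnFamilies; plain dicts, NOT a Cassandra client.
hits = {}        # row key -> full page-hit record (stored once)
user_index = {}  # user id -> list of row keys (pointers only, no data)

def insert_hit(key, user, url):
    """Store the record once and add only its key to the per-user index."""
    hits[key] = {"user": user, "url": url}
    user_index.setdefault(user, []).append(key)

def fetch_hits_for_user(user):
    """Read the index row, then fetch each referenced record (the 'join')."""
    return [hits[k] for k in user_index.get(user, [])]

insert_hit("hit-001", "alice", "/home")
insert_hit("hit-002", "bob", "/about")
insert_hit("hit-003", "alice", "/pricing")

alice_urls = [h["url"] for h in fetch_hits_for_user("alice")]
```

The index stays small because it holds only keys, but every referenced key
costs one more lookup. On a real cluster that would presumably be a network
round trip per key (or per batch), which is the trade-off I am asking about.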

Also, I'd like to know how key range queries perform compared to simple key
lookups. I'd like to build a dynamic storage system that splits really
large rows into smaller ones by adding one more byte to the key (so
a\0\0\0\0 might be split into a\0\0\0\0 - a\255\0\0\0, and we would then get
all results by simply querying a\0\0\0\0 through a\255\255\255\255).
I have no idea if this is even possible, just playing around with some ideas
:D
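
Roughly what I have in mind for the key splitting, as a sketch only. A
sorted in-memory list stands in for the row keys, and it assumes the keys
are kept in byte order, which on a real cluster would need an
order-preserving partitioner. The helpers put, range_query and sub_key are
hypothetical names, not part of any Cassandra API:

```python
import bisect

# Sorted list of (key, value) pairs standing in for byte-ordered row keys.
rows = []

def put(key: bytes, value: str) -> None:
    """Insert a row while keeping the list sorted by key."""
    bisect.insort(rows, (key, value))

def range_query(start: bytes, end: bytes):
    """Return all values with start <= key <= end, in key order."""
    lo = bisect.bisect_left(rows, (start,))
    out = []
    for key, value in rows[lo:]:
        if key > end:
            break
        out.append(value)
    return out

def sub_key(prefix: bytes, bucket: int) -> bytes:
    """Key of one split row: the prefix plus a bucket byte and three zero bytes."""
    return prefix + bytes([bucket, 0, 0, 0])

# The large row "a" split into smaller rows, plus an unrelated row "b".
put(sub_key(b"a", 0), "chunk-0")
put(sub_key(b"a", 128), "chunk-128")
put(sub_key(b"a", 255), "chunk-255")
put(sub_key(b"b", 0), "other-row")

# Query a\0\0\0\0 through a\255\255\255\255 to collect every piece of "a".
result = range_query(b"a" + b"\x00" * 4, b"a" + b"\xff" * 4)
```

Here the range query returns the three pieces of "a" in order and skips the
"b" row, because every split key sorts between the two bounds.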

Regards,
Chris

--00c09f83a1c7f86ca5048ee41ef7--