From: Charles Blaxland
Date: Sun, 15 May 2011 17:56:49 +1000
Subject: Multiget_slice or composite column keys?
To: user@cassandra.apache.org

Hi All,

New to Cassandra, so apologies if I don't fully grok stuff just yet.

I have data keyed by a key as well as a date. I want to run a query that gets multiple keys across multiple contiguous date ranges simultaneously. I'm currently storing the date along with the row key, like this:

key1|2011-05-15 { c1 : , c2 : , c3 : ... }
key1|2011-05-16 { c1 : , c2 : , c3 : ... }
key2|2011-05-15 { c1 : , c2 : , c3 : ... }
key2|2011-05-16 { c1 : , c2 : , c3 : ... }
...

I generate all the key/date combinations I'm interested in and use multiget_slice to retrieve them, pulling in all the columns for each key (I need all the data, but the number of columns is small: fewer than 100). The total number of row keys retrieved will only be 100 or so.
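
To make that concrete, here's a rough pycassa-flavoured sketch of the first approach (the keyspace, column family, and server names are made up for the example; which client library I actually use doesn't matter to the question):

    from datetime import date, timedelta

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    events_by_day = pycassa.ColumnFamily(pool, 'EventsByKeyAndDate')

    def day_range(start, end):
        """Yield each date from start to end, inclusive."""
        d = start
        while d <= end:
            yield d
            d += timedelta(days=1)

    # Build every key|date row key in the range of interest...
    keys = ['key1', 'key2']
    row_keys = ['%s|%s' % (k, d.isoformat())
                for k in keys
                for d in day_range(date(2011, 5, 15), date(2011, 5, 16))]

    # ...then fetch them all at once. multiget wraps multiget_slice, and with
    # only ~100 row keys and fewer than 100 columns per row this stays cheap.
    rows = events_by_day.multiget(row_keys, column_count=100)

    for row_key, columns in rows.items():
        print(row_key, columns)
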
Now it strikes me that I could also store this using composite columns, like this:

key1 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
key2 { 2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
...

Then use multiget_slice again (but with fewer keys), and use a slice range to retrieve only the dates I'm interested in. Something along the lines of the sketch below.
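
Same made-up names as above; note I've cheated and shown the composite as a packed string column name under a UTF8 comparator, purely to illustrate the slice; a real CompositeType comparator would compare typed components instead:

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'EventsByKey')

    # One row per key. Because each column name starts with a zero-padded ISO
    # date, one contiguous slice covers a whole date range.
    rows = events.multiget(
        ['key1', 'key2'],
        column_start='2011-05-15',   # first date I want
        column_finish='2011-05-17',  # sorts just past the last '2011-05-16|...' name
        column_count=1000,           # upper bound: dates in range * columns per date
    )

    for key, columns in rows.items():
        for name, value in columns.items():
            day, col = name.split('|', 1)
            print(key, day, col, value)
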

Another alternative, I guess, would be to use OPP (the order-preserving partitioner) with the first storage approach and get_range_slices, but as I understand it this would not be great for performance, due to keys being clustered together on a single node?
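
(For what it's worth, that variant would look something like this, again with made-up names; get_range wraps get_range_slices:)

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    events_by_day = pycassa.ColumnFamily(pool, 'EventsByKeyAndDate')

    # Only meaningful on a cluster running the order-preserving partitioner,
    # where row keys are stored in sorted order, so contiguous key|date rows
    # can be scanned with a single range query.
    for row_key, columns in events_by_day.get_range(start='key1|2011-05-15',
                                                    finish='key1|2011-05-16'):
        print(row_key, columns)
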

So my question is: which approach is best? One downside to the latter, I guess, is that the number of columns grows without bound (although with 2 billion columns per row to play with, that isn't going to be a problem any time soon). Also, multiget_slice supports only one slice predicate, so I guess I'd have to use multiple queries to get multiple date ranges.

Anyway, any thoughts/tips appreciated.

Thanks,
Charles
