incubator-cassandra-user mailing list archives

From aaron morton
Subject Re: Multiget_slice or composite column keys?
Date Mon, 16 May 2011 09:13:59 GMT
I'd stick with the RandomPartitioner until you have a really good reason to change :)

I'd also go with your alternative design with some possible tweaks. 

Consider partitioning the rows by year or some other sensible value. If you will generally
be getting the most recent data, this can reduce the need for Cassandra to read SSTables that
contain the row key but do not contain any of the required columns.
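
A rough sketch of the year partitioning, in Python. The `key|year` row key layout and the helper names are assumptions made up for illustration, not anything Cassandra defines. The idea is that a read for recent dates only has to name the current year's row:

```python
from datetime import date
from typing import List


def bucketed_row_key(key: str, day: date) -> str:
    """Hypothetical row key layout '<key>|<year>' (an assumption, not a Cassandra rule)."""
    return f"{key}|{day.year}"


def column_name(day: date, field: str) -> str:
    """Composite column name '<ISO date>|<field>'.

    Zero-padded ISO dates sort chronologically when compared as strings.
    """
    return f"{day.isoformat()}|{field}"


def row_keys_for_range(key: str, start: date, end: date) -> List[str]:
    """Row keys that have to be read to cover start..end (inclusive)."""
    return [f"{key}|{year}" for year in range(start.year, end.year + 1)]


# A range inside one year needs one row; a range across New Year needs two.
one_year = row_keys_for_range("key1", date(2011, 5, 1), date(2011, 5, 16))
two_years = row_keys_for_range("key1", date(2010, 12, 30), date(2011, 1, 2))
```

The trade-off is that a query spanning a bucket boundary has to read more than one row per key.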

Depending on how the data is collected, consider storing all the data collected for a certain
date in a single column using something like JSON. This would allow you to have a single column
for each observation. This makes it easier to use a SliceRange to get, say, all the observations
from 01/05/2011.
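
To illustrate the one-column-per-observation idea, here is a small in-memory sketch. It does not call Cassandra at all: the row is a plain dict, `slice_columns` is a hypothetical stand-in that only mimics an inclusive start/finish slice over sorted column names, and the ISO date column names are an assumption.

```python
import json
from bisect import bisect_left, bisect_right


def pack_observation(fields: dict) -> str:
    """Serialise every value from one observation into a single column value."""
    return json.dumps(fields, sort_keys=True)


def slice_columns(row: dict, start: str, finish: str) -> list:
    """Mimic an inclusive slice over column names kept in sorted order."""
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return [(name, json.loads(row[name])) for name in names[lo:hi]]


# One row for key1: the column name is the date, the value is the JSON blob.
row = {
    "2011-04-30": pack_observation({"c1": 1, "c2": 2}),
    "2011-05-01": pack_observation({"c1": 3, "c2": 4}),
    "2011-05-02": pack_observation({"c1": 5, "c2": 6}),
}

# Everything from 2011-05-01 up to and including 2011-05-02.
recent = slice_columns(row, "2011-05-01", "2011-05-02")
```

With one column per date, a single start/finish pair covers a contiguous date range, at the cost of having to rewrite the whole JSON value to change one field.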

If you often want to read certain keys for a single day (or a few days) consider pivoting
the data so the key is the date and the columns are the current row keys. 
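
The pivot can be pictured like this. Again this is only a sketch over plain dicts; the `key|date` input layout is taken from the original question, and the `pivot` helper is hypothetical.

```python
from collections import defaultdict


def pivot(rows: dict) -> dict:
    """Turn {'<key>|<date>': value} into {'<date>': {'<key>': value}}."""
    by_date = defaultdict(dict)
    for row_key, value in rows.items():
        key, day = row_key.split("|", 1)
        by_date[day][key] = value
    return dict(by_date)


original = {
    "key1|2011-05-15": {"c1": 1},
    "key2|2011-05-15": {"c1": 2},
    "key1|2011-05-16": {"c1": 3},
}

# After pivoting, one row per date holds a column for every original key.
pivoted = pivot(original)
```

Reading a handful of keys for one day then becomes a read of named columns from a single row, rather than one row read per key.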

Hope that helps. 

Aaron Morton
Freelance Cassandra Developer

On 15 May 2011, at 19:56, Charles Blaxland wrote:

> Hi All,
> New to Cassandra, so apologies if I don't fully grok stuff just yet.
> I have data keyed by a key as well as a date. I want to run a query to get multiple keys
> across multiple contiguous date ranges simultaneously. I'm currently storing the date along
> with the row key like this:
> key1|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> key1|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> key2|2011-05-15 {  c1 : , c2 :,  c3 : ... }
> key2|2011-05-16 {  c1 : , c2 :,  c3 : ... }
> ...
> I generate all the key/date combinations that I'm interested in and use multiget_slice
> to retrieve them, pulling in all the columns for each key (I need all the data, but the number
> of columns is small: less than 100). The total number of row keys retrieved will only be 100
> or so.
> Now it strikes me I could also store this using composite columns, like this:
> key1 {  2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> key2 {  2011-05-15|c1 : , 2011-05-16|c1 : , 2011-05-15|c2 : , 2011-05-16|c2 : , 2011-05-15|c3 : , 2011-05-16|c3 : , ... }
> ...
> Then use multiget_slice again (but with fewer keys), and use a slice range to only retrieve
> the dates I'm interested in.
> Another alternative I guess would be to use OPP with the first storage approach and get_range_slices,
> but as I understand this would not be great for performance due to keys being clustered together
> on a single node?
> So my question is, which approach is best? One downside to the latter I guess is that
> the number of columns grows without bound (although with 2 billion to play with this isn't
> gonna be a problem any time soon). Also multiget_slice supports only one slice predicate,
> so I'd guess I'd have to use multiple queries to get multiple date ranges.
> Anyway, any thoughts/tips appreciated.
> Thanks,
> Charles
