cassandra-user mailing list archives

From jason kowalewski <jay.kowalew...@gmail.com>
Subject Data modeling for read performance
Date Thu, 17 May 2012 15:55:16 GMT
We have been attempting to change our data model to improve read
performance in our cluster.

There are a couple of ways to model the data, and I was wondering if some
people out there could help us choose.

We are storing time-series data, currently keyed by user id. This approach
is leading to hot-spotting on some nodes, most likely because the key
distribution does not match the usage pattern. We are also using super
columns (the super column name is the timestamp), which we intend to get
rid of as part of this data model redesign.

The first idea we had is to shard the data into time buckets using composite
row keys:

UserId:<TimeBucket> : {
  <timestamp>:<colname1> = <col value1>,
  <timestamp>:<colname2> = <col value2>,
  ... and so on
}

We can then use a wide row index for tracking these in the future: 
<TimeBucket>: { 
  <userId> = null
} 

With this first approach, the data would always be retrieved by a point get
on the composite row key.
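
Roughly, the first approach would look something like this with pycassa
(just a sketch - the keyspace, CF names, hourly bucket format and the
CompositeType comparator are illustrative assumptions, not our real schema):

# Sketch of approach 1: time-bucketed composite row keys, plus a wide-row
# index of which users have data in each bucket (illustrative names only).
# Assumes 'UserEvents' uses a CompositeType(LongType, UTF8Type) comparator.
from datetime import datetime
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Metrics', ['localhost:9160'])
events = ColumnFamily(pool, 'UserEvents')
index = ColumnFamily(pool, 'BucketIndex')

def bucket(ts_ms):
    # hourly buckets, e.g. 1337270115000 -> '2012051715'
    return datetime.utcfromtimestamp(ts_ms // 1000).strftime('%Y%m%d%H')

def write_event(user_id, ts_ms, values):
    row_key = '%s:%s' % (user_id, bucket(ts_ms))
    # one (timestamp, name) composite column per value
    events.insert(row_key,
                  dict(((ts_ms, name), str(val)) for name, val in values.items()))
    # record that this user has data in this bucket
    index.insert(bucket(ts_ms), {user_id: ''})

def read_bucket(user_id, ts_ms):
    # single point get on the composite row key; each bucket holds <= 200 events
    return events.get('%s:%s' % (user_id, bucket(ts_ms)), column_count=500)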

Alternatively we could just do wide rows using composite columns: 

UserId : {
  <timestamp>:<colname1> = <col value1>,
  <timestamp>:<colname2> = <col value2>,
  ... and so on
}


The second approach has less granular keys, but makes it easier to group
historical time series without sharding the data into buckets. It would also
rely solely on range slices over the columns to retrieve the data.
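
For the second approach the read path would be a range slice over the
composite columns, something like this (again only a sketch with made-up
names; xget is pycassa's paging variant of get):

# Sketch of approach 2: one wide row per user, (timestamp, name) composite
# columns, read with a range slice over the timestamp component.
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Metrics', ['localhost:9160'])
events = ColumnFamily(pool, 'UserEventsWide')

def read_range(user_id, start_ms, end_ms, batch=1000):
    # xget returns a generator and pages through the row batch by batch,
    # so a row with hundreds of thousands of columns is not pulled at once
    return events.xget(user_id,
                       column_start=(start_ms,),
                       column_finish=(end_ms,),
                       buffer_size=batch)

for (ts, name), value in read_range('user123', 1337270000000, 1337273600000):
    print ts, name, value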

Is there a speed advantage to a row point get in the first approach vs. range
scans over the columns in the second approach? In the first approach each
bucket would have no more than 200 events. In the second approach we would
expect the number of columns to run from the thousands into the hundreds of
thousands... Our reads today (using super columns) are PAINFULLY slow -
the cluster is constantly timing out on many nodes and disk I/O is very high.

Also, instead of storing each value under its own composite column, is it
better to serialize the multiple values into a single format (JSON, binary,
etc.) to reduce the number of disk seeks when paging over this time-series
data?
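
What we have in mind is one column per timestamp whose value is the whole
serialized event, something like this (JSON picked arbitrarily, names made
up):

# Sketch of packing all values for a timestamp into one serialized column
# instead of one composite column per value. Assumes a LongType comparator.
import json
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Metrics', ['localhost:9160'])
events = ColumnFamily(pool, 'UserEventsPacked')

def write_event(user_id, ts_ms, values):
    # one column per event; the value is the serialized payload
    events.insert(user_id, {ts_ms: json.dumps(values)})

def read_range(user_id, start_ms, end_ms):
    cols = events.get(user_id, column_start=start_ms,
                      column_finish=end_ms, column_count=10000)
    return [(ts, json.loads(blob)) for ts, blob in cols.items()]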

Thanks for any ideas out there! 


-Jason

