Hi,

Jim, it seems we share a very similar use case with highly variable rates in the timeseries data sources we archive. When I first started I was preoccupied with this very big difference in row lengths. I was using a schema similar to the one Aaron mentioned: for each data source I had a row with row key = <source:timestamp> and col name = <timestamp>.

At the time I was using 0.7, which did not have counters (or at least I was not aware of them). I counted the number of columns in every row on the inserting client side, and when a fixed threshold was reached for a certain data source (row key) I would generate a new row key for that data source with the structure <source:timestamp>, where timestamp = the timestamp of the last value added to the old row (the minimum amount of info needed to reconstruct a temporal query across multiple rows). At that point I would reset the counter for the data source to zero and start again. Of course I had to keep track of the row keys in a CF, and also flush the counters to another CF whenever the client went down, so I could rebuild the cache of counters when the client came back up.
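
In rough Python, the client-side rollover looked something like the sketch below (a simplified reconstruction, not the actual code; write_fn stands in for whatever client insert call is used, and persisting the counters and the key list to their CFs is left out):

ROLLOVER_THRESHOLD = 1000000   # max columns per row before starting a new one

counters = {}      # source -> number of columns written to its current row
current_keys = {}  # source -> row key currently being filled
last_ts = {}       # source -> timestamp of the last column written for the source

def insert_point(source, timestamp, value, write_fn):
    """Write one (timestamp, value) column, rolling to a new row key if needed."""
    if source not in current_keys:
        current_keys[source] = "%s:%s" % (source, timestamp)
        counters[source] = 0
    elif counters[source] >= ROLLOVER_THRESHOLD:
        # The new key carries the timestamp of the last value written to the
        # old row, which is enough to walk a time-range query across rows.
        current_keys[source] = "%s:%s" % (source, last_ts[source])
        counters[source] = 0

    write_fn(row_key=current_keys[source], column=timestamp, value=value)
    counters[source] += 1
    last_ts[source] = timestamp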

I can say this approach was a pain, and I eventually replaced it with a bucketing scheme similar to what Aaron described, with a fixed bucket size across all rows. As you can see, unfortunately, I am still trying to choose a bucket size that is the best compromise for all rows. But it is indeed a lot easier if you can generate all the possible keys for a certain data source on the retrieving client side. If you want more details on how I do this, let me know.

So, as I see from Aaron's suggestion, he's more in favour of pure uniform time bucketing. On Wednesday I'm going to attend http://www.cassandra-eu.org/ and hopefully I will get more opinions there. I'll follow up on this thread if something interesting comes up!

Cheers,
Alex



On Mon, Mar 26, 2012 at 4:10 AM, aaron morton <aaron@thelastpickle.com> wrote:
There is a great deal of utility in being able to derive the set of possible row keys for a date range on the client side. So I would try to carve up the time slices with respect to time rather than the amount of data in them. This may not be practical, but I think it's very useful.

Say you are storing the raw time series facts in the Fact CF, the row key is something like <source:datetime> (you may want to add a bucket size, see below) and the column name is the <isotimestamp>. The data source also has a bucket size stored somewhere, such as hourly, daily or monthly.

For an hourly bucket source, the datetime in the row keys is something like "2012-01-02T13:00" (one for each hour); for a daily bucket it's something like "2012-01-02T00:00". You can then work out the set of possible keys in a date range and perform multi selects against those keys until you have all the data.
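
For example, a rough sketch of working out the candidate keys on the client side (the function and source name are only illustrative, and the monthly case is left out since it needs calendar arithmetic rather than a fixed timedelta):

from datetime import datetime, timedelta

def bucket_keys(source, start, end, bucket):
    """Yield every <source:datetime> row key that could hold data in [start, end]."""
    if bucket == "hourly":
        cursor = start.replace(minute=0, second=0, microsecond=0)
        step = timedelta(hours=1)
    elif bucket == "daily":
        cursor = start.replace(hour=0, minute=0, second=0, microsecond=0)
        step = timedelta(days=1)
    else:
        raise ValueError("unsupported bucket size: %s" % bucket)

    while cursor <= end:
        yield "%s:%s" % (source, cursor.strftime("%Y-%m-%dT%H:%M"))
        cursor += step

# e.g. multi select against these keys until the requested range is covered
keys = list(bucket_keys("source-1", datetime(2012, 1, 2, 13, 30),
                        datetime(2012, 1, 2, 16, 0), "hourly"))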

If you change the bucketing scheme for a data source you need to keep a history so you can work out which keys may exist. That may be a huge pain. As an alternative, create a custom secondary index, as you discussed, of all the row keys for the data source. But continue to use a consistent time-based method for partitioning time ranges if possible.
  
Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 24/03/2012, at 3:22 AM, Jim Ancona wrote:

I'm dealing with a similar issue, with an additional complication. We are collecting time series data, and the amount of data per time period varies greatly. We will collect and query event data by account, but the biggest account will accumulate about 10,000 times as much data per time period as the median account. So for the median account I could put multiple years of data in one row, while for the largest accounts I don't want to put more than one day's worth in a row. If I use a uniform bucket size of one day (to accommodate the largest accounts) it will make for rows that are too short for the vast majority of accounts--we would have to read thirty rows to get a month's worth of data. One obvious approach is to set a maximum row size, that is, write data in a row until it reaches a maximum length, then start a new one. There are two things that make that harder than it sounds:
  1. There's no efficient way to count columns in a Cassandra row in order to find out when to start a new one. 
  2. Row keys aren't searchable. So I need to be able to construct or look up the key to each row that contains an account's data. (Our data will be in reverse date order.)

Possible solutions:

  1. Cassandra counter columns are an efficient way to keep counts
  2. I could have a "directory" row that contains pointers to the rows that contain an account's data

(I could probably combine the row directory and the column counter into a single counter column family, where the column name is the row key and the value is the counter.) A naive solution would require reading the directory before every read and the counter before every write--caching could probably help with that. So this approach would probably lead to a reasonable solution, but it's liable to be somewhat complex. Before I go much further down this path, I thought I'd run it by this group in case someone can point out a more clever solution.
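
To make that a bit more concrete, here is a rough sketch of the combined directory/counter idea (the read_counters, write_column, incr_counter and slice_row calls are placeholders for whatever client API we end up using, MAX_ROW_SIZE is arbitrary, and the caching is left out):

MAX_ROW_SIZE = 500000  # arbitrary cap on columns per data row

def write_event(account, timestamp, value, read_counters, write_column, incr_counter):
    # One counter row per account: each column name is a data-row key and its
    # counter value is that row's column count. ISO timestamps in the key mean
    # string order matches time order, so max() picks the newest data row.
    directory = read_counters(row_key=account)   # {data_row_key: column_count}
    current = max(directory) if directory else None
    if current is None or directory[current] >= MAX_ROW_SIZE:
        current = "%s:%s" % (account, timestamp)  # start a new data row
    write_column(row_key=current, column=timestamp, value=value)
    incr_counter(row_key=account, column=current, by=1)

def read_events(account, read_counters, slice_row):
    # The directory row lists every data-row key for the account; walk them
    # newest-first to match the reverse date order of the data.
    for data_row_key in sorted(read_counters(row_key=account), reverse=True):
        for column, value in slice_row(row_key=data_row_key):
            yield column, value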

Thanks,

Jim

On Thu, Mar 22, 2012 at 5:36 PM, Alexandru Sicoe <adsicoe@gmail.com> wrote:
Thanks Aaron, I'll lower the time bucket, see how it goes.

Cheers,
Alex


On Thu, Mar 22, 2012 at 10:07 PM, aaron morton <aaron@thelastpickle.com> wrote:
Will adding a few tens of wide rows like this every day cause me problems on the long term? Should I consider lowering the time bucket?
IMHO yeah, yup, ya and yes.


From experience I am a bit reluctant to create too many rows because I see that reading across multiple rows seriously affects performance. Of course I will use map-reduce as well... will it be significantly affected by many rows?
Don't think it would make too much difference. 
The range slice used by map-reduce will find the first row in the batch and then step through them.

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:

Hi guys,

Based on what you are saying there seems to be a tradeoff that developers have to handle between:

                               "keep your rows under a certain size" vs "keep data that's queried together, on disk together"

How would you handle this tradeoff in my case:

I monitor about 40,000 independent timeseries streams of data. The streams have highly variable rates. Each stream has its own row and I go to a new row every 28 hrs. With this scheme, I see several tens of rows reaching sizes in the millions of columns within this time bucket (the largest I saw was 6.4 million). The sizes of these wide rows are around 400 MBytes (considerably more than 60MB).

Will adding a few tens of wide rows like this every day cause me problems on the long term? Should I consider lowering the time bucket?

From experience I am a bit reluctant to create too many rows because I see that reading across multiple rows seriously affects performance. Of course I will use map-reduce as well... will it be significantly affected by many rows?

Cheers,
Alex

On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aaron@thelastpickle.com> wrote:
The reads are only fetching slices of 20 to 100 columns max at a time from the row, but if the key is planted on one node in the cluster I am concerned about that node getting the brunt of traffic.
What RF are you using, how many nodes are in the cluster, what CL do you read at ?

If you have lots of nodes that are in different racks, the NetworkTopologyStrategy will do a better job of distributing read load than the SimpleStrategy. The DynamicSnitch can also help distribute read load; see cassandra.yaml for its configuration.

I thought about breaking the column data into multiple different row keys to help distribute throughout the cluster but it's so darn handy having all the columns in one key!!
If you have a row that will continually grow it is a good idea to partition it in some way. Large rows can slow things like compaction and repair down. If you have something above 60MB it's starting to slow things down. Can you partition by a date range, such as month?

Large rows are also a little slower to query from.

If most reads are only pulling 20 to 100 columns at a time, are there two workloads? Is it possible to store just these columns in a separate row? If you understand how big a row may get, you may be able to use the row cache to improve performance.

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:

I have a row key which is now up to 125,000 columns (and anticipated to grow). I know this is a far cry from the 2 billion columns a single row key can store in Cassandra, but my concern is the amount of reads that this specific row key may get compared to other row keys. This particular row key houses column data associated with one of the more popular areas of the site. The reads are only fetching slices of 20 to 100 columns max at a time from the row, but if the key is planted on one node in the cluster I am concerned about that node getting the brunt of traffic.

I thought about breaking the column data into multiple different row keys to help distribute throughout the cluster but it's so darn handy having all the columns in one key!!

key_cache is enabled but row cache is disabled on the column family.

Should I be concerned going forward? Any particular advice on large wide rows?

Thanks!