>> column name: 2012-04-12T12:22:23.293/55/45/10 

Realize that the date would be stored in binary form, not as text.

>>  At peak, hourly rotation would decrease the row size to 180M data points vs. 1.2B.

The theoretical limit is 2B columns per row, but I doubt anyone recommends getting anywhere near that. Others probably have better advice on 'ideal row sizes'; I'd think I'd want to be in the couple-hundred-thousand range. But you probably need to prove that out for your own data and systems.
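As a rough worked example, here is how long a bucket could be at peak load for a given row-size target. The 200,000-column target is only an assumed stand-in for "a couple hundred thousand", and `bucket_seconds` is a hypothetical helper; the 2,500 entries per second and 20 fields per entry are the figures quoted elsewhere in this thread.

```python
def bucket_seconds(target_columns: int, entries_per_second: int,
                   columns_per_entry: int) -> float:
    """Seconds of peak traffic that fit in one row of `target_columns` columns."""
    return target_columns / (entries_per_second * columns_per_entry)

TARGET_COLUMNS = 200_000   # assumed target row size, not a Cassandra setting
PEAK_RATE = 2500           # log entries per second at peak

# One column per field (20 columns per log entry).
per_field = bucket_seconds(TARGET_COLUMNS, PEAK_RATE, 20)

# One composite column per log entry (all fields packed into the name).
per_entry = bucket_seconds(TARGET_COLUMNS, PEAK_RATE, 1)

print(per_field, per_entry)
```

Under these assumptions a row fills in seconds rather than hours at peak, which is why the right bucket length has to be measured against real data.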


----- Original Message -----
From: "Dave Brosius" <dbrosius@mebigfatguy.com>
Sent: Wed, April 18, 2012 16:58
Subject: Re: Column Family per User


Yes, in this Cassandra model time wouldn't be a column value; it would be part of the column name. Depending on how you want to access your data (for example, "give me all data points for time X") and how many separate data points you have for time X, you might consider packing all the data for a given time into one column through composite columns:

column name: 2012-04-12T12:22:23.293/55/45/10 

(where / is a human-readable representation of the composite separator). In this case there wouldn't actually be a value; the data is just encoded in the column name.
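A minimal Python sketch of the idea of a value-less column whose name carries the data. This is not Cassandra's actual composite encoding; the byte layout and the helper names `pack_column_name` and `unpack_column_name` are assumptions made for illustration only. It shows a timestamp and three readings packed into one binary name that sorts by time.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed layout: 8-byte big-endian millisecond timestamp, then three
# 4-byte big-endian integers (wind, rain, sunshine). 20 bytes in total.
LAYOUT = ">qiii"

def pack_column_name(ts: datetime, wind: int, rain: int, sunshine: int) -> bytes:
    # Integer arithmetic avoids float rounding on the millisecond value.
    millis = (ts - EPOCH) // timedelta(milliseconds=1)
    return struct.pack(LAYOUT, millis, wind, rain, sunshine)

def unpack_column_name(name: bytes):
    millis, wind, rain, sunshine = struct.unpack(LAYOUT, name)
    return EPOCH + timedelta(milliseconds=millis), wind, rain, sunshine

first = pack_column_name(
    datetime(2012, 4, 12, 12, 22, 23, 293000, tzinfo=timezone.utc), 55, 45, 10)
second = pack_column_name(
    datetime(2012, 4, 12, 12, 22, 24, 293000, tzinfo=timezone.utc), 45, 25, 25)
```

Because the timestamp comes first and is big-endian, comparing the packed names as bytes orders them by time (for timestamps after 1970), which is what a time-ordered row needs.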

Obviously, if you are storing dozens of separate data points per timestamp, then this gets out of hand quickly, and perhaps you need to go back to column names in time/fieldname format with a real value.

The advantage of the composite key, though, is that you eliminate all that constant blather about 'Wind', 'Rain', 'Sunshine' in your data and only hold real data. (Granted, compression will probably help here, but not having it at all is even better.)

As for row size, obviously that takes some experimentation on your part. You can bucket a row to be any time frame you want. If you feel that 15 minutes is the correct length of time given the amount of data you will write, then use 15 minutes. If it's 1 hour, use 1 hour. The only thing you have to figure out is a 'bucket time' definition that you understand; most likely it's the timestamp of when that time period starts.

As for 'rotating the row', perhaps it's just semantics, but there really is no such concept. You are at some point in time, and you want to write some data to the database.

The steps are

1) get the user
2) get the timestamp of the current bucket based on 'now'
3) build a composite key
4) insert the data with that key

Whether that row existed before or is a new row has no bearing on your client code.
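
The four steps above can be sketched in Python as follows. This is a minimal illustration, not client code for any real Cassandra driver: `bucket_start`, `row_key`, `write_point`, the 15-minute bucket, the `user:timestamp` key layout, and the plain dict standing in for the column family are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BUCKET = timedelta(minutes=15)  # assumed bucket length; tune for your data

def bucket_start(now: datetime, bucket: timedelta = BUCKET) -> datetime:
    """Step 2: timestamp at which the bucket containing `now` starts."""
    return now - (now - EPOCH) % bucket

def row_key(user: str, now: datetime) -> str:
    """Step 3: composite key made of the user and the bucket start time."""
    return f"{user}:{bucket_start(now):%Y-%m-%dT%H:%M}"

def write_point(store: dict, user: str, now: datetime,
                column_name: str, value: bytes = b"") -> None:
    """Step 4: insert under the key; the row is created on first write."""
    store.setdefault(row_key(user, now), {})[column_name] = value

store = {}                                  # stand-in for the column family
now = datetime(2012, 4, 18, 16, 58, 12, tzinfo=timezone.utc)
write_point(store, "george", now, "point-1")
write_point(store, "george", now + timedelta(minutes=1), "point-2")
write_point(store, "george", now + timedelta(minutes=20), "point-3")
```

The caller never checks whether the row already exists: the first two writes land in the 16:45 bucket and the third one lands in the 17:15 bucket simply because the computed key changed.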



----- Original Message -----
From: "Trevor Francis" <trevor.francis@tgrahamcapital.com>
Sent: Wed, April 18, 2012 16:42
Subject: Re: Column Family per User

I am trying to grasp this concept, so let me try a scenario.
 
Let's say I have 5 data points being captured in the log file. Here would be a typical table schema in MySQL.
 
Id, Username, Time, Wind, Rain, Sunshine
 
Select * from table; would reveal:
 
1, george, 2012-04-12T12:22:23.293, 55, 45, 10
2, george, 2012-04-12T12:22:24.293, 45, 25, 25
3, george, 2012-04-12T12:22:25.293, 35, 15, 11
4, george, 2012-04-12T12:22:26.293, 55, 65, 16
5, george, 2012-04-12T12:22:27.293, 12, 5, 22
 
And it would just continue from there adding rows as log files are imported.
 
A select * from table where sunshine="16" would yield:
 
4, george, 2012-04-12T12:22:26.293, 55, 65, 16
 
  
Now, you are saying that in Cassandra, instead of having a bunch of rows containing ordered information (which is what I would have), I would have a single row with multiple columns:
 
George | 2012-04-12T12:22:23.293, wind=55 | 2012-04-12T12:22:23.293, Rain=45 | 2012-04-12T12:22:23.293, Sunshine=10 | .....continued.
 
So George would be the row and the columns would be the actual data. The data would be oriented horizontally, vs. vertically (MySQL).
 
So for instance, log generation on our application isn't linear, as it peaks at certain times of the day. A user generating 2,500 entries per second at peak would typically generate 60M log entries per day. Multiply that by 20 data pieces and you are looking at 1.2B columns in a given day for that user. Assuming we batch the writes every minute, can a node handle this sort of load?
 
Also, can we "rotate" the row every day? Would it make more sense to rotate hourly? At peak, hourly rotation would decrease the row size to 180M data points vs. 1.2B.
 
At max, we may only have 500 users on our platform. That means that if we did hourly row rotation, that would be 12,000 rows per day, with a maximum row size of 180M columns.
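
The figures above can be checked with a few lines of Python. The constant names are made up for this sketch; the numbers are the ones quoted in this message, with the peak rate taken as 2,500 entries per second.

```python
PEAK_ENTRIES_PER_SECOND = 2500
ENTRIES_PER_DAY = 60_000_000
FIELDS_PER_ENTRY = 20
MAX_USERS = 500

# Daily row: one column per field for every entry written that day.
columns_per_daily_row = ENTRIES_PER_DAY * FIELDS_PER_ENTRY

# Hourly row at peak: a full hour at the peak rate, one column per field.
columns_per_peak_hourly_row = PEAK_ENTRIES_PER_SECOND * 3600 * FIELDS_PER_ENTRY

# Hourly rotation creates 24 rows per user per day.
rows_per_day = MAX_USERS * 24

print(f"{columns_per_daily_row:,} {columns_per_peak_hourly_row:,} {rows_per_day:,}")
```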
 
 
Am I grasping this concept properly?

Trevor Francis
 

On Apr 18, 2012, at 3:06 PM, Dave Brosius wrote:


Your design should be around how you want to query. If you are only querying by user, then having a user as part of the row key makes sense. To manage row size, you should think of a row as being a bucket of time. Cassandra supports a large (but not without bounds) row size. To manage row size you might say that this row is for user fred for the month of April, or if that's too much perhaps the row is for user fred for the day 4/18/12. To do this you can use composite keys to hold both pieces of information in the key: (user, bucketpos).

The nice thing is that once the time period has come and gone, that row is complete, and you can perform background jobs against that row and store summary information for that time period.
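
A minimal sketch of such a background job in Python, assuming a completed row has already been read into a dict that maps each timestamp to its field values. The function name `summarize_bucket` and the in-memory layout are hypothetical; the sample readings are the ones from the MySQL example in this thread.

```python
from datetime import datetime
from statistics import fmean

def summarize_bucket(columns: dict, field: str):
    """Average one field over every data point in a completed bucket.

    Returns None for an empty bucket so the caller can skip storing it.
    """
    values = [point[field] for point in columns.values()]
    return fmean(values) if values else None

# One completed row for user george (stand-in for data read from Cassandra).
row = {
    datetime(2012, 4, 12, 12, 22, 23, 293000): {"wind": 55, "rain": 45, "sunshine": 10},
    datetime(2012, 4, 12, 12, 22, 24, 293000): {"wind": 45, "rain": 25, "sunshine": 25},
    datetime(2012, 4, 12, 12, 22, 25, 293000): {"wind": 35, "rain": 15, "sunshine": 11},
    datetime(2012, 4, 12, 12, 22, 26, 293000): {"wind": 55, "rain": 65, "sunshine": 16},
    datetime(2012, 4, 12, 12, 22, 27, 293000): {"wind": 12, "rain": 5, "sunshine": 22},
}

average_sunshine = summarize_bucket(row, "sunshine")
average_wind = summarize_bucket(row, "wind")
```

The result would then be written to a separate summary row, so trend queries never have to rescan the raw data.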


----- Original Message -----
From: "Trevor Francis" <trevor.francis@tgrahamcapital.com>
Sent: Wed, April 18, 2012 15:48
Subject: Re: Column Family per User

Janne,
 
 
Of course, I am new to the Cassandra world, so it is taking some getting used to, understanding how everything translates into my MySQL head.
 
We are building an enterprise application that will ingest log information and provide metrics and trending based upon the data contained in the logs. The application is transactional in nature such that a record will be written to a log and our system will need to query that record and assign two values to it in addition to using the information to develop trending metrics.
 
The logs are being fed into cassandra by Flume.
 
Each of our users will be assigned their own piece of hardware that generates these log events, some of which can peak at up to 2500 transactions per second for a couple of hours. The log entries are around 150 bytes each and contain around 20 different pieces of information. Neither we nor our users are interested in generating any queries across the entire database. Users are only concerned with the data that their particular piece of hardware generates.
 
Should I just set up a single column family with 20 columns, the first of which being the row key, and make the row key the username of that user?
 
We would also need probably 2 more columns to store Value A and Value B assigned to that particular record.
 
Our metrics will be something like this: for this particular user, during this particular timeframe, what is the average of field "X"? We would then store that value, from which we can generate historical trending over the course of a week. We will do this every 15 minutes.
 
Any suggestions on where I should head to start my journey into Cassandra for my particular application?
 

Trevor Francis
 

On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote:

 
Each CF takes a fair chunk of memory regardless of how much data it has, so this is probably not a good idea, if you have lots of users. Also using a single CF means that compression is likely to work better (more redundant data).
 
However, Cassandra distributes the load across different nodes based on the row key, and writes scale roughly linearly with the number of nodes. So make sure that no single row gets overly burdened by writes. (50 million writes/day to a single row would always go to the same nodes; this is on the order of 600 writes/second/node, which shouldn't really pose a problem, IMHO.) The main problem is that if a single row gets lots of columns it'll start to slow down at some point, and your row caches become less useful, as they cache the entire row.
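
The "order of 600 writes/second" figure is just the daily total averaged over the day, as this short Python check shows (the variable names are made up for the sketch). It assumes writes are spread evenly; bursts would be higher than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

writes_per_day = 50_000_000
writes_per_second = writes_per_day / SECONDS_PER_DAY

print(round(writes_per_second))          # average rate, rounded
```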
 
Keep your rows suitably sized and you should be fine. To partition the data, you can either distribute it to a few CFs based on use or use some other distribution method (like "user:1234:00", where the "00" is the hour of the day).
 
(There's a great article by Aaron Morton on how wide rows impact performance at http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, running your own tests to determine the optimal setup is recommended.)
 
/Janne

On Apr 18, 2012, at 21:20 , Trevor Francis wrote:

Our application has users that can write upwards of 50 million records per day. However, they all write the same format of records (20 fields/columns). Should I put each user in their own column family, even though the column family schema will be the same per user?
 
Would this help with dimensioning, if each user is querying their keyspace and only their keyspace?
 

Trevor Francis