cassandra-user mailing list archives

From Trevor Francis <trevor.fran...@tgrahamcapital.com>
Subject Re: Column Family per User
Date Wed, 18 Apr 2012 19:48:43 GMT
Janne,


I am new to the Cassandra world, so it is taking some time to understand how everything
translates for my MySQL-trained head.

We are building an enterprise application that will ingest log information and provide metrics
and trending based upon the data contained in the logs. The application is transactional in
nature such that a record will be written to a log and our system will need to query that
record and assign two values to it in addition to using the information to develop trending
metrics. 

The logs are being fed into cassandra by Flume.

Each of our users will be assigned their own piece of hardware that generates these log events,
some of which can peak at up to 2500 transactions per second for a couple of hours. The log
entries are around 150 bytes each and contain around 20 different pieces of information. Neither
we nor our users are interested in running any queries across the entire database. Users
are only concerned with the data that their particular piece of hardware generates.

Should I just set up a single column family with 20 columns, the first of which is the row
key, and make the row key the username of that user?

We would also need probably 2 more columns to store Value A and Value B assigned to that particular
record.

Our metrics will be something like this: for this particular user, during this particular
timeframe, what is the average of field "X"? We would then store that value, from which we can
generate historical trending over the course of a week. We will do this every 15 minutes.
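
To make the metric concrete, here is a rough sketch of the 15-minute average computed in plain Python, independent of how the records are read from Cassandra. The record shape `(user, timestamp, x)` and the function names are assumptions made for illustration only; timestamps are assumed to be timezone-aware.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60  # the 15-minute interval mentioned above


def window_start(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the start of its 15-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def average_per_window(records):
    """records: iterable of (user, timestamp, x) tuples (hypothetical shape).

    Returns {(user, window_start): mean of x within that window}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per key
    for user, ts, x in records:
        acc = sums[(user, window_start(ts))]
        acc[0] += x
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Each resulting value is what would be stored every 15 minutes and later read back for the weekly trend.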

Any suggestions on where I should head to start my journey into Cassandra for my particular
application?


Trevor Francis


On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote:

> 
> Each CF takes a fair chunk of memory regardless of how much data it has, so this is probably
> not a good idea, if you have lots of users. Also using a single CF means that compression
> is likely to work better (more redundant data).
> 
> However, Cassandra distributes the load across different nodes based on the row key,
> and the writes scale roughly linearly according to the number of nodes. So you should be
> fine if you can make sure that no single row gets overly burdened by writes (50 million
> writes/day to a single row would always go to the same nodes; this is on the order of
> 600 writes/second/node, which shouldn't really pose a problem, IMHO). The main problem is
> that if a single row gets lots of columns it'll start to slow down at some point, and your
> row caches become less useful, as they cache the entire row.
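
The "order of 600 writes/second" figure quoted above can be checked with a quick calculation. This is only the arithmetic for an evenly spread daily average; the function name is illustrative, and bursts would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def writes_per_second(writes_per_day: int) -> float:
    """Average write rate if the daily volume were spread evenly over the day."""
    return writes_per_day / SECONDS_PER_DAY


rate = writes_per_second(50_000_000)
print(f"{rate:.0f} writes/second")  # prints "579 writes/second"
```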
> 
> Keep your rows suitably sized and you should be fine. To partition the data, you can
> either distribute it to a few CFs based on use or use some other distribution method (like
> "user:1234:00", where the "00" is the hour of the day).
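
As a minimal sketch of the "user:1234:00" idea, the row key could be built from a user id and the hour of the write. The function names and the choice of UTC are assumptions for illustration, not part of any Cassandra API.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id: int, ts: datetime) -> str:
    """Build a key like 'user:1234:00'; the suffix is the UTC hour of day (assumed)."""
    hour = ts.astimezone(timezone.utc).hour
    return f"user:{user_id}:{hour:02d}"


def keys_for_user(user_id: int) -> list[str]:
    """All 24 hourly row keys a reader would need to cover one user."""
    return [f"user:{user_id}:{h:02d}" for h in range(24)]
```

With only the hour of day in the key, the same 24 rows per user are reused every day, so they keep growing over time. Adding a date component to the key would be a variation on the scheme, not something proposed in the thread.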
> 
> (There's a great article by Aaron Morton on how wide rows impact performance at
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/,
> but as always, running your own tests to determine the optimal setup is recommended.)
> 
> /Janne
> 
> On Apr 18, 2012, at 21:20 , Trevor Francis wrote:
> 
>> Our application has users that can write upwards of 50 million records per day.
>> However, they all write the same format of records (20 fields…columns). Should I put each
>> user in their own column family, even though the column family schema will be the same per
>> user?
>> 
>> Would this help with dimensioning, if each user is querying their keyspace and only
>> their keyspace?
>> 
>> 
>> Trevor Francis
>> 
>> 
> 

