incubator-cassandra-user mailing list archives

From Trevor Francis <>
Subject Re: Column Family per User
Date Wed, 18 Apr 2012 21:20:26 GMT
Regarding rotating, I was thinking about the concept of log rotation, where you write to a file
for a specific period of time, then create a new file and write to that one for the next
period. So yes, it closes a row and opens another row.

Since I will be generating analytics every 15 minutes, it would make sense to me to bucket
a row every 15 minutes. Since I would only have at most 500 users, this doesn't strike me
as too many rows in a given day (48,000). Potential downsides to doing this?
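
A minimal sketch of the 15-minute bucketing idea, assuming a hypothetical row key format of "userid:bucket-start" (the thread does not fix a key format, so `bucket_row_key` and its layout are illustrative only). It also reproduces the 48,000 rows/day estimate.

```python
from datetime import datetime

BUCKET_MINUTES = 15  # analytics interval mentioned in the message


def bucket_row_key(user_id: str, ts: datetime) -> str:
    """Return a row key for the 15-minute bucket containing ts.

    The "user:YYYY-MM-DDTHH:MM" layout is an assumption made for this sketch.
    """
    floored_minute = ts.minute - ts.minute % BUCKET_MINUTES
    bucket_start = ts.replace(minute=floored_minute, second=0, microsecond=0)
    return f"{user_id}:{bucket_start.strftime('%Y-%m-%dT%H:%M')}"


def rows_per_day(users: int) -> int:
    """Number of bucket rows created per day across all users."""
    buckets_per_day = 24 * 60 // BUCKET_MINUTES  # 96 buckets
    return users * buckets_per_day


if __name__ == "__main__":
    print(bucket_row_key("user42", datetime(2012, 4, 18, 21, 20, 26)))
    print(rows_per_day(500))
```

Every write whose timestamp falls in the same quarter hour maps to the same row, so a reader can fetch one bucket per analytics run.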

Since I am analyzing 20 separate data points for a given log entry, it would make sense that
querying based upon a specific metric (wind, rain, sunshine) would be easier if the data was
separated. However, couldn't we build composite columns for time and value where all that
would be left in "data"?

So composite row key would be:


And Columns would be: 




Data would be:

Or the columns could be 12:22:23.293


Or something like that….Am I headed in the right direction?

Trevor Francis

On Apr 18, 2012, at 3:10 PM, Janne Jalkanen wrote:

> Hi!
> A simple model to do this would be
> * ColumnFamily "Data"
>   * key: userid
>   * column: Composite( timestamp, entrytype ) = value
> For example, userid "janne" would have columns 
>    (2012-04-12T12:22:23.293,speed) = 24;
>    (2012-04-12T12:22:23.293,temperature) = 12.4
>    (2012-04-12T12:22:23.293,direction) = 356;
>    (2012-04-12T12:22:23.295,speed) = 24.1;
>    (2012-04-12T12:22:23.295,temperature) = 12.3
>    (2012-04-12T12:22:23.295,direction) = 352;
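A rough in-memory sketch of how such a row behaves, assuming the columns are kept sorted by (timestamp, entrytype) the way a composite comparator would order them. It uses the ISO-string timestamps from the example rather than UUIDs, and `slice_columns` and `average` are hypothetical helpers, not a Cassandra client API.

```python
from bisect import bisect_left

# Columns of the row with key "janne", taken from the example above.
# Sorting by the (timestamp, entrytype) name imitates a composite comparator.
columns = sorted([
    (("2012-04-12T12:22:23.293", "speed"), 24),
    (("2012-04-12T12:22:23.293", "temperature"), 12.4),
    (("2012-04-12T12:22:23.293", "direction"), 356),
    (("2012-04-12T12:22:23.295", "speed"), 24.1),
    (("2012-04-12T12:22:23.295", "temperature"), 12.3),
    (("2012-04-12T12:22:23.295", "direction"), 352),
])


def slice_columns(columns, start_ts, end_ts):
    """Return the columns with start_ts <= timestamp < end_ts."""
    names = [name for name, _ in columns]
    lo = bisect_left(names, (start_ts, ""))
    hi = bisect_left(names, (end_ts, ""))
    return columns[lo:hi]


def average(columns, entrytype):
    """Average the values of one entry type over the given columns."""
    values = [value for (_, kind), value in columns if kind == entrytype]
    return sum(values) / len(values)


if __name__ == "__main__":
    first = slice_columns(columns, "2012-04-12T12:22:23.293", "2012-04-12T12:22:23.295")
    print([name for name, _ in first])
    print(average(columns, "speed"))
```

Because the timestamp is the first component of the column name, a time-range read is a contiguous slice of the row, and a per-metric average only has to filter on the second component.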
> Note that Cassandra does not require you to know which columns you're going to put in
it (unlike MySQL). You can declare types ahead if you know what they are, but if you'll need
to start adding a new column, just start writing it and Cassandra should do the right things.
> However, there are a few points which you might want to consider
> * Using ISO dates for timestamps has a minor problem: if two events occur during the
same millisecond, they'll overwrite each other. This is why most time series in C* use TimeUUIDs,
which contain a millisecond timestamp + a random component. (
> * This will generate timestamp*entrytype columns. So for 2500 entries/second and 20 columns
this means about 2500*20 = 50000 wps (granted that you will most probably batch the writes
though). You will need to performance test your cluster to see if this schema is right for
you. If not, you might want to try and see how you can distribute the keys differently, e.g.
by bucketing the data somehow. However, I recommend that you build a first-shot of your app
structure, then load test it until it breaks, and that should give you a pretty good understanding
of what exactly Cassandra is doing.
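A small sketch of the two points above, using a plain dict as a stand-in for a row and Python's `uuid.uuid1()` as a stand-in for whatever TimeUUID generator your client provides (both are assumptions for illustration). Two writes with the same millisecond timestamp and entry type collapse into one column, while version 1 UUIDs stay distinct.

```python
import uuid


def write_all(column_names, values):
    """Write values into a dict 'row'; a repeated column name overwrites the old value."""
    row = {}
    for name, value in zip(column_names, values):
        row[name] = value
    return row


# Two "speed" events that happen within the same millisecond.
same_ms = "2012-04-12T12:22:23.293"
iso_row = write_all([(same_ms, "speed"), (same_ms, "speed")], [24, 24.1])

# The same two events keyed by time-based (version 1) UUIDs.
uuid_row = write_all([(uuid.uuid1(), "speed"), (uuid.uuid1(), "speed")], [24, 24.1])

# Column writes per second at peak: entries/second times columns per entry.
writes_per_second = 2500 * 20

if __name__ == "__main__":
    print(len(iso_row), len(uuid_row), writes_per_second)
```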
> To then do the analytics, multiple options are possible; a popular one is to run MapReduce
queries using a tool like Apache Pig at regular intervals. DataStax has good documentation
and you probably want to take a look at their offering as well, since they have pretty good
Hadoop/MapReduce support for Cassandra.
> CLI syntax to try with:
> create keyspace DataTest with placement_strategy='org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = {replication_factor:1};
> use DataTest;
> create column family Data with key_validation_class=UTF8Type and comparator='CompositeType(UUIDType,UTF8Type)';
> Then start writing using your fav client.
> /Janne
> On Apr 18, 2012, at 22:36 , Trevor Francis wrote:
>> Janne,
>> Of course, I am new to the Cassandra world, so it is taking some getting used to,
understanding how everything translates into my MySQL head.
>> We are building an enterprise application that will ingest log information and provide
metrics and trending based upon the data contained in the logs. The application is transactional
in nature such that a record will be written to a log and our system will need to query that
record and assign two values to it in addition to using the information to develop trending.
>> The logs are being fed into cassandra by Flume.
>> Each of our users will be assigned their own piece of hardware that generates these
log events, some of which can peak at up to 2500 transactions per second for a couple of hours.
The log entries are around 150-bytes each and contain around 20 different pieces of information.
Neither we nor our users are interested in generating any queries across the entire database.
Users are only concerned with the data that their particular piece of hardware generates.

>> Should I just set up a single column family with 20 columns, the first of which is
the row key, and make the row key the username of that user?
>> We would also need probably 2 more columns to store Value A and Value B assigned
to that particular record.
>> Our metrics will be something like this: For this particular user, during this
particular timeframe, what is the average of field "X"? And then store that value, which we
can use to generate historical trending over the course of a week. We will do this every 15 minutes.

>> Any suggestions on where I should head to start my journey into Cassandra for my
particular application?
>> Trevor Francis
>> On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote:
>>> Each CF takes a fair chunk of memory regardless of how much data it has, so this
is probably not a good idea, if you have lots of users. Also using a single CF means that
compression is likely to work better (more redundant data).
>>> However, Cassandra distributes the load across different nodes based on the row
key, and the writes scale roughly linearly according to the number of nodes. So you want to
make sure that no single row gets overly burdened by writes (50 million writes/day to a single
row would always go to the same nodes - this is on the order of 600 writes/second/node, which
shouldn't really pose a problem, IMHO). The main problem is that if a single row gets lots
of columns it'll start to slow down at some point, and your row caches become less useful,
as they cache the entire row.
>>> Keep your rows suitably sized and you should be fine. To partition the data,
you can either distribute it to a few CFs based on use or use some other distribution method
(like "user:1234:00", where the "00" is the hour of the day).
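A short sketch of the hour-of-the-day row key and the write-rate arithmetic, assuming writes are spread evenly over the day (real traffic will peak, so treat the numbers as averages). `hourly_row_key` is a hypothetical helper that follows the "user:1234:00" pattern.

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def hourly_row_key(user_id: int, ts: datetime) -> str:
    """Build a row key like 'user:1234:00', where the suffix is the hour of the day."""
    return f"user:{user_id}:{ts.hour:02d}"


def average_writes_per_second(writes_per_day: int) -> float:
    """Average write rate if the daily volume were spread evenly over the day."""
    return writes_per_day / SECONDS_PER_DAY


def columns_per_hourly_row(writes_per_day: int) -> int:
    """Approximate columns added per day to each of the 24 hourly rows."""
    return writes_per_day // 24


if __name__ == "__main__":
    print(hourly_row_key(1234, datetime(2012, 4, 18, 0, 5)))
    print(round(average_writes_per_second(50_000_000)))
    print(columns_per_hourly_row(50_000_000))
```

Splitting one user's data over 24 rows keeps each row to roughly two million new columns per day instead of 50 million, at the cost of reading several rows for queries that span hours.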
>>> (There's a great article by Aaron Morton on how wide rows impact performance
at, but as always, running your
own tests to determine the optimal setup is recommended.)
>>> /Janne
>>> On Apr 18, 2012, at 21:20 , Trevor Francis wrote:
>>>> Our application has users that can write in upwards of 50 million records
per day. However, they all write the same format of records (20 fields…columns). Should
I put each user in their own column family, even though the column family schema will be the
same per user?
>>>> Would this help with dimensioning, if each user is querying their keyspace
and only their keyspace?
>>>> Trevor Francis
