cassandra-user mailing list archives

From Mark Jones
Subject RE: Cassandra data model for financial data
Date Thu, 29 Apr 2010 15:49:40 GMT
At the moment, they all have to fit in memory during compaction: all the Columns or
SuperColumns for one key.

From: Andrew Nguyen
Sent: Thursday, April 29, 2010 10:30 AM
Subject: Re: Cassandra data model for financial data

What is the upper limit on the number of super columns?  Is it pretty much the same as for
columns in general?

On Apr 28, 2010, at 10:09 PM, Schubert Zhang wrote:

key : stock ID,  e.g. AAPL+year
column family: closing price and volume, two CFs.
column name: timestamp, LongType

AAPL+2010-> CF:closingPrice -> {'04-13' : 242, '04-14': 245}
AAPL+2010-> CF:volume -> {'04-13' : 242, '04-14': 245}
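
A minimal Python sketch of the layout above, assuming plain dicts stand in for the two column families. The helper names (`row_key`, `column_name`, `record`) are hypothetical, the 'MM-DD' column names follow the example rows rather than a LongType timestamp, and the volume figures are the ones quoted for Design 1 elsewhere in this thread.

```python
from datetime import date

# Each column family is modelled as: row key -> {column name -> value}.
closing_price = {}
volume = {}

def row_key(ticker, day):
    """Bucket one ticker's data by year, e.g. 'AAPL+2010'."""
    return f"{ticker}+{day.year}"

def column_name(day):
    """Column name within the yearly row, e.g. '04-13'."""
    return day.strftime("%m-%d")

def record(ticker, day, close, vol):
    """Write one trading day into both column families."""
    key = row_key(ticker, day)
    closing_price.setdefault(key, {})[column_name(day)] = close
    volume.setdefault(key, {})[column_name(day)] = vol

record("AAPL", date(2010, 4, 13), 242, 10_900_000)
record("AAPL", date(2010, 4, 14), 245, 14_400_000)
print(closing_price)
print(volume)
```

Bucketing the key by year keeps any single row bounded to roughly one year of columns, which matters given the in-memory compaction limit mentioned at the top of this message.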

On Thu, Apr 22, 2010 at 2:00 AM, Miguel Verde wrote:
On Wed, Apr 21, 2010 at 12:17 PM, Steve Lihn wrote:

Design 1: Each attribute is a super column. Therefore each date is a column. So we have:

AAPL -> closingPrice -> { '2010-04-13' : 242, '2010-04-14': 245 }
AAPL -> volume -> { '2010-04-13' : 10.9m, '2010-04-14': 14.4m }
I would suggest not using this design, as each query involving an attribute will pull all
dates for that attribute into memory on the server.  For example, getting the closingPrice
for AAPL on '2010-04-13' would pull all closing prices for AAPL across all dates into memory.

Design 2: Each date is a super column. Therefore each attribute is a column. So we have:

AAPL -> '2010-04-13' -> { closingPrice -> 242, volume -> 10.9m }
AAPL -> '2010-04-14' -> {closingPrice -> 245, volume -> 14.4m }
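
To make the trade-off between the two designs concrete, here is a toy Python model. It assumes, as described in the reply to Design 1, that reading any sub-column loads the whole super column into memory. The function name and the synthetic 250-day history are made up for illustration; this is not how a client library is called.

```python
ATTRIBUTES = ("closingPrice", "volume")
# Synthetic history: 250 placeholder trading days with dummy values.
DAYS = [f"day-{i:03d}" for i in range(250)]

# Design 1: key -> attribute (super column) -> date (column) -> value
design1 = {"AAPL": {attr: {day: 0 for day in DAYS} for attr in ATTRIBUTES}}
# Design 2: key -> date (super column) -> attribute (column) -> value
design2 = {"AAPL": {day: {attr: 0 for attr in ATTRIBUTES} for day in DAYS}}

def read_subcolumn(store, key, super_column, column):
    """Return (value, number of columns loaded to serve the read)."""
    loaded = store[key][super_column]  # assumption: whole super column is loaded
    return loaded[column], len(loaded)

_, cost1 = read_subcolumn(design1, "AAPL", "closingPrice", "day-100")
_, cost2 = read_subcolumn(design2, "AAPL", "day-100", "closingPrice")
print(cost1, cost2)
```

Under that assumption, the per-read cost of Design 1 grows with the length of the history, while Design 2 stays at the number of attributes.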

The date column / superColumn will need the Order Preserving Partitioner since we are going
to do a lot of range queries.

Partitioners split up keys between nodes; the partitioner you use has no effect on your
ability to query columns within a row.
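
A toy illustration of that separation, in Python. Here `node_for` is a hypothetical stand-in for a hash-based partitioner, and `slice_columns` models a column range read inside one row; neither is a real Cassandra call. The AAPL values are the closing prices quoted in this thread.

```python
import bisect
import hashlib

def node_for(row_key, num_nodes):
    """Pick a node from a hash of the row key (toy stand-in for a partitioner)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def slice_columns(row, start, finish):
    """Return the (name, value) pairs with start <= name <= finish.

    `row` is a list of pairs already sorted by column name, which is how
    the columns of a single row are modelled here.
    """
    names = [name for name, _ in row]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return row[lo:hi]

aapl = [("2010-04-13", 242), ("2010-04-14", 245)]

# Where the row lives depends only on the key...
print(node_for("AAPL", 4))
# ...while the range read depends only on column order inside the row.
print(slice_columns(aapl, "2010-04-14", "2010-04-30"))
```

In this model, swapping `node_for` for any other placement rule changes which node holds the row but leaves `slice_columns` untouched.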

Examples are:
Query 1: Give me the data between date1 and date2 for a set of tickers (say, the 100 tickers
in QQQ).
You could use for this.

Query 2: More often than not, the query is: Give me the data for the max available dates (for
each ticker) between date1 and date2 in a set of tickers.
(Since not every day is traded, and we only want the most recent data, given a range of dates.)
A allows you to specify limits and ordering
for columns you are slicing.
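
One way to picture Query 2 as a reversed slice with a limit of one is the Python sketch below. It models rows as dicts keyed by ISO date strings (which sort chronologically). The names `latest_in_range` and `latest_for_tickers` and the ticker `XYZ` are hypothetical; the AAPL values come from the Design 2 example.

```python
import bisect

rows = {
    "AAPL": {
        "2010-04-13": {"closingPrice": 242, "volume": 10_900_000},
        "2010-04-14": {"closingPrice": 245, "volume": 14_400_000},
    },
    "XYZ": {},  # hypothetical ticker with no data at all
}

def latest_in_range(columns, start, finish):
    """Return (date, value) for the newest date in [start, finish], or None."""
    names = sorted(columns)
    hi = bisect.bisect_right(names, finish)  # first index past `finish`
    if hi == 0 or names[hi - 1] < start:
        return None  # nothing traded inside the requested range
    name = names[hi - 1]
    return name, columns[name]

def latest_for_tickers(tickers, start, finish):
    """Apply the same reversed, limit-one read to every ticker in the set."""
    return {t: latest_in_range(rows.get(t, {}), start, finish) for t in tickers}

result = latest_for_tickers(["AAPL", "XYZ"], "2010-04-10", "2010-04-18")
print(result)
```

Walking the range from the newest date backwards and stopping after one column avoids fetching days that would be discarded anyway.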
