incubator-cassandra-user mailing list archives

From Steve Lihn <>
Subject Cassandra data model for financial data
Date Wed, 21 Apr 2010 17:17:51 GMT
I am new to Cassandra and would like to use it to store financial data
(time series). I have a question about the data model design.

The example here is daily stock data. This would be a column family
called dailyStockData. The row key is the stock ticker.
Every day there are attributes like closingPrice, volume, sharesOutstanding,
etc. that need to be stored. There seem to be two ways to model it:

Design 1: Each attribute is a super column. Therefore each date is a column.
So we have:

AAPL -> closingPrice -> { '2010-04-13' : 242, '2010-04-14': 245 }
AAPL -> volume -> { '2010-04-13' : 10.9m, '2010-04-14': 14.4m }

Design 2: Each date is a super column. Therefore each attribute is a column.
So we have:

AAPL -> '2010-04-13' -> { closingPrice -> 242, volume -> 10.9m }
AAPL -> '2010-04-14' -> { closingPrice -> 245, volume -> 14.4m }
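
To make the two layouts concrete, here is a small Python sketch that uses
plain nested dicts as a stand-in for row key -> super column -> column. It is
only an illustration, not Cassandra client code; the helper name transpose is
made up for this example, and the volumes are written out in full (10.9m as
10_900_000).

```python
# Illustrative only: nested dicts standing in for
# row key -> super column -> column -> value.

# Design 1: each attribute is a super column, each date is a column.
design1 = {
    "AAPL": {
        "closingPrice": {"2010-04-13": 242, "2010-04-14": 245},
        "volume": {"2010-04-13": 10_900_000, "2010-04-14": 14_400_000},
    }
}

# Design 2: each date is a super column, each attribute is a column.
design2 = {
    "AAPL": {
        "2010-04-13": {"closingPrice": 242, "volume": 10_900_000},
        "2010-04-14": {"closingPrice": 245, "volume": 14_400_000},
    }
}


def transpose(row):
    """Swap the super column and column levels of one row (hypothetical helper)."""
    out = {}
    for super_name, columns in row.items():
        for column_name, value in columns.items():
            out.setdefault(column_name, {})[super_name] = value
    return out
```

Transposing one design's row gives the other's, so both hold the same
information. The difference is only in which level the dates sit at, and
therefore which level a date range has to be applied to.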

The date column / superColumn will need Order Preserving Partitioner since
we are going to do a lot of range queries. Examples are:
Query 1: Give me the data between date1 and date2 for a set of tickers (say,
the 100 tickers in QQQ).
Query 2: More often than not, the query is: Give me the data for the max
available date (for each ticker) between date1 and date2 in a set of
tickers. (Since not every day is a trading day, and we only want the most
recent data, given a range of dates.)
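
In plain Python terms, the two queries would look roughly like the sketch
below. It assumes the Design 2 layout (date -> attributes) held in an
in-memory dict and dates stored as 'YYYY-MM-DD' strings, so that string order
matches date order. The function names are made up for illustration, and
nothing here is a Cassandra API.

```python
from bisect import bisect_left, bisect_right

# Design 2 layout for one ticker, using the sample values from this message.
rows = {
    "AAPL": {
        "2010-04-13": {"closingPrice": 242, "volume": 10_900_000},
        "2010-04-14": {"closingPrice": 245, "volume": 14_400_000},
    }
}


def dates_in_range(row, date1, date2):
    """Query 1 for one ticker: every date d with date1 <= d <= date2."""
    dates = sorted(row)  # ISO date strings sort chronologically
    lo = bisect_left(dates, date1)
    hi = bisect_right(dates, date2)
    return {d: row[d] for d in dates[lo:hi]}


def latest_in_range(row, date1, date2):
    """Query 2 for one ticker: the most recent available date in the range."""
    in_range = dates_in_range(row, date1, date2)
    if not in_range:
        return None  # no trading day fell inside the range
    last = max(in_range)
    return last, in_range[last]


def query_tickers(rows, tickers, date1, date2, per_ticker):
    """Apply one of the per-ticker queries above to a set of tickers."""
    return {t: per_ticker(rows[t], date1, date2) for t in tickers if t in rows}
```

For example, query_tickers(rows, ["AAPL"], "2010-04-01", "2010-04-30",
latest_in_range) picks the 2010-04-14 entry for AAPL.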

My questions are:
a. Is there any technical reason to prefer (or must choose) one rather than
the other between Design 1 and Design 2?
b. Are both queries possible (and comparable in speed) for the chosen design?

