cassandra-user mailing list archives

From Chin Ko
Subject Selecting rows efficiently from a Cassandra CF containing time series data
Date Tue, 11 Dec 2012 14:23:54 GMT
I would like to get some opinions on how to select an incremental range of
rows efficiently from a Cassandra CF containing time series data.

We have a web application that uses a Cassandra CF as logging storage. We
insert a row into the CF for every "event" of each user of the web
application. The row key is timestamp+userid. The column values are
unstructured data. We only insert rows but never update or delete any rows
in the CF.

Data volume:
The CF grows by about 0.5 million rows per day. We have a 4-node cluster
and use the RandomPartitioner to spread the rows across the nodes.

There is a need to transfer the Cassandra data to another relational
database periodically. Due to the large size of the CF, instead of
truncating the relational table and reloading all rows into it each time,
we plan to run a job to select the "delta" rows since the last run and
insert them into the relational database.
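To make the "delta" idea concrete, here is a minimal sketch in Python. It
assumes the job persists a high-water mark, that is, the timestamp of the
newest row it has already transferred. The function name, the tuple layout,
and the integer timestamps are hypothetical and only for illustration; no
Cassandra client calls are shown.

```python
def extract_delta(rows, watermark):
    """Return (delta_rows, new_watermark).

    rows: iterable of (timestamp, userid, data) tuples (hypothetical shape).
    watermark: timestamp of the newest row transferred by the previous run.
    """
    # Keep only rows strictly newer than the previous run's watermark.
    delta = sorted((r for r in rows if r[0] > watermark), key=lambda r: r[0])
    # Advance the watermark only if something new was found.
    new_watermark = delta[-1][0] if delta else watermark
    return delta, new_watermark


rows = [(100, "u1", "a"), (105, "u2", "b"), (110, "u1", "c")]
first_delta, first_mark = extract_delta(rows, 100)
second_delta, second_mark = extract_delta(rows, first_mark)
```

Note that a strict "newer than" comparison would miss rows that arrive late
with a timestamp at or below the watermark, so a real job might need an
overlap window plus de-duplication on the relational side.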

We would like some flexibility in how often the data transfer job runs. It
may run several times a day, or not at all on a given day.

Options considered:
- We are using RandomPartitioner, so range scan by row key is not feasible.
- Add a secondary index on the timestamp column, but reading rows via
secondary index still requires an equality condition and does not support
range scan.
- Add a secondary index on a column containing the date and hour of the
timestamp. Iterate over each hour between the time the job was last run and
now, and fetch all rows for each hour.
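
For the last option, the hourly iteration could look roughly like the
Python sketch below. The "YYYYMMDDHH" label format and the function name
are assumptions for illustration; each label would be the equality value
passed to the secondary index lookup, which is not shown here.

```python
from datetime import datetime, timedelta


def hour_buckets(last_run, now):
    """Yield one 'YYYYMMDDHH' label per hour from last_run to now, inclusive."""
    # Start at the top of the hour in which the previous run happened.
    current = last_run.replace(minute=0, second=0, microsecond=0)
    while current <= now:
        yield current.strftime("%Y%m%d%H")
        current += timedelta(hours=1)


buckets = list(hour_buckets(datetime(2012, 12, 11, 9, 30),
                            datetime(2012, 12, 11, 14, 23)))
```

The first and last buckets are partial hours. Rows in the first bucket may
already have been transferred by the previous run, so the job would still
need to filter on the exact timestamp or de-duplicate on insert.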

I would appreciate any ideas for other ways to design the Cassandra CF so
that the rows can be extracted efficiently.

Besides Java, has anyone used any ETL tools to do this kind of delta
extraction from Cassandra?

