We currently run our Cassandra deployment as multiple independent clusters.  The clusters are totally self-contained in terms of redundancy and independent from each other.  We have a "sharding" layer higher in our stack that dispatches requests to the right application stack, and each stack connects to its associated Cassandra cluster. All the Cassandra clusters are identical in terms of hosted keyspaces, column families, replication factor, and so on.

At this point I am investigating ways to build a central Cassandra cluster that could contain all the data from all the other Cassandra clusters, and I am wondering how best to do it.  The goal is to have a global view of our data and to be able to do some large-scale crunching on it.

We could certainly build an ETL-style job that figures out which data was updated, extracts it, and loads it into the central Cassandra cluster.  On this mailing list I found a GitHub project that does something similar by reading the commit logs: https://github.com/carloscm/cassandra-commitlog-extract
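
To make the ETL idea a bit more concrete, here is a rough sketch of the kind of incremental job I have in mind. It is only an illustration: plain Python dictionaries stand in for the clusters, every name in it (`extract_updates`, `load_into_central`, `sync_once`, the `cluster_id` key prefix, the per-row write timestamp used as a watermark) is hypothetical, and nothing here is a Cassandra API.

```python
from typing import Dict, Tuple

# Row key -> (value, write_timestamp). A stand-in for a real column family;
# the per-row timestamp is an assumption made for this sketch.
Table = Dict[str, Tuple[str, int]]


def extract_updates(source: Table, since: int) -> Table:
    """Return the rows written strictly after the `since` watermark."""
    return {key: (value, ts) for key, (value, ts) in source.items() if ts > since}


def load_into_central(central: Table, cluster_id: str, updates: Table) -> int:
    """Merge updates into the central table; the newest timestamp wins.

    Keys are prefixed with the (hypothetical) cluster id so that rows from
    different source clusters cannot collide. Returns the number of rows written.
    """
    applied = 0
    for key, (value, ts) in updates.items():
        central_key = f"{cluster_id}:{key}"
        current = central.get(central_key)
        if current is None or ts > current[1]:
            central[central_key] = (value, ts)
            applied += 1
    return applied


def sync_once(central: Table, cluster_id: str, source: Table, watermark: int) -> int:
    """Run one extract/load pass and return the new watermark for this cluster."""
    updates = extract_updates(source, watermark)
    load_into_central(central, cluster_id, updates)
    # If nothing changed, keep the previous watermark.
    return max((ts for _, ts in updates.values()), default=watermark)


if __name__ == "__main__":
    central: Table = {}
    shard: Table = {"user1": ("alice", 10), "user2": ("bob", 20)}
    watermark = sync_once(central, "cluster-a", shard, watermark=0)
    print(watermark, central)
```

The job would keep one watermark per source cluster and run `sync_once` periodically. Note that this sketch does not handle deletes at all, which a real job would have to deal with.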

But are there other options, such as using a custom replication strategy?  Any other general suggestions?

Francois Richard