I'm using Cassandra as a big graph database, loading large volumes of data live and linking on the fly.
The number of edges grows geometrically as data is added, and those edges need to be read back to continue linking the graph on the fly.
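Roughly, the model is an adjacency list in wide rows - one row per vertex, one column per edge. A minimal sketch of what I mean, in modern CQL terms via the DataStax Java driver (the keyspace, table, and keys are illustrative, not my actual schema):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class EdgeStore {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Hypothetical schema: one wide row per vertex, one column per out-edge.
        session.execute("CREATE KEYSPACE IF NOT EXISTS graph WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS graph.edges ("
                + "src text, dst text, PRIMARY KEY (src, dst))");

        // Linking on the fly: each new datum writes an edge...
        session.execute("INSERT INTO graph.edges (src, dst) VALUES (?, ?)", "a", "b");

        // ...and immediately needs its neighbours read back to keep linking,
        // which is why the workload ends up read-dominated and quasi-random.
        for (Row row : session.execute("SELECT dst FROM graph.edges WHERE src = ?", "a")) {
            System.out.println("a -> " + row.getString("dst"));
        }
        cluster.close();
    }
}
```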
Consequently, my problem is constrained by:
* Predominantly reads - especially once the data gets large, and the reads are quasi-random
* I have lots of data to plow in, all of which then needs to be read back
* Although the problem could scale out and possibly fit entirely in RAM, that would require too much kit to be viable
So, my findings with Cassandra are:
* Compaction is expensive; I need it, but:
1) it takes disk IO away from my reads
2) it destroys the file cache
I've not had a chance to test the leveled (LevelDB-style) compaction extensively - see the sketch after this list
* Historically, compaction has been too hard to configure
* Memory hungry
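For reference, this is roughly how I'd expect to switch a table over to leveled compaction when I do get to test it properly (the table name and SSTable size below are illustrative, not settings I've validated):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SwitchToLeveled {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        // Leveled compaction trades more compaction IO on write for fewer
        // SSTables touched per read - attractive for a read-dominated workload.
        session.execute("ALTER TABLE graph.edges WITH compaction = "
                + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}");
        cluster.close();
    }
}
```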
So for me the biggest features would be:
* Cheaper compaction
* Lower memory usage
* Indexing of dynamic colnames (e.g. a Lucene TermEnum against rowkey:colkey)
I do a lot of checking against dynamic colnames - see the sketch below
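To illustrate the kind of check I mean, here is a sketch against the Lucene 3.x TermEnum API, enumerating every indexed rowkey:colkey term for one row without reading the row itself (the field name, key format, and index path are made up for the example):

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ColNameScan {
    // List every dynamic column name indexed for one row key.
    public static void scan(String rowKey) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/colname-index"));
        IndexReader reader = IndexReader.open(dir);
        String prefix = rowKey + ":";
        // terms() positions the enum at the first term >= the given one.
        TermEnum terms = reader.terms(new Term("colname", prefix));
        try {
            do {
                Term t = terms.term();
                // Stop once we leave the field or run past the prefix.
                if (t == null || !"colname".equals(t.field())
                        || !t.text().startsWith(prefix)) {
                    break;
                }
                System.out.println("column present: " + t.text());
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
    }
}
```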
The great features are that redundancy and live addition of shards are available out of the box.
I've also experimented with GoldenOrb and triggered updates; I think a fair bit of my problem can be solved with local data access. Through GoldenOrb and Hadoop Writables I managed to get both a BigTable and a Pregel access model onto my Cassandra data. It was schema-specific, but it provided a local compute model - a rough sketch of the vertex shape follows.
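The Writable below is a heavy simplification (GoldenOrb's actual base classes and my real schema differ); it just shows how a Cassandra wide row - row key as vertex id, dynamic colnames as out-edges - round-trips through Hadoop serialization so the same record serves both the BigTable and Pregel views:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;

// Schema-specific vertex: mirrors one Cassandra wide row.
public class GraphVertex implements Writable {
    private String rowKey;                // Cassandra row key = vertex id
    private List<String> edgeColumns =    // dynamic colnames = out-edges
            new ArrayList<String>();

    public void write(DataOutput out) throws IOException {
        out.writeUTF(rowKey);
        out.writeInt(edgeColumns.size());
        for (String edge : edgeColumns) {
            out.writeUTF(edge);
        }
    }

    public void readFields(DataInput in) throws IOException {
        rowKey = in.readUTF();
        int n = in.readInt();
        edgeColumns = new ArrayList<String>(n);
        for (int i = 0; i < n; i++) {
            edgeColumns.add(in.readUTF());
        }
    }
}
```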