incubator-cassandra-user mailing list archives

From Benson Margulies
Subject Cassandra as write-behind, Cassandra as Cache
Date Fri, 18 Feb 2011 22:25:04 GMT
Cassandra as dessert topping? Cassandra as floor-wax?

I do apologize for this basket of clueless questions, but I'm
exploring new territory for me.

The overall problem has two datasets with distinct storage characteristics.

The first is a set of data that can fit in memory, but which needs
reliable persistence. In the first instance, there's no need for
replication of this data. One server can have it in memory, it can
update it, but it needs to persist the updates to disk, reliably, so
that it can pick up where it left off. This data is shaped like a hash
table (it's an LSH implementation, if anyone cares) so that there is
on the order of 50-100 'tables', each with 2^13 slots, each slot
containing an array of pairs of strings. In memory on one machine,
it's just 72 ordinary Java arrays of references to arrays of strings.
This is enough to accommodate the results of applying it to 1M
documents. The arrays are of bounded size.
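As a rough sketch of the in-memory layout described above, something like
the following would do. The class name LshTables, the flattened
pair-per-two-elements storage, and the maxPairsPerSlot bound are all
assumptions made for illustration, not the actual implementation.

```java
import java.util.Arrays;

/** Hypothetical in-memory layout: tables x slots, each slot a bounded array of string pairs. */
final class LshTables {
    static final int SLOTS_PER_TABLE = 1 << 13; // 2^13 slots per table, as described above

    // tables[table][slot] holds pairs flattened as {first0, second0, first1, second1, ...}
    private final String[][][] tables;
    private final int maxPairsPerSlot; // assumed bound on slot size

    LshTables(int tableCount, int maxPairsPerSlot) {
        this.tables = new String[tableCount][SLOTS_PER_TABLE][];
        this.maxPairsPerSlot = maxPairsPerSlot;
    }

    /** Appends a pair to a slot; returns false if the slot is already full. */
    boolean insert(int table, int slot, String first, String second) {
        String[] current = tables[table][slot];
        int used = (current == null) ? 0 : current.length;
        if (used / 2 >= maxPairsPerSlot) {
            return false;
        }
        String[] grown = (current == null) ? new String[2] : Arrays.copyOf(current, used + 2);
        grown[used] = first;
        grown[used + 1] = second;
        tables[table][slot] = grown;
        return true;
    }

    /** Number of pairs currently stored in a slot. */
    int pairCount(int table, int slot) {
        String[] current = tables[table][slot];
        return (current == null) ? 0 : current.length / 2;
    }
}
```

With 72 tables this allocates roughly 590,000 slot references up front,
which is small next to the strings themselves.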

To use Cassandra as the persistence mechanism, I would be using it as
a fast log. Each insertion would create an item consisting of a
generated timestamp key, the table index, the slot index, and a pair
of strings. Loading up for a reboot would mean reading all the records
and building the memory data structure.
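The log-and-replay idea above can be sketched independently of any storage
engine. In the Java below, a plain List stands in for whatever Cassandra
would return on a full scan; the LogRecord class, its field names, and the
map-based rebuilt structure are hypothetical and only illustrate the record
shape and the rebuild loop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LshLogReplay {
    /** One logged insertion: timestamp key, table index, slot index, and a pair of strings. */
    static final class LogRecord {
        final long timestamp;
        final int table;
        final int slot;
        final String first;
        final String second;

        LogRecord(long timestamp, int table, int slot, String first, String second) {
            this.timestamp = timestamp;
            this.table = table;
            this.slot = slot;
            this.first = first;
            this.second = second;
        }
    }

    /**
     * Rebuilds slot contents from the log. Records are applied in timestamp
     * order, so each slot ends up with its pairs in original insertion order.
     * The map key is table * slotsPerTable + slot.
     */
    static Map<Integer, List<String[]>> replay(List<LogRecord> records, int slotsPerTable) {
        List<LogRecord> ordered = new ArrayList<>(records);
        ordered.sort(Comparator.comparingLong(r -> r.timestamp));
        Map<Integer, List<String[]>> slots = new HashMap<>();
        for (LogRecord r : ordered) {
            int key = r.table * slotsPerTable + r.slot;
            slots.computeIfAbsent(key, k -> new ArrayList<>())
                 .add(new String[] {r.first, r.second});
        }
        return slots;
    }
}
```

Sorting by timestamp only matters if the order of pairs within a slot is
significant; if it is not, the records can be applied in whatever order
they come back.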

The other part of this is, in some ways, a lot simpler. It's a
key-value map from string keys to blobs, where the blobs derive, by
some serialization or another, from hash tables. (bag-of-words feature
vectors, for the entertained.) The size of this is less bounded, so
I'm inclined to assume that I need to use a read-write persistence
mechanism from the start. However, a lot of it will fit into memory.
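For illustration, one possible "serialization or another" for a
bag-of-words feature vector is a length-prefixed list of (word, count)
entries. The Java below is only a sketch under that assumption; the
FeatureVectorCodec name and the wire format are made up here, and any
other stable encoding would serve equally well as the blob value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class FeatureVectorCodec {
    /** Encodes word counts as: entry count, then (UTF string, int) per entry. */
    static byte[] toBlob(Map<String, Integer> counts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(counts.size());
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.writeUTF(e.getKey()); // writeUTF caps each string at 65535 encoded bytes
                out.writeInt(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a blob produced by toBlob back into a word-count map. */
    static Map<String, Integer> fromBlob(byte[] blob) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob))) {
            int size = in.readInt();
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < size; i++) {
                String word = in.readUTF();
                counts.put(word, in.readInt());
            }
            return counts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```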

Theory 1: use EHCache or something like it.
Theory 2: having it in memory in the Cassandra server is nearly as
good as having it in memory in my JVM, since Thrift is thrifty.
Theory 3: I've seen some blogs from a while back about embedding
Cassandra. I'm not clear on the current viability of this, or of the
efficiency thereof.
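To make Theory 1 concrete, here is a minimal read-through LRU cache built
on java.util.LinkedHashMap rather than on EHCache itself. The
ReadThroughCache name and the loader function, which stands in for a read
from the backing store, are assumptions for the sake of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded in-JVM cache that falls back to a loader (e.g. a store read) on a miss. */
final class ReadThroughCache<K, V> {
    private final Map<K, V> lru;
    private final Function<K, V> loader;

    ReadThroughCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once over capacity
            }
        };
    }

    /** Returns the cached value, loading and caching it on a miss. */
    synchronized V get(K key) {
        V value = lru.get(key);
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                lru.put(key, value);
            }
        }
        return value;
    }

    synchronized int size() {
        return lru.size();
    }
}
```

The same wrapper shape applies to Theory 2 as well: there the loader would
be a network call to the Cassandra server, and the question becomes whether
that call is cheap enough to skip the local cache altogether.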

So, there you have it. Am I on the right mailing list at all, or have
I wandered, as it were, into the wrong sort of bar?
